Libraries

Libraries are imported by enclosing the library name in angle brackets.

/// main.csl

const math = @import_module("<math>");

fn distance(x0 : f16, y0 : f16, x1 : f16, y1 : f16) f16 {
  return math.sqrt((x0-x1)*(x0-x1) + (y0-y1)*(y0-y1));
}

<complex>

The complex library provides structs containing real and imag components and basic complex functions.

complex is a generic struct parameterized by its field type. The complex_32 and complex_64 non-generic names are also provided; these define a complex number using two f16 values and a complex number using two f32 values, respectively.

get_complex is a generic constructor that returns a complex struct based on the type of its inputs. The non-generic get_complex_32 and get_complex_64 constructor functions are provided as well:

// Returns struct {real: T, imag: T} where T can be f16 or f32
fn complex(comptime T: type) type
const complex_32 = complex(f16); // struct {real: f16, imag: f16}
const complex_64 = complex(f32); // struct {real: f32, imag: f32}

// Can operate on f16 or f32
fn get_complex(r: anytype, i: @type_of(r)) complex(@type_of(r))
fn get_complex_32(r : f16, i : f16) complex_32
fn get_complex_64(r : f32, i : f32) complex_64

The following functions are provided for operating on complex numbers. They are written as generic functions to facilitate use in other libraries or abstractions. In addition, non-generic complex_32 and complex_64 functions are provided. These functions have names suffixed with _32 and _64, respectively.

// x, y can be complex_32 or complex_64
fn add_complex(x: anytype, y: @type_of(x)) @type_of(x)
fn subtract_complex(x: anytype, y: @type_of(x)) @type_of(x)
fn multiply_complex(x: anytype, y: @type_of(x)) @type_of(x)
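
For reference, the arithmetic behind these three functions can be modelled in a few lines of Python. This is a sketch of standard complex arithmetic on a {real, imag} pair, not the library source. The Python Complex class is a hypothetical stand-in for the CSL struct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complex:
    """Hypothetical Python stand-in for struct {real: T, imag: T}."""
    real: float
    imag: float


def add_complex(x: Complex, y: Complex) -> Complex:
    return Complex(x.real + y.real, x.imag + y.imag)


def subtract_complex(x: Complex, y: Complex) -> Complex:
    return Complex(x.real - y.real, x.imag - y.imag)


def multiply_complex(x: Complex, y: Complex) -> Complex:
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    return Complex(x.real * y.real - x.imag * y.imag,
                   x.real * y.imag + x.imag * y.real)


product = multiply_complex(Complex(1.0, 2.0), Complex(3.0, 4.0))
# product == Complex(real=-5.0, imag=10.0)
```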

<debug>

The debug library provides a tracing mechanism to record tagged values.

// Record values of the specified type
fn trace_bool(x : bool) void
fn trace_u8(x : u8) void
fn trace_i8(x : i8) void
fn trace_u16(x : u16) void
fn trace_i16(x : i16) void
fn trace_f16(x : f16) void
fn trace_u32(x : u32) void
fn trace_i32(x : i32) void
fn trace_f32(x : f32) void

// Record a compile-time string
fn trace_string(comptime str : comptime_string) void

// Generic version
fn trace(x : anytype) void

// Record timestamp using the <time> library
fn trace_timestamp() void

// These functions are for internal use, recording raw words
fn tagged_put_u8(tag : u8, x : u8) void
fn put_u16(x : u16) void
fn put_u32(x : u32) void

A minimal example of a PE program recording timestamps and values using an imported instance of the <debug> library:

// pe_program.csl
// When importing an instance of the <debug> module, two things must be
// specified:
//
//   * (key: comptime_string) user-specified key that can be used to
//       retrieve the contents of the trace buffer after execution
//   * (buffer_size: comptime_int) size of buffer for recording traces
const trace = @import_module(
  "<debug>",
  .{ .key = "debug_example",
     .buffer_size = 100,
   }
);

var global : i16 = 0;

task main_task() void {
  // Record timestamp for beginning of task
  trace.trace_timestamp();

  // Record a compile-time string
  trace.trace_string("Hello, world");

  // Update global variable and record
  global = 5;
  trace.trace_i16(global);

  // Record timestamp for end of task
  trace.trace_timestamp();
}

<directions>

The directions library provides utility functions for manipulating directions.

fn rotate_clockwise(d : direction) direction
fn rotate_counterclockwise(d : direction) direction
fn flip_vertical(d : direction) direction
fn flip_horizontal(d : direction) direction
fn flip(d : direction) direction

<empty>

This library is intentionally empty, which allows a conditional module import such as:

const cache = @import_module(if (stage == 0) "<empty>" else cache_name);

<layout>

This library provides access to information about where the PE is located. Specifically, the x and y coordinates in the rectangle can be accessed at runtime, allowing code to be shared between PEs at different locations.

const layout_module = @import_module("<layout>");

// Return the 0-indexed x and y coord
// Only supported on WSE-2
layout_module.get_x_coord() u16;
layout_module.get_y_coord() u16;

<malloc>

The malloc library implements an arena allocator using a statically allocated buffer.

In arena allocators, a single buffer (arena) is used to ensure that all objects are allocated sequentially in memory. Allocating and deallocating memory are fast operations, requiring an addition and/or assignment. The free operation frees all allocated objects at once.

The parameter buffer_num_words specifies the number of words of the statically allocated buffer.

If the param asserts_enabled is true, all allocations assert that the buffer has enough free memory. The default is false.

// specify buffer size
const mem = @import_module("<malloc>", .{ .buffer_num_words = <buffer_size> });

// mem provides the following API:

// returns pointer to num_values of type T
fn malloc(comptime T: type, num_values: u16) [*]T

// non-generic versions, return pointer to num_values of corresponding type
fn malloc_i16(num_values:u16) [*]i16
fn malloc_u16(num_values:u16) [*]u16
fn malloc_f16(num_values:u16) [*]f16
fn malloc_i32(num_values:u16) [*]i32
fn malloc_u32(num_values:u16) [*]u32
fn malloc_f32(num_values:u16) [*]f32

// returns true if an allocation of num_values elements of type T would
// succeed
fn has_enough_space(comptime T: type, num_values: u16) bool

// non-generic versions
// returns true if an allocation of num_words elements of type
// i16, u16, f16 would succeed:
fn has_enough_words(num_words:u16) bool
// returns true if an allocation of num_double_words elements of type
// i32, u32, f32 would succeed:
fn has_enough_double_words(num_double_words:u16) bool

// frees the entire buffer
fn free() void;
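
The arena behaviour described in this section can be modelled in Python as a bump pointer over a fixed buffer. This is a sketch under assumptions, not the library implementation: the Arena class and its words_per_value argument are hypothetical, a word is taken to be 16 bits (so i32, u32 and f32 values occupy two words), and no alignment padding is modelled.

```python
class Arena:
    """Model of an arena allocator over a fixed buffer of 16-bit words."""

    def __init__(self, buffer_num_words: int, asserts_enabled: bool = False):
        self.buffer_num_words = buffer_num_words
        self.asserts_enabled = asserts_enabled
        self.next_free = 0  # index of the first unallocated word

    def has_enough_words(self, num_words: int) -> bool:
        return self.next_free + num_words <= self.buffer_num_words

    def malloc(self, words_per_value: int, num_values: int) -> int:
        """Reserve space and return the word offset of the new allocation."""
        num_words = words_per_value * num_values
        if self.asserts_enabled:
            assert self.has_enough_words(num_words), "arena exhausted"
        start = self.next_free
        self.next_free += num_words  # allocation is a single addition
        return start

    def free(self) -> None:
        self.next_free = 0  # releases every allocation at once


arena = Arena(buffer_num_words=8, asserts_enabled=True)
a = arena.malloc(1, 4)  # four 16-bit values, placed at offset 0
b = arena.malloc(2, 2)  # two 32-bit values, placed at offset 4
full = not arena.has_enough_words(1)  # the buffer is now exhausted
arena.free()
```

Because allocations are sequential, there is no per-object bookkeeping, which is why individual objects cannot be freed.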

<math>

The math library functions are named using the convention <operationName>_<principalType>(). So for example the sin function over f32 values has the name sin_f32.

Math constants

The following can be used anywhere a floating point number is needed.

const PI : comptime_float
const E : comptime_float

Math functions

The <math> library provides standard mathematical functions. They are written as generic functions to facilitate use in other libraries or abstractions. In addition, non-generic f16 and f32 functions are provided. These functions have names suffixed with _f16 and _f32, respectively.

The following functions are provided:

// T can be f16 or f32
fn POSITIVE_INF(comptime T: type) T
fn NEGATIVE_INF(comptime T: type) T
fn NaN(comptime T: type) T

// x can be f16, f32, i8, i16, i32, i64,
// u8, u16, u32, or u64
fn abs(x: anytype) @type_of(x)
fn max(x: anytype, y: @type_of(x)) @type_of(x)
fn min(x: anytype, y: @type_of(x)) @type_of(x)
fn sign(x: anytype) @type_of(x)

// x can be f16 or f32
fn ceil(x: anytype) @type_of(x)
fn cos(x: anytype) @type_of(x)
fn exp(x: anytype) @type_of(x)
fn floor(x: anytype) @type_of(x)
fn fscale(f: anytype, s: i16) @type_of(f)
fn inv(x: anytype) @type_of(x)
fn invsqrt(x: anytype) @type_of(x)
fn isNaN(x: anytype) bool
fn isInf(x: anytype) bool
fn isFinite(x: anytype) bool
fn isSignaling(x: anytype) bool
fn log(x: anytype) @type_of(x)
fn pow(x: anytype, y: @type_of(x)) @type_of(x)
fn sig(x: anytype) @type_of(x)
fn signbit(x: anytype) bool
fn sin(x: anytype) @type_of(x)
fn sqrt(x: anytype) @type_of(x)
fn tanh(x: anytype) @type_of(x)

Example

const math = @import_module("<math>");

var x: f16;

task t() void {
  if (!math.isFinite(x)) {
    x = 0.0;
  }
  var one = math.pow(math.sin(x), 2.0) + math.pow(math.cos(x), 2.0);
  if (math.abs(math.log(one)) > 0.001) {
    x = math.NaN(f16);
  }
}

The same code can be written using non-generic functions:

task t() void {
  if (!math.isFinite_f16(x)) {
    x = 0.0;
  }
  var one = math.pow_f16(math.sin_f16(x), 2.0) +
    math.pow_f16(math.cos_f16(x), 2.0);
  if (math.abs_f16(math.log_f16(one)) > 0.001) {
    x = math.NaN_f16;
  }
}

Note on sin and cos accuracy

Both f16 and f32 versions of sin and cos will produce incorrect results when abs(x) ≥ 16384π (approximately 51472).
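
The threshold can be checked with a short Python calculation. The helper in_accurate_range is a hypothetical name used only for illustration, for example to guard inputs before they are sent to a PE.

```python
import math

# Threshold stated in the note: results are incorrect once abs(x) >= 16384 * pi.
SIN_COS_LIMIT = 16384 * math.pi


def in_accurate_range(x: float) -> bool:
    """True if abs(x) is below the stated sin/cos accuracy limit."""
    return abs(x) < SIN_COS_LIMIT


limit_rounded = round(SIN_COS_LIMIT)  # 51472
```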

<random>

The random library provides utility functions that wrap the @random16 builtin to create random values across various ranges and distributions.

See @random16 for information on the PRNG used by these functions.

// sets the global state of the PRNG to `seed`
fn set_global_prng_seed(seed: u32) void

// generate a random f16 value in the range [lower, upper)
fn random_f16(lower: f16, upper: f16) f16

// generate a random f32 value in the range [lower, upper)
fn random_f32(lower: f32, upper: f32) f32

// generate a uniform random number in the range [0, 2^pow)
fn random_pow_u32(pow : u16) u32

// generate a normally distributed number using the Box-Muller transform
fn random_normal_f32() f32
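
For readers unfamiliar with the Box-Muller transform mentioned above, the following Python sketch shows its standard form: two uniform samples are mapped to one standard-normal sample. It illustrates the technique only. How the library derives its uniform inputs from @random16 is not shown here, and Python's random module stands in for the PRNG.

```python
import math
import random


def box_muller(rng: random.Random) -> float:
    """Return one standard-normal sample built from two uniform samples."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is always defined
    u2 = rng.random()        # in [0, 1)
    radius = math.sqrt(-2.0 * math.log(u1))
    return radius * math.cos(2.0 * math.pi * u2)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [box_muller(rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
# mean should be close to 0 and variance close to 1
```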

<tile_config>

The tile_config library contains APIs relating to the hardware configuration of a PE. It contains the following top-level constants:

// The base addresses of memory-mapped registers
const addresses: enum(reg_type)
// The type of a word-sized memory-mapped register
const reg_type: type
const reg_ptr: type = *reg_type
// The type of a memory-mapped register occupying two words
const double_reg_type: type
// The name of the target architecture, such as "wse2"
const target_name: comptime_string
// The size of a word in bytes
const word_size: comptime_int

The tile_config library also contains an API to access the PE’s coordinates in the rectangle at runtime.

const fabric_coord = enum(reg_type) {
  X,
  Y
};
fn get_fabric_coord(dimension: fabric_coord) u16

filters

This submodule of tile_config contains APIs for configuring filters:

// The number of filters provided by the architecture.
const num_filters: comptime_int
// Set the active limit of a counter filter identified by `filter_id`
// to `limit`.
fn set_active_limit(filter_id: u16, limit: reg_type) void
// Set the maximum counter value of a counter filter identified by
// `filter_id` to `max_counter`.
fn set_max_counter(filter_id: u16, max_counter: reg_type) void
// Set the counter value of a counter filter identified by `filter_id`
// to `counter`.
fn set_counter(filter_id: u16, counter: reg_type) void

These functions can be used like:

const config = @import_module("<tile_config>");
// Set the counter of filter ID 1 to 0
config.filters.set_counter(1, 0);

teardown

This submodule of tile_config contains teardown APIs:

// Returns the task ID that is reserved for the teardown handler.
fn get_task_id() local_task_id
// Return the values of the "teardown-pending" registers combined into
// one value. Only the first invocation of this function per-task is
// guaranteed to return the correct value. Any additional calls per-task
// will have undefined results.
fn get_pending() double_reg_type
// Given a value that represents the "teardown-pending" state, which has 1
// bit per routable color indicating the ones that are currently in
// teardown state, return `true` iff the input color `c` is in teardown
// state.
fn is_pending(value: double_reg_type, c: color) bool
// Exit the teardown state for a given color `c`.
fn exit(c: color) void

These functions can be used like:

const config = @import_module("<tile_config>");
// Check if teardown is pending on color 8 or 9
var pendings = config.teardown.get_pending();
const pending_8_or_9: bool =
    config.teardown.is_pending(pendings, @get_color(8)) or
    config.teardown.is_pending(pendings, @get_color(9));
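
The bit test that is_pending describes can be modelled in Python. The sketch assumes that bit n of the combined value corresponds to color n. The text above only states that there is one bit per routable color, so the exact bit layout is an assumption.

```python
def is_pending(value: int, color_id: int) -> bool:
    """Model of the teardown-pending test for one color.

    Assumption: bit `color_id` of `value` is set when that color is in
    teardown state.
    """
    return bool((value >> color_id) & 1)


pendings = 1 << 8  # hypothetical snapshot: only color 8 is in teardown
pending_8_or_9 = is_pending(pendings, 8) or is_pending(pendings, 9)
```

Taking one snapshot and testing it several times mirrors the guidance above that only the first get_pending call per task is guaranteed to be correct.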

task_priority

This submodule of tile_config contains APIs for configuring task priority:

// Enum for task priorities: either HIGH or LOW.
const level = enum(u16) {
  LOW = 0,
  HIGH = 1
};

// Updates the task priority associated with `task_id` to `priority`.
fn update_task_priority(task_id: anytype, priority: level) void
// Sets the task priority associated with `task_id` to high.
fn set_task_priority(task_id: anytype) void
// Sets the task priority associated with `task_id` to low.
fn clear_task_priority(task_id: anytype) void

The provided task_id can be a data_task_id or local_task_id to set the priority of the associated task.

In addition, the priority of tasks activated by wavelets, including tasks bound to a control_task_id, can be specified using the color on WSE-2, or the input_queue on WSE-3, that carries the wavelets.

Note that updates to task priority made at runtime may take a few clock cycles to take effect. These functions may be used at comptime or at runtime.

These functions can be used like:

const config = @import_module("<tile_config>");
const task_priority = config.task_priority;
const task_priority_level = task_priority.level;

param high_id: data_task_id;
param low_id: local_task_id;

comptime {
  // Equivalent to:
  //   task_priority.update_task_priority(
  //     high_id,
  //     task_priority_level.HIGH);
  task_priority.set_task_priority(high_id);
}

task main() void {
  // Equivalent to:
  //   task_priority.update_task_priority(
  //     low_id,
  //     task_priority_level.LOW);
  task_priority.clear_task_priority(low_id);
}

main_thread_priority

This submodule of tile_config contains APIs for configuring main thread priority. The main thread is the thread that executes non-async operations. Operations tagged with async execute on a microthread, which is associated with a fabric input or output queue. Main thread priority and microthread priority determine the relative scheduling priority of the threads.

// Enum for main thread priorities. The meanings of main thread priority
// levels are relative to microthread priorities, as follows:
//
//   MEDIUM_LOW: Between low- and medium-priority microthreads.
//   MEDIUM: Same priority as medium-priority microthreads.
//   MEDIUM_HIGH: Between medium- and high-priority microthreads.
//   HIGH: Same priority as high-priority microthreads.
const level = enum(u16) {
  MEDIUM_LOW = ...,
  MEDIUM = ...,
  MEDIUM_HIGH = ...,
  HIGH = ...
};

// Updates the priority for the main thread to `priority`. Note that updates
// to main thread priority made at runtime may take a few clock cycles to
// take effect. This function may be used at comptime or at runtime.
fn update_main_thread_priority(priority: level) void;

This function can be used like:

const config = @import_module("<tile_config>");
const mt_priority = config.main_thread_priority;

comptime {
  mt_priority.update_main_thread_priority(mt_priority.level.MEDIUM);
}

task main() void {
  mt_priority.update_main_thread_priority(mt_priority.level.MEDIUM_HIGH);
}

control_transform

This submodule of tile_config contains a function for setting the mask for transforming the index part of control wavelets. This function is to be used together with the DSD property control_transform to XOR the first six bits of the index portion of a wavelet with the specified mask.

fn set_mask(mask: reg_type) void

This function can be used like:

const tile_config = @import_module("<tile_config>");
const ctrl_xform = tile_config.control_transform;

var in_dsd = @get_dsd(fabin_dsd, .{ .fabric_color = recv_channel,
                                    .extent = 100,
                                    .input_queue = @get_input_queue(0),
                                    .control_transform = true });
const out_dsd = @get_dsd(fabout_dsd, .{ .extent = 100,
                                        .fabric_color = send_channel,
                                        .output_queue = @get_output_queue(1),
                                        .control_transform = true });

var buf = @zeros([5]u32);
const fifo = @allocate_fifo(buf);

task buffer() void {
  @mov32(fifo, in_dsd, .{ .async = true });
  @mov32(out_dsd, fifo, .{ .async = true });
}

comptime {
  ctrl_xform.set_mask(2);
}

The set_mask function can be used either at comptime or runtime. Only the first six bits of the mask are taken into account.
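
The effect of the mask on a wavelet index can be modelled in Python as follows. The sketch assumes that "the first six bits" means the six least-significant bits; transform_index is a hypothetical name, not part of the library.

```python
SIX_BIT_MASK = 0x3F  # only six bits of the mask take effect


def transform_index(index: int, mask: int) -> int:
    """XOR the low six bits of `index` with the low six bits of `mask`."""
    return index ^ (mask & SIX_BIT_MASK)


flipped = transform_index(0b000101, 2)     # bit 1 is toggled: 0b000111
ignored_high = transform_index(5, 0x42)    # 0x42 & 0x3F == 0x02, same result
restored = transform_index(flipped, 2)     # XOR with the same mask undoes it
```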

exceptions

This submodule of tile_config contains functions for setting values in the exception mask register. The exception mask register determines which exceptions cause the processor to stop. An unmasked exception causes the processor to immediately stop execution. A masked exception allows execution to continue. By default, all exceptions are masked. The functions in this submodule can be used to unmask them.

// Exceptions which can be unmasked with the below functions
const PERF_CNT_0_OVERFLOW = ...;
const PERF_CNT_1_OVERFLOW = ...;
const SW_EXCEPTION = ...;
const FP_UNDERFLOW = ...;
const FP_OVERFLOW = ...;
const FP_INEXACT = ...;
const FP_INVALID = ...;
const FP_DIV_BY_0 = ...;

// Unmask one or more of the exceptions above at comptime
fn set_exception_mask_comptime(comptime exception_mask: reg_type) void;

// Unmask one or more of the exceptions above at runtime
fn set_exception_mask(exception_mask: reg_type) void;

This submodule can be used as follows:

const tile_config = @import_module("<tile_config>");
const exceptions = tile_config.exceptions;

fn fp_div_by_0() f32 {
  // Set exception mask for FP_DIV_BY_0.
  // When floating point divide by zero occurs,
  // processor will stop execution.
  exceptions.set_exception_mask(exceptions.FP_DIV_BY_0);

  var x : f32 = 42.0;
  var y : f32 = 0.0;

  // This operation is a divide by zero, so the processor stops execution
  return x / y;
}

Each call to set_exception_mask overwrites the exception mask register. Multiple exceptions can be unmasked simultaneously as follows:

// FP_DIV_BY_0 is unmasked.
exceptions.set_exception_mask(exceptions.FP_DIV_BY_0);

// FP_DIV_BY_0 is masked again.
// FP_OVERFLOW and FP_UNDERFLOW are now unmasked.
exceptions.set_exception_mask(exceptions.FP_OVERFLOW
                            | exceptions.FP_UNDERFLOW);
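
The overwrite semantics can be modelled in Python. The sketch rests on two assumptions: each exception constant is a distinct single-bit flag (the real values are elided above, so the bit positions below are hypothetical), and a set bit in the register means the exception is unmasked. Under those assumptions, a mask covering several exceptions is the bitwise OR of their flags.

```python
# Hypothetical bit positions; the real values are elided ("...") in the listing.
FP_UNDERFLOW = 1 << 0
FP_OVERFLOW = 1 << 1
FP_DIV_BY_0 = 1 << 2


class ExceptionMaskRegister:
    """Model of a register in which a set bit marks an unmasked exception."""

    def __init__(self) -> None:
        self.value = 0  # by default every exception is masked

    def set_exception_mask(self, exception_mask: int) -> None:
        # Each call overwrites the register; nothing is accumulated.
        self.value = exception_mask

    def is_unmasked(self, flag: int) -> bool:
        return bool(self.value & flag)


reg = ExceptionMaskRegister()
reg.set_exception_mask(FP_DIV_BY_0)
reg.set_exception_mask(FP_OVERFLOW | FP_UNDERFLOW)  # FP_DIV_BY_0 is masked again
```

Note that the bitwise AND of two distinct single-bit flags is zero, which would unmask nothing.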

color_config

This submodule of tile_config contains APIs and an enum type for changing the configuration of a given color during a teardown phase.

The color_config submodule defines the following enum type:

const fabric_io = enum(u16) {
  TX_WEST,
  TX_EAST,
  TX_SOUTH,
  TX_NORTH,
  TX_RAMP,

  RX_WEST,
  RX_EAST,
  RX_SOUTH,
  RX_NORTH,
  RX_RAMP
};

This enum lists every input and output routing direction. Its values select the direction to modify.

The color_config submodule provides the following functions:

// Returns the word address of a color configuration that corresponds to
// color `c`.
fn get_color_config_addr(c: color) reg_type
// Enables `dir` direction for a color `c` or a word address `c`
// of a color configuration register.
fn set_io_direction(c: anytype, dir: fabric_io) void
// For a given color `c` or a word address `c` of a color configuration
// register, toggle the I/O direction `dir`.
fn toggle_io_direction(c: anytype, dir: fabric_io) void
// For a given color `c` or a word address `c` of a color configuration
// register, clear the setting for `dir`.
fn clear_io_direction(c: anytype, dir: fabric_io) void
// For a given color `c` or a word address `c` of a color configuration
// register, clear all I/O routes and reset them according to `new_routes`.
fn reset_routes(c: anytype, comptime new_routes: comptime_struct) void

These functions can be used as follows:

param red: color;
var blue: color;
const tile_config = @import_module("<tile_config>");
const color_config = tile_config.color_config;
const red_addr = color_config.get_color_config_addr(red);

task teardown() void {
  // Color `blue` is a `var` and therefore not known until runtime.
  const addr = color_config.get_color_config_addr(blue);

  // We can manipulate the I/O routing configuration for a
  // comptime-known color, a runtime color or an address to a
  // color configuration register.
  const offramp = color_config.fabric_io.TX_RAMP;
  const onramp = color_config.fabric_io.RX_RAMP;
  const rx_north = color_config.fabric_io.RX_NORTH;
  color_config.set_io_direction(red, offramp);
  color_config.set_io_direction(blue, onramp);
  color_config.set_io_direction(addr, rx_north);

  // Reset the routes for color `red`.
  color_config.reset_routes(red, .{.tx = NORTH, .rx = RAMP});
}

switch_config

This submodule of tile_config contains APIs and enum types that can be used to change the switch configuration of a given color during a teardown phase.

The switch_config submodule defines the following enum types:

const pop_mode = enum(u16) {
  NO_POP,
  ALWAYS_POP,
  POP_ON_ADVANCE,

  CLEAR_MASK
};

const switch_status = enum(u16) {
  CLEAR_CURRENT_POS
};

These enum types represent setting categories, such as pop mode and switch status, and select which category of setting to modify.

In addition, the switch_config submodule provides the following functions:

// Returns the word address of a given color `c` and switch setting type
// `setting_ty`.
fn get_switch_config_addr(c: color, comptime setting_ty: type) reg_type
// For a given color `c` or a word address `c` of a color configuration
// register, clear its current switch position.
fn clear_current_position(c: anytype) void
// For a given color `c` or a word address `c` of a color configuration
// register, set the given pop mode `mode` after clearing the previous one.
fn set_pop_mode(c: anytype, mode: pop_mode) void

These functions can be used as follows:

param red: color;
var blue: color;
const tile_config = @import_module("<tile_config>");
const switch_config = tile_config.switch_config;
const switch_status_addr =
      switch_config.get_switch_config_addr(red,
                                          switch_config.switch_status);

task teardown() void {
  // Color `blue` is a `var` and therefore not known until runtime.
  const addr = switch_config.get_switch_config_addr(blue,
                                                    switch_config.pop_mode);

  // We can manipulate the switch configuration using a
  // comptime-known color, a runtime color or an address to
  // a color configuration register.
  switch_config.clear_current_position(red);
  switch_config.clear_current_position(blue);
  switch_config.clear_current_position(switch_status_addr);

  // Reset the pop mode to a new setting.
  switch_config.set_pop_mode(red, switch_config.pop_mode.ALWAYS_POP);
}

<time>

The time library returns the current 48-bit timestamp counter as three 16-bit unsigned integers in little endian form.

// enable tsc registers for capturing timestamps
fn enable_tsc() void;

// disable tsc registers for capturing timestamps
fn disable_tsc() void;

// write timestamp to array of three u16 values
fn get_timestamp(result : *[3]u16) void;

// reset tsc register to 0
fn reset_tsc_counter() void;
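
Once the three words have been copied back to the host, they can be combined into a single integer. The Python sketch below shows one way to do this, with the least-significant word first as described above. The function names are hypothetical and are not part of any SDK API.

```python
def timestamp_to_int(words) -> int:
    """Combine three 16-bit words (least-significant first) into a 48-bit int."""
    if len(words) != 3:
        raise ValueError("expected exactly three 16-bit words")
    value = 0
    for i, word in enumerate(words):
        if not 0 <= word <= 0xFFFF:
            raise ValueError("each word must fit in 16 bits")
        value |= word << (16 * i)
    return value


def elapsed_cycles(start, end) -> int:
    """Difference between two timestamps, tolerating one 48-bit wrap-around."""
    return (timestamp_to_int(end) - timestamp_to_int(start)) % (1 << 48)


ts = timestamp_to_int([0x0001, 0x0002, 0x0003])  # 0x000300020001
delta = elapsed_cycles([0xFFFF, 0, 0], [0x0000, 1, 0])  # 1
```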

<kernels>

This library differs from all other libraries in that it provides kernels, as opposed to individual functions. The “tally” kernel implements a two-phase tally, used to coordinate the work done by multiple PEs. It is documented in the kernel code itself.

<tally>

The tally library implements a two-phase tally kernel that allows PEs within a rectangle to communicate progress/completion to the host.

The library consists of two modules:

  1. <kernels/tally/layout>: imported once and used in the layout block to parameterize each PE’s tally behavior.

  2. <kernels/tally/pe>: imported once by each PE, consuming the parameters generated by the layout module.

A minimal example of importing and using both modules, starting with the layout module:

// code.csl

const tally = @import_module("<kernels/tally/layout>", .{
  .kernel_height=8,
  .kernel_width=4,
  .phase2_tally=0,
  .colors=[3]color{@get_color(1), @get_color(2), @get_color(3)},
  .output_color=@get_color(0),
});

layout {
  @set_rectangle(4, 8);

  for (@range(u16, 4)) |i| {
    for (@range(u16, 8)) |j| {
      @set_tile_code(i, j, "pe.csl", .{
        .tally_params = tally.get_params(i, j),
      });
    }
  }
}

And the per-PE module:

// pe.csl

param tally_params: comptime_struct;

// On WSE-2, input_queues and output_queues can be the same.
// On WSE-3, they must be different.
const tally = @import_module("<kernels/tally/pe>",
  @concat_structs(tally_params, .{
    .input_queues=[2]u16{0, 1},
    .output_queues=[2]u16{0, 1},
  }));

task done() void {
  tally.signal_completion();
}

...

The tally kernel operates in two phases.

In the first phase, every PE must signal completion at least once. For kernels where each PE knows when it is finished, this is the only phase needed.

The first phase ends when every PE has signaled completion at least once. During the second phase, PEs can bump (increase) the global tally. When the global tally meets or exceeds the phase2_tally parameter, the kernel signals completion by sending the total to the North on output_color from the PE at (kernel_width - 1, 0).

The second phase is optional. If phase2_tally == 0, the second phase will be skipped and the output signal on output_color will be 0.
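
The two phases can be summarized with a small Python state model. It is a sketch of the behaviour described above, not of the kernel's implementation: the TallyModel class and its bump method are hypothetical names, and the model assumes that bumps only count once the first phase has ended.

```python
class TallyModel:
    """Behavioural model of the two-phase tally for a group of PEs."""

    def __init__(self, num_pes: int, phase2_tally: int):
        self.pending = set(range(num_pes))  # PEs that have not signaled yet
        self.phase2_tally = phase2_tally
        self.total = 0
        self.output = None  # value sent on output_color once complete

    def signal_completion(self, pe: int) -> None:
        self.pending.discard(pe)  # repeated signals from one PE are harmless
        self._check()

    def bump(self, amount: int) -> None:
        # Assumption: bumps are only counted during the second phase.
        if not self.pending and self.output is None:
            self.total += amount
            self._check()

    def _check(self) -> None:
        if self.pending or self.output is not None:
            return
        if self.phase2_tally == 0:
            self.output = 0  # second phase skipped
        elif self.total >= self.phase2_tally:
            self.output = self.total


one_phase = TallyModel(num_pes=2, phase2_tally=0)
one_phase.signal_completion(0)
one_phase.signal_completion(1)  # output becomes 0

two_phase = TallyModel(num_pes=2, phase2_tally=5)
two_phase.signal_completion(0)
two_phase.signal_completion(1)
two_phase.bump(3)  # total 3, below the threshold
two_phase.bump(4)  # total 7, meets the threshold
```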

<collectives_2d>

This library implements collective communication directives that allow PEs to communicate data with one another.

The library consists of two modules:

  1. <collectives_2d/params>: Imported once to parameterize each PE in the layout block.

  2. <collectives_2d/pe>: Imported once per dimension per PE. Contains collective communication directives for a single axis.

<collectives_2d/params>

The parameter module exposes a compile-time helper function for configuring PEs to use <collectives_2d>:

fn get_params(Px: u16, Py: u16, ids: comptime_struct) comptime_struct
  • Px is the PE’s x-coordinate.

  • Py is the PE’s y-coordinate.

  • ids is a struct that is expected to have either the x-related fields, the y-related fields, or all four, of the following:

    • x_colors: a struct containing 2 distinct colors as anonymous fields

    • x_entrypoints: a struct containing 2 distinct local task IDs as anonymous fields

    • y_colors: a struct containing 2 distinct colors as anonymous fields

    • y_entrypoints: a struct containing 2 distinct local task IDs as anonymous fields

  • Returns a struct containing the parameters necessary to import library modules for the specified PE. This struct contains:

    • x: an opaque struct containing parameters needed to configure collective communications in the x-dimension.

    • y: an opaque struct containing parameters needed to configure collective communications in the y-dimension.

<collectives_2d/pe>

The following directives are currently supported:

fn init() void
fn broadcast(root: u16, buf: [*]u32, count: u16, callback: local_task_id) void
fn scatter(root: u16, send_buf: [*]u32, recv_buf: [*]u32, count: u16,
           callback: local_task_id) void
fn gather(root: u16, send_buf: [*]u32, recv_buf: [*]u32, count: u16,
          callback: local_task_id) void
fn reduce_fadds(root: u16, send_buf: [*]f32, recv_buf: [*]f32, count: u16,
                callback: local_task_id) void

init initializes the library. It must be invoked for each axis.

broadcast transmits the contents of buf from the root PE to the buf of other PEs in the row or column. count should be the length of buf. It is akin to MPI_Bcast.

scatter transmits count-many elements from send_buf from the root PE to the recv_buf of other PEs in the row/column. It is akin to MPI_Scatter.

gather accumulates count-many elements from send_buf of other PEs into the recv_buf of the root PE. It is akin to MPI_Gather.

When distributing or aggregating elements using scatter or gather for N PEs, the send_buf or recv_buf should have space for count * N elements, respectively.
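
The buffer sizing rule can be illustrated with a Python model of the data movement. It follows the MPI_Scatter and MPI_Gather convention in which chunk i belongs to PE i and the root takes part like any other PE. That ordering is an assumption made here for illustration, and the functions below model data layout only, not the fabric communication.

```python
def scatter(send_buf, num_pes: int, count: int):
    """Split the root's send_buf into one chunk of `count` elements per PE."""
    if len(send_buf) != count * num_pes:
        raise ValueError("send_buf must hold count * N elements")
    return [send_buf[pe * count:(pe + 1) * count] for pe in range(num_pes)]


def gather(send_bufs, count: int):
    """Concatenate `count` elements from each PE into the root's recv_buf."""
    recv_buf = []
    for buf in send_bufs:
        recv_buf.extend(buf[:count])
    return recv_buf


chunks = scatter([0, 1, 2, 3, 4, 5], num_pes=3, count=2)
# chunks == [[0, 1], [2, 3], [4, 5]]
collected = gather(chunks, count=2)
# collected == [0, 1, 2, 3, 4, 5]
```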

reduce_fadds performs a sum reduction over buffers of f32. It is akin to MPI_Reduce with the MPI_SUM operation.

In general, all PEs must call the same directive with the same root and count. The primitives have the following common parameters:

  • root is the root PE for network configuration,

  • send_buf is a buffer containing data to be transmitted,

  • recv_buf is a buffer for holding data received,

  • count is the number of elements to be transmitted,

  • callback is activated when the primitive completes.

The user can configure the resources of collectives_2d. Each imported module must be assigned queue IDs (queues) and DSR IDs (dest_dsr_ids, src0_dsr_ids, src1_dsr_ids). If the user does not specify these parameters explicitly, the default values apply. The following example shows the default values of queue IDs and DSR IDs of collectives_2d.

A minimal example that sets up PEs to broadcast 10 elements from the root PE to every other PE in the row/column consists of the following layout code:

// code.csl

param width: u16;
param height: u16;
param root: u16;

const c2d = @import_module("<collectives_2d/params>");

layout {
  @set_rectangle(width, height);

  var x: u16 = 0;
  while (x < width) : (x += 1) {
    var y: u16 = 0;
    while (y < height) : (y += 1) {
      const c2d_params = c2d.get_params(x, y, .{
        .x_colors = .{
          @get_color(0),
          @get_color(1)
        },
        .x_entrypoints = .{
          @get_local_task_id(2),
          @get_local_task_id(3)
        },
        .y_colors = .{
          @get_color(4),
          @get_color(5)
        },
        .y_entrypoints = .{
          @get_local_task_id(6),
          @get_local_task_id(7)
        },
      });
      @set_tile_code(
        x,
        y,
        "pe.csl",
        .{ .root = root, .c2d_params = c2d_params }
      );
    }
  }
}

And the per-PE module:

// pe.csl

param c2d_params: comptime_struct;
param root: u16;

const rect_height = @get_rectangle().height;
const rect_width = @get_rectangle().width;

// Pick two task IDs not used in the library for callbacks
const x_task_id = @get_local_task_id(15);
const y_task_id = @get_local_task_id(16);

const len = 10;
var x_data = @zeros([len]u32);
var y_data = @zeros([len]u32);

const mpi_x = @import_module(
  "<collectives_2d/pe>",
  .{ .dim_params = c2d_params.x,
     .queues = [2]u16{2,4},
     .dest_dsr_ids = [1]u16{1},
     .src0_dsr_ids = [1]u16{1},
     .src1_dsr_ids = [1]u16{1}
   }
);

const mpi_y = @import_module(
  "<collectives_2d/pe>",
  .{ .dim_params = c2d_params.y,
     .queues = [2]u16{3,5},
     .dest_dsr_ids = [1]u16{2},
     .src0_dsr_ids = [1]u16{2},
     .src1_dsr_ids = [1]u16{2}
   }
);

task x_task() void {
  var send_buf = @ptrcast([*]u32, &x_data);
  var recv_buf = @ptrcast([*]u32, &@zeros([len]u32));

  if (root == mpi_x.pe_id) {
    mpi_x.broadcast(root, send_buf, len, x_task_id);
  } else {
    mpi_x.broadcast(root, recv_buf, len, x_task_id);
  }
}

task y_task() void {
  var send_buf = @ptrcast([*]u32, &y_data);
  var recv_buf = @ptrcast([*]u32, &@zeros([len]u32));

  if (root == mpi_y.pe_id) {
    mpi_y.broadcast(root, send_buf, len, y_task_id);
  } else {
    mpi_y.broadcast(root, recv_buf, len, y_task_id);
  }
}