Libraries

Libraries are imported by enclosing the library name in angle brackets.

// main.csl

const math = @import_module("<math>");

fn distance(x0 : f16, y0 : f16, x1 : f16, y1 : f16) f16 {
  return math.sqrt((x0-x1)*(x0-x1) + (y0-y1)*(y0-y1));
}

<complex>

The complex library provides structs containing real and imag components and basic complex functions.

complex is a generic struct parameterized by its field type. The complex_32 and complex_64 non-generic names are also provided; these define a complex number using two f16 values and a complex number using two f32 values, respectively.

get_complex is a generic constructor that returns a complex struct based on the type of its inputs. The non-generic get_complex_32 and get_complex_64 constructor functions are provided as well:

// Returns struct {real: T, imag: T} where T can be f16 or f32
fn complex(comptime T: type) type
const complex_32 = complex(f16); // struct {real: f16, imag: f16}
const complex_64 = complex(f32); // struct {real: f32, imag: f32}

// Can operate on f16 or f32
fn get_complex(r: anytype, i: @type_of(r)) complex(@type_of(r))
fn get_complex_32(r : f16, i : f16) complex_32
fn get_complex_64(r : f32, i : f32) complex_64

The following functions are provided for operating on complex numbers. They are written as generic functions to facilitate use in other libraries or abstractions. In addition, non-generic complex_32 and complex_64 functions are provided. These functions have names suffixed with _32 and _64, respectively.

// x, y can be complex_32 or complex_64
fn add_complex(x: anytype, y: @type_of(x)) @type_of(x)
fn subtract_complex(x: anytype, y: @type_of(x)) @type_of(x)
fn multiply_complex(x: anytype, y: @type_of(x)) @type_of(x)
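
As an illustration, the sketch below combines the generic and non-generic forms listed above. The module alias cplx and the function name product_plus_sum are hypothetical, and the sketch assumes that the struct types exported by the library can be named through the module (cplx.complex_64) in a return type.

```csl
const cplx = @import_module("<complex>");

// Hypothetical helper: computes a*b + a for two fixed complex_64 values.
fn product_plus_sum() cplx.complex_64 {
  // Generic constructor: the field type (f32) is taken from the arguments.
  const a = cplx.get_complex(@as(f32, 1.0), @as(f32, 2.0));
  // Non-generic constructor taking two f32 components.
  const b = cplx.get_complex_64(3.0, -1.0);

  // Mathematically, (1 + 2i)(3 - i) = 5 + 5i.
  const p = cplx.multiply_complex(a, b);
  // Adding a = 1 + 2i gives 6 + 7i.
  return cplx.add_complex(p, a);
}
```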

<control>

The control library provides utilities for constructing control wavelets.

The following functions and enums are provided by the library:

// Max commands that can be encoded in a control wavelet
const MAX_CMDS = 8;

// Enum representing switching opcodes
const opcode = enum(u32) {
  NOP = 0,
  SWITCH_ADV = 1,
  SWITCH_RST = 2,
  TEARDOWN = 3
};

// Encode payload that activates a control task with no argument
fn encode_control_task_payload(entrypoint: control_task_id) u32;

// Encode payload with one switch command, plus control task entrypoint.
fn encode_single_payload(cmd: opcode, ce_ignore: bool,
                         comptime entrypoint: control_task_id, data: u16) u32;

// Encode general control wavelet payload
fn encode_payload(comptime N: u16, comptime cmds: [N]opcode,
                  comptime ce_ignore: [N]bool,
                  ce_ignore_remaining: bool,
                  comptime entrypoint: anytype) u32;

All functions construct a payload returned as a 32-bit unsigned integer which can be sent in a control wavelet.

encode_control_task_payload returns a control wavelet payload which activates a control task on all receiving PEs. It has one argument:

  • entrypoint: a control_task_id which is bound to the control task activated on a CE by the receipt of this wavelet.

encode_single_payload returns a control wavelet payload containing one switch command, along with an optional control task entrypoint with 16-bit data argument. The function has the following arguments:

  • cmd: a switching opcode to be consumed by the receiving PE router. This command will instruct the router to modify the configuration of the color on which the control wavelet is sent. This command can advance the switch position, reset the switch position, teardown the color, or do nothing. If the router of the PE on which the control wavelet is sent pops this command, then no additional receiving PEs will receive a switching opcode.

  • ce_ignore: a boolean which determines whether this control wavelet is to be ignored by the CE of PEs which receive it. If true, the control wavelet will not be forwarded to the CE. If false, and the receiving color is configured to transmit down the RAMP, the control wavelet will be forwarded to the CE. ce_ignore must be false for an entrypoint to be activated by a receiving PE.

  • entrypoint: a control_task_id which is bound to the control task activated on a CE by the receipt of this wavelet. Passing {} indicates that no control task activation on receiving PEs is desired. The control task will only be activated on a CE if ce_ignore is false and the receiving color is configured to transmit down the RAMP.

  • data: The control task activated by entrypoint may take a single 16-bit argument. If the control task takes no argument, then this value will be ignored.

encode_payload can encode a general control wavelet payload with up to eight switching commands. The function has the following arguments:

  • N: number of commands to encode in the control wavelet. Maximum number of commands is eight.

  • cmds: an array of switching opcodes to be consumed by PE routers. Each command will instruct the router to modify the configuration of the color on which the control wavelet is sent. Each command can advance the switch position, reset the switch position, teardown the color, or do nothing. If the router of the PE on which a command is executed pops the command, then the next command will be executed by the next receiving router.

  • ce_ignore: an array of booleans which determines whether this control wavelet is to be ignored by the CE of PEs which receive it. Each ce_ignore value is processed along with the associated cmd, i.e., the same rules for popping commands apply. If the processed value is true, the control wavelet will not be forwarded to the CE. If false, and the receiving color is configured to transmit down the RAMP, the control wavelet will be forwarded to the CE. ce_ignore must be false for an entrypoint to be activated by a receiving PE.

  • ce_ignore_remaining: a boolean which determines whether all other commands contained in this control wavelet are to be ignored by the CE of PEs receiving it. When ce_ignore_remaining is set to false, each unspecified command will travel down the RAMP and reach the CE (as a NOP command).

  • entrypoint: a control_task_id which is bound to the control task activated on a CE by the receipt of this wavelet. Passing {} indicates that no control task activation on receiving PEs is desired. The control task will only be activated on a CE if ce_ignore is false, and the receiving color is configured to transmit down the RAMP. Because this function can encode up to eight switching commands, no data payload can be provided for this control task.

Unlike encode_single_payload, encode_payload does not take a data argument. If a control payload only contains a single switching command, then a 16-bit data argument can be supplied as an argument to the control task activated on receipt of the wavelet. data is not meaningful if there is more than one switching command in the control wavelet, because the bits that would encode data encode the additional switching commands instead.

A control task that declares no arguments will ignore data, and furthermore, data is ignored if the wavelet is not forwarded to the CE (the current command’s ce_ignore value is true).

Example

The task main_task sends out a control wavelet along the color comm, which encodes a control task ID:

const ctrl = @import_module("<control>");

const comm = @get_color(0);
const comm_out_queue = @get_output_queue(2);
const ctrl_entrypt_id = @get_control_task_id(40);

task main_task() void {
  const comm_out_dsd = @get_dsd(fabout_dsd, .{
                         .extent = 1,
                         .fabric_color = comm,
                         .control = true,
                         .output_queue = comm_out_queue,
                       });
  @mov32(comm_out_dsd, ctrl.encode_control_task_payload(ctrl_entrypt_id));
}

PEs which receive this wavelet along the color comm will activate a control task bound to this control task ID. For instance, if the receiving PE has the following code, then upon receipt of the control wavelet, it will activate a task which increments the value my_global:

const my_ctrl_id = @get_control_task_id(40);
var my_global: u32 = 0;

task my_ctrl_task() void {
  my_global += 1;
}

comptime {
  @bind_control_task(my_ctrl_task, my_ctrl_id);
}
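
The other two encoding functions can be used in the same way. The following is a minimal sketch, not taken from the library itself: the task name switch_task, the chosen opcodes, and the data value 7 are illustrative assumptions, and the array literals assume the same [N]T{...} form used for direction arrays elsewhere in this document.

```csl
const ctrl = @import_module("<control>");

const comm = @get_color(0);
const comm_out_queue = @get_output_queue(2);
const ctrl_entrypt_id = @get_control_task_id(40);

// Hypothetical task that sends two control wavelets along `comm`.
task switch_task() void {
  const comm_out_dsd = @get_dsd(fabout_dsd, .{
                         .extent = 1,
                         .fabric_color = comm,
                         .control = true,
                         .output_queue = comm_out_queue,
                       });

  // One switch command (advance the switch position). ce_ignore is false,
  // so a receiving CE can activate the control task with data = 7.
  @mov32(comm_out_dsd, ctrl.encode_single_payload(
      ctrl.opcode.SWITCH_ADV, false, ctrl_entrypt_id, 7));

  // Two switch commands, neither forwarded to the CE, and no control task
  // entrypoint ({}). This form cannot carry a data argument.
  @mov32(comm_out_dsd, ctrl.encode_payload(
      2,
      [2]ctrl.opcode{ ctrl.opcode.SWITCH_ADV, ctrl.opcode.SWITCH_RST },
      [2]bool{ true, true },
      true,
      {}));
}
```

A control task that consumes the 16-bit value from the first wavelet would declare a single u16 argument.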

<debug>

The debug library provides a tracing mechanism to record tagged values.

// Record values of the specified type
fn trace_bool(x : bool) void
fn trace_u8(x : u8) void
fn trace_i8(x : i8) void
fn trace_u16(x : u16) void
fn trace_i16(x : i16) void
fn trace_f16(x : f16) void
fn trace_u32(x : u32) void
fn trace_i32(x : i32) void
fn trace_f32(x : f32) void

// Record a compile-time string
fn trace_string(comptime str : comptime_string) void

// Generic version
fn trace(x : anytype) void

// Record timestamp using the <time> library
fn trace_timestamp() void

// These functions are for internal use, recording raw words
fn tagged_put_u8(tag : u8, x : u8) void
fn put_u16(x : u16) void
fn put_u32(x : u32) void

A minimal example of a PE program recording timestamps and values using an imported instance of the <debug> library:

// pe_program.csl
// When importing an instance of the <debug> module, two things must be
// specified:
//
//   * (key: comptime_string) user-specified key that can be used to
//       retrieve the contents of the trace buffer after execution
//   * (buffer_size: comptime_int) size of buffer for recording traces
const trace = @import_module(
  "<debug>",
  .{ .key = "debug_example",
     .buffer_size = 100,
   }
);

var global : i16 = 0;

task main_task() void {
  // Record timestamp for beginning of task
  trace.trace_timestamp();

  // Record a compile-time string
  trace.trace_string("Hello, world");

  // Update global variable and record
  global = 5;
  trace.trace_i16(global);

  // Record timestamp for end of task
  trace.trace_timestamp();
}

<directions>

The directions library provides utility functions for manipulating directions.

fn rotate_clockwise(d : direction) direction
fn rotate_counterclockwise(d : direction) direction
fn flip_vertical(d : direction) direction
fn flip_horizontal(d : direction) direction
fn flip(d : direction) direction
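
A minimal usage sketch follows. The helper name rotate_then_flip is hypothetical; it only shows how the functions compose and makes no claim about the resulting direction.

```csl
const directions = @import_module("<directions>");

// Hypothetical helper: applies rotate_clockwise to `d`, then flip.
fn rotate_then_flip(d: direction) direction {
  return directions.flip(directions.rotate_clockwise(d));
}
```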

<dsd_ops>

The dsd_ops library provides wrappers around DSD op builtins that select an appropriate builtin depending on an argument indicating the type of the underlying data. These wrappers are guaranteed to expand to a single call to a DSD op builtin. The wrappers may be used with any combination of DSD, DSR, scalar, or pointer-to-scalar operands that is supported by the underlying builtin operation.

Each function operates on a limited set of types. For DSD operations, the programmer must ensure that the specified type accurately reflects the type of the data being accessed in memory or streamed via the DSD.

The final argument, named config, is a configuration struct for the underlying DSD op builtin. See Builtins for more details on the builtins underlying these functions.

Note that the config argument must be completely comptime-known. This means that runtime .activate or .unblock values are not allowed with these wrapper functions. We hope to lift this limitation in a future release.

// Data movement.
//
//   T  Resulting Operation
// ---  -------------------
// f16  @fmovh(dst, src, config)
// f32  @fmovs(dst, src, config)
// i16  @mov16(dst, src, config)
// u16  @mov16(dst, src, config)
// i32  @mov32(dst, src, config)
// u32  @mov32(dst, src, config)
inline fn mov(T: type, dst: anytype, src: anytype,
              comptime config: anytype) bool;

// Conversion between data types.
//
// Tdst  Tsrc  Resulting Operation
// ----  ----  -------------------
//  f16   f32  @fs2h(dst, src, config)
//  f16   i16  @xp162fh(dst, src, config)
//  f16   f16  @fmovh(dst, src, config)
//  f32   f16  @fh2s(dst, src, config)
//  f32   i16  @xp162fs(dst, src, config)
//  f32   f32  @fmovs(dst, src, config)
//  i16   f16  @fh2xp16(dst, src, config)
//  i16   f32  @fs2xp16(dst, src, config)
//  i16   i16  @mov16(dst, src, config)
inline fn convert(Tdst: type, Tsrc: type,
                  dst: anytype, src: anytype, comptime config: anytype) bool;

// Addition.
//
// Tdst  Tsrc  Resulting Operation
// ----  ----  -------------------
//  f16   f16  @faddh(dst, src0, src1, config)
//  f32   f16  @faddhs(dst, src0, src1, config)
//  f32   f32  @fadds(dst, src0, src1, config)
//  i16   i16  @add16(dst, src0, src1, config)
//  u16   u16  @add16(dst, src0, src1, config)
inline fn add(Tdst: type, Tsrc: type,
              dst: anytype, src0: anytype, src1: anytype,
              comptime config: anytype) bool;

// Subtraction.
//
//    T  Resulting Operation
// ----  -------------------
//  f16  @fsubh(dst, src0, src1, config)
//  f32  @fsubs(dst, src0, src1, config)
//  i16  @sub16(dst, src0, src1, config)
inline fn sub(T: type, dst: anytype, src0: anytype, src1: anytype,
              comptime config: anytype) bool;

// Multiplication.
//
//    T  Resulting Operation
// ----  -------------------
//  f16  @fmulh(dst, src0, src1, config)
//  f32  @fmuls(dst, src0, src1, config)
inline fn mul(T: type, dst: anytype, src0: anytype, src1: anytype,
              comptime config: anytype) bool;

// Fused multiply and accumulate.
//
// Tdst  Tsrc  Resulting Operation
// ----  ----  -------------------
//  f16   f16  @fmach(dst, src0, src1, x, config)
//  f32   f16  @fmachs(dst, src0, src1, x, config)
//  f32   f32  @fmacs(dst, src0, src1, x, config)
inline fn fmac(Tdst: type, Tsrc: type, dst: anytype,
               src0: anytype, src1: anytype, x: anytype,
               comptime config: anytype) bool;

// Arithmetic negation.
//
//   T  Resulting Operation
// ---  -------------------
// f16  @fnegh(dst, src, config)
// f32  @fnegs(dst, src, config)
inline fn neg(T: type, dst: anytype, src: anytype,
              comptime config: anytype) bool;

// Absolute value.
//
// See also 'math.abs', which is more appropriate for scalar data.
//
//   T  Resulting Operation
// ---  -------------------
// f16  @fabsh(dst, src, config)
// f32  @fabss(dst, src, config)
inline fn abs(T: type, dst: anytype, src: anytype,
              comptime config: anytype) bool;

// Floating point normalization.
//
//   T  Resulting Operation
// ---  -------------------
// f16  @fnormh(dst, src, config)
// f32  @fnorms(dst, src, config)
inline fn norm(T: type, dst: anytype, src: anytype,
               comptime config: anytype) bool;

// Floating-point exponent scaling.
//
//    T  Resulting Operation
// ----  -------------------
//  f16  @fscaleh(dst, src0, src1, config)
//  f32  @fscales(dst, src0, src1, config)
inline fn scale(T: type, dst: anytype, src0: anytype, src1: anytype,
                comptime config: anytype) bool;

// Elementwise maximum.
//
// See also 'math.max', which is more appropriate for scalar data.
//
//    T  Resulting Operation
// ----  -------------------
//  f16  @fmaxh(dst, src0, src1, config)
//  f32  @fmaxs(dst, src0, src1, config)
inline fn max(T: type, dst: anytype, src0: anytype, src1: anytype,
              comptime config: anytype) bool;

Example

The following example illustrates the use of dsd_ops to build a generic module that instantiates a local task with a given ID, and moves data from the given input color via the given input queue, into a user-specified buffer buf.

// Filename: reader.csl
param buf;
param taskId: local_task_id;
param inputColor: color;
param inputQueue: input_queue;

const dsd_ops = @import_module("<dsd_ops>");

comptime {
  @comptime_print(bufElemCount);
  @comptime_print(bufElemType);
}
const bufType = @type_of(buf.*);
const bufElemCount = @element_count(bufType);
const bufElemType = @element_type(bufType);
const bufDSD = @get_dsd(mem1d_dsd, .{ .base_address = buf,
                                      .extent = bufElemCount });

const inputDSD = @get_dsd(fabin_dsd, .{ .extent = bufElemCount,
                                        .fabric_color = inputColor,
                                        .input_queue = inputQueue });

task t() void {
  dsd_ops.mov(bufElemType, bufDSD, inputDSD, .{});
}

fn bind_and_activate() void {
  @bind_local_task(t, taskId);
  @activate(taskId);
}

// Filename: main.csl
var bufOfInt16 = @zeros([16]i16);
var bufOfFloat32 = @zeros([32]f32);

const readerInt16 = @import_module(
  "reader.csl",
  .{
    .buf = &bufOfInt16,
    .taskId = @get_local_task_id(8),
    .inputColor = @get_color(2),
    .inputQueue = @get_input_queue(2)
  }
);

const readerFloat32 = @import_module(
  "reader.csl",
  .{
    .buf = &bufOfFloat32,
    .taskId = @get_local_task_id(9),
    .inputColor = @get_color(3),
    .inputQueue = @get_input_queue(3)
  }
);

comptime {
  readerInt16.bind_and_activate();
  readerFloat32.bind_and_activate();
}
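
The two-type wrappers are called the same way. As a further sketch under assumptions (the buffer names, the extent of 8, and the function name widen are illustrative), the following uses convert to widen an f16 buffer into an f32 buffer, which according to the table above resolves to @fh2s:

```csl
const dsd_ops = @import_module("<dsd_ops>");

var halves = @zeros([8]f16);
var singles = @zeros([8]f32);

const halves_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &halves,
                                          .extent = 8 });
const singles_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &singles,
                                           .extent = 8 });

// Hypothetical helper: converts all eight f16 values to f32.
fn widen() void {
  // Tdst = f32 and Tsrc = f16 select @fh2s(dst, src, config).
  dsd_ops.convert(f32, f16, singles_dsd, halves_dsd, .{});
}
```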

<empty>

This library is empty on purpose. This allows a conditional module import as follows:

const cache = @import_module(if (stage == 0) "<empty>" else cache_name);

<layout>

This library provides access to information about where the PE is located. Specifically, the x and y coordinates in the rectangle can be accessed at runtime, allowing code to be shared between PEs at different locations.

const layout_module = @import_module("<layout>");

// Return the 0-indexed x and y coord
// Only supported on WSE-2
layout_module.get_x_coord() u16;
layout_module.get_y_coord() u16;
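
For instance, shared code can branch on its own position. The sketch below is illustrative only (the variable is_origin and the function record_position are hypothetical names) and is subject to the WSE-2 restriction noted above:

```csl
const layout_module = @import_module("<layout>");

var is_origin: bool = false;

// Hypothetical helper: records whether this PE sits at (0, 0).
fn record_position() void {
  if (layout_module.get_x_coord() == 0) {
    if (layout_module.get_y_coord() == 0) {
      is_origin = true;
    }
  }
}
```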

<malloc>

The malloc library implements an arena allocator using a statically allocated buffer.

In arena allocators, a single buffer (arena) is used to ensure that all objects are allocated sequentially in memory. Allocating and deallocating memory are fast operations, requiring an addition and/or assignment. The free operation frees all allocated objects at once.

The parameter buffer_num_words specifies the number of words of the statically allocated buffer.

If the param asserts_enabled is true, all allocations assert that the buffer has enough free memory. The default is false.

// specify buffer size
const mem = @import_module("<malloc>", .{ .buffer_num_words = <buffer_size> });

// mem provides the following API:

// returns pointer to num_values of type T
fn malloc(comptime T: type, num_values: u16) [*]T

// non-generic versions, return pointer to num_values of corresponding type
fn malloc_i16(num_values:u16) [*]i16
fn malloc_u16(num_values:u16) [*]u16
fn malloc_f16(num_values:u16) [*]f16
fn malloc_i32(num_values:u16) [*]i32
fn malloc_u32(num_values:u16) [*]u32
fn malloc_f32(num_values:u16) [*]f32

// returns true if an allocation of num_values elements of type T would
// succeed
fn has_enough_space(comptime T: type, num_values: u16) bool

// non-generic versions
// returns true if an allocation of num_words elements of type
// i16, u16, f16 would succeed:
fn has_enough_words(num_words:u16) bool
// returns true if an allocation of num_double_words elements of type
// i32, u32, f32 would succeed:
fn has_enough_double_words(num_double_words:u16) bool

// frees the entire buffer
fn free() void;
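
A typical allocate, use, and free cycle might look like the following sketch. The buffer size of 256 words and the function name sum_of_ones are arbitrary choices made for illustration.

```csl
const mem = @import_module("<malloc>", .{ .buffer_num_words = 256 });

// Hypothetical helper: fills a scratch array with ones and sums it.
fn sum_of_ones(n: u16) f32 {
  // Check capacity first, since asserts are disabled by default.
  if (!mem.has_enough_space(f32, n)) {
    return 0.0;
  }

  const scratch = mem.malloc(f32, n);

  var i: u16 = 0;
  while (i < n) : (i += 1) {
    scratch[i] = 1.0;
  }

  var total: f32 = 0.0;
  i = 0;
  while (i < n) : (i += 1) {
    total += scratch[i];
  }

  // Releases every allocation made from the arena, including `scratch`.
  mem.free();
  return total;
}
```

Because free releases the whole arena, no pointer obtained from malloc should be used after this call.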

<math>

The math library functions are named using the convention <operationName>_<principalType>(). For example, the sin function over f32 values is named sin_f32.

Math constants

The following can be used anywhere a floating point number is needed.

const PI : comptime_float
const E : comptime_float

Math functions

The math library provides standard mathematical functions. They are written as generic functions to facilitate use in other libraries or abstractions. In addition, non-generic f16 and f32 functions are provided. These functions have names suffixed with _f16 and _f32, respectively.

The following functions are provided:

// T can be f16 or f32
fn POSITIVE_INF(comptime T: type) T
fn NEGATIVE_INF(comptime T: type) T
fn NaN(comptime T: type) T

// x can be f16, f32, i8, i16, i32, i64,
// u8, u16, u32, or u64
fn abs(x: anytype) @type_of(x)
fn max(x: anytype, y: @type_of(x)) @type_of(x)
fn min(x: anytype, y: @type_of(x)) @type_of(x)
fn sign(x: anytype) @type_of(x)

// x can be f16 or f32
fn ceil(x: anytype) @type_of(x)
fn cos(x: anytype) @type_of(x)
fn exp(x: anytype) @type_of(x)
fn floor(x: anytype) @type_of(x)
fn fscale(f: anytype, s: i16) @type_of(f)
fn inv(x: anytype) @type_of(x)
fn invsqrt(x: anytype) @type_of(x)
fn isNaN(x: anytype) bool
fn isInf(x: anytype) bool
fn isFinite(x: anytype) bool
fn isSignaling(x: anytype) bool
fn log(x: anytype) @type_of(x)
fn pow(x: anytype, y: @type_of(x)) @type_of(x)
fn sig(x: anytype) @type_of(x)
fn signbit(x: anytype) bool
fn sin(x: anytype) @type_of(x)
fn sqrt(x: anytype) @type_of(x)
fn tanh(x: anytype) @type_of(x)

Example

const math = @import_module("<math>");

var x: f16;

task t() void {
  if (!math.isFinite(x)) {
    x = 0.0;
  }
  var one = math.pow(math.sin(x), 2.0) + math.pow(math.cos(x), 2.0);
  if (math.abs(math.log(one)) > 0.001) {
    x = math.NaN(f16);
  }
}

The same code can be written using non-generic functions:

task t() void {
  if (!math.isFinite_f16(x)) {
    x = 0.0;
  }
  var one = math.pow_f16(math.sin_f16(x), 2.0) +
    math.pow_f16(math.cos_f16(x), 2.0);
  if (math.abs_f16(math.log_f16(one)) > 0.001) {
    x = math.NaN_f16;
  }
}

Note on sin and cos accuracy

Both f16 and f32 versions of sin and cos will produce incorrect results when abs(x) ≥ 16384π (approximately 51472).
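
One way to guard against this is to check the argument before calling sin. The following is a sketch rather than part of the library: the constant SIN_LIMIT and the function checked_sin are hypothetical names, and returning NaN for out-of-range inputs is an arbitrary design choice.

```csl
const math = @import_module("<math>");

// 16384 * pi, the magnitude at which sin and cos stop being reliable.
const SIN_LIMIT: f32 = 16384.0 * math.PI;

// Hypothetical wrapper: returns NaN instead of an incorrect result.
fn checked_sin(x: f32) f32 {
  if (math.abs(x) >= SIN_LIMIT) {
    return math.NaN(f32);
  }
  return math.sin(x);
}
```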

<random>

The random library provides utility functions that wrap the @random16 builtin to create random values across various ranges and distributions.

See @random16 for information on the PRNG used by these functions.

// sets the global state of the PRNG to `seed`
fn set_global_prng_seed(seed: u32) void

// generate a random 16-bit number in the range [lower, upper)
fn random_f16(lower: f16, upper: f16) f16

// generate a random 32-bit number in the range [lower, upper)
fn random_f32(lower: f32, upper: f32) f32

// generate a uniform random number in the range [0, 2^pow)
fn random_pow_u32(pow : u16) u32

// generate a normally distributed number using the Box-Muller transform
fn random_normal_f32() f32
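
A short usage sketch, assuming an arbitrary seed value of 42 and a hypothetical function name noisy_sample:

```csl
const random = @import_module("<random>");

// Hypothetical helper: seeds the PRNG, then combines a value drawn from
// [-1.0, 1.0) with a normally distributed value.
fn noisy_sample() f32 {
  random.set_global_prng_seed(42);

  const u = random.random_f32(-1.0, 1.0);
  const n = random.random_normal_f32();
  return u + n;
}
```

In practice the seed would usually be set once during initialization rather than on every call.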

<simprint>

The simprint library contains functions to print strings and various numeric data types to the simulator logs. This is intended primarily for debugging, as the printed output is not visible when running on hardware.

Messages produced by the simprint library are stored by the simulator in fixed-size buffers, with one buffer per PE. A buffer will be flushed, with its contents printed to the simulator logs, when the buffer is full or a "\n" newline character is produced. Any data remaining in a PE’s print buffer at the end of simulator execution will be silently discarded.

Basic printing functions

// Prints a comptime string `s` to the simulator logs.
//
// Note that if the string contains a zero (NUL) byte, the output will be
// truncated at the NUL.
fn print_string(comptime s: comptime_string) void;

// Prints an unsigned 16-bit integer `x` in binary. The output will be 16
// characters wide, with zero-padding inserted on the left if needed.
fn print_u16_binary(x: u16) void;

// Prints an unsigned 16-bit integer `x` in decimal.
fn print_u16_decimal(x: u16) void;

// Prints an unsigned 16-bit integer `x` in hex. The output will be 4
// characters wide, with zero-padding inserted on the left if needed.
fn print_u16_hex(x: u16) void;

// Prints a 16-bit floating-point value `x` in decimal.
fn print_f16(x: f16) void;

// Prints an unsigned 32-bit integer `x` in binary. The output will be 32
// characters wide, with zero-padding inserted on the left if needed.
fn print_u32_binary(x: u32) void;

// Prints an unsigned 32-bit integer `x` in decimal.
fn print_u32_decimal(x: u32) void;

// Prints an unsigned 32-bit integer `x` in hex. The output will be 8
// characters wide, with zero-padding inserted on the left if needed.
fn print_u32_hex(x: u32) void;

// Prints a 32-bit floating-point value `x` in decimal.
fn print_f32(x: f32) void;

For example:

// Assume the print buffer is empty to start with.

// "42" will _not_ immediately be displayed in the simulator logs by the
// following statement.
simprint.print_u16_decimal(42);

// The following statement will force the buffer to flush, so the "42"
// will be visible in the logs.
simprint.print_string("\n");

Format strings

Three functions are provided to print formatted strings:

// Prints a formatted string to the simulator logs. The format string is
// a compile-time string, and the arguments are the values to be inserted
// into the format string. A newline is automatically printed after the
// formatted string.
fn fmt(comptime fstr: comptime_string, args: anytype) void;

// As above, but does not print a newline after the formatted string.
// Note that output is not flushed to the simulator logs until a newline
// is encountered or an internal buffer fills up, so it is recommended to
// follow up with something that will print a newline.
fn fmt_no_nl(comptime fstr: comptime_string, args: anytype) void;

// Like 'fmt', but prepends the coordinates of the current PE to the
// output line, in the format "PE(X,Y): ".
fn fmt_with_coords(comptime fstr: comptime_string, args: anytype) void;

Format specifiers are wrapped in curly braces, and correspond positionally to the arguments in args. Available format specifiers are:

  • {d}: print the argument as a decimal number. Argument must have type u16 or u32.

  • {X}: print the argument as a hexadecimal number in upper case. Argument must have type u16 or u32.

  • {b}: print the argument as a binary number. Argument must have type u16 or u32.

  • {f}: print the argument as a floating-point number. Argument must have type f16 or f32.

A literal { character may be escaped by doubling it. For example, {{hello} will print as {hello}.

Warning

Code that uses fmt or fmt_no_nl is likely to exceed the compiler’s default limit on inline loop unrolling. If you encounter the error exceeded the maximum of 50 inlined iterations when compiling, you can add the flag --max-inlined-iterations=1000000 to the compile command to bypass this issue.

For example:

simprint.fmt(
  "{d} {X} {b} {f}",
  .{ @as(u16,42), @as(u16,42), @as(u16,42), @as(f16,42.0) }
);

// The above will print:
// 42 002A 0000000000101010 42.0

simprint.fmt_no_nl(
  "{d} {X}",
  .{ @as(u16,42), @as(u16,42) }
);
simprint.print_string(" ");
simprint.fmt_no_nl(
  "{b} {f}",
  .{ @as(u16,42), @as(f16,42.0) }
);
simprint.print_string("\n");

// The above will also print:
// 42 002A 0000000000101010 42.0
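
The coordinate prefix and brace escaping can be sketched in the same way. The function name report is hypothetical, and the described output follows only from the rules stated above:

```csl
const simprint = @import_module("<simprint>");

// Hypothetical helper that prints `count` in two forms.
fn report(count: u16) void {
  // Prints a line of the form "PE(X,Y): count = <count>", where X and Y
  // are the coordinates of the current PE.
  simprint.fmt_with_coords("count = {d}", .{ count });

  // "{{" produces a literal "{", so this line begins with "{count} = ".
  simprint.fmt("{{count} = {d}", .{ count });
}
```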

Disabling output

Sometimes it is useful to disable all of the debug prints produced by a particular instance of the simprint module, while keeping the option to turn them back on later. This helps save on runtime and space overhead, and can also be used to conditionally enable or disable debug printing on certain PEs. Prints originating from a specific simprint instance can be disabled by setting the enable parameter to false at import time.

const simprint = @import_module("<simprint>", .{ .enable = false });

The enable parameter is optional. Its default value is true, which means that printing is enabled.

<tile_config>

The tile_config library contains APIs relating to the hardware configuration of a PE. It contains the following top-level constants:

// The base addresses of memory-mapped registers
const addresses: enum(reg_type)

// The type of a word-sized memory-mapped register
const reg_type: type

// The type of a memory-mapped register occupying two words
const double_reg_type: type

// The name of the target architecture, such as "wse2"
const target_name: comptime_string

// The size of a word in bytes
const word_size: comptime_int

The tile_config library also contains an API to access the PE’s coordinates in the rectangle at runtime.

const fabric_coord = enum(reg_type) {
  X,
  Y
};
fn get_fabric_coord(dimension: fabric_coord) u16
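
For example, both coordinates can be read as follows. This is a minimal sketch; the function name coordinate_sum is hypothetical, and it assumes the enum values are referenced through the module as tile_config.fabric_coord.X and tile_config.fabric_coord.Y.

```csl
const tile_config = @import_module("<tile_config>");

// Hypothetical helper: returns x + y for the current PE.
fn coordinate_sum() u16 {
  const x = tile_config.get_fabric_coord(tile_config.fabric_coord.X);
  const y = tile_config.get_fabric_coord(tile_config.fabric_coord.Y);
  return x + y;
}
```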

color_config

This submodule of tile_config contains APIs and an enum type for changing the configuration of a given color during a teardown phase.

First of all, the color_config submodule defines the following enum type:

const fabric_io = enum(u16) {
  TX_WEST,
  TX_EAST,
  TX_SOUTH,
  TX_NORTH,
  TX_RAMP,

  RX_WEST,
  RX_EAST,
  RX_SOUTH,
  RX_NORTH,
  RX_RAMP
};

This enum consists of all the input and output routing directions which can be used to specify the routing direction we wish to modify.

The color_config submodule provides the following functions:

// Returns the word address of a color configuration that corresponds to
// color `c`.
fn get_color_config_addr(c: color) reg_type

// Enables `dir` direction for a color `c` or a word address `c`
// of a color configuration register.
fn set_io_direction(c: anytype, dir: fabric_io) void

// For a given color `c` or a word address `c` of a color configuration
// register, toggle the I/O direction `dir`.
fn toggle_io_direction(c: anytype, dir: fabric_io) void

// For a given color `c` or a word address `c` of a color configuration
// register, clear the setting for `dir`.
fn clear_io_direction(c: anytype, dir: fabric_io) void

// For a given color `c` or a word address `c` of a color configuration
// register, clear all I/O routes and reset them according to `new_routes`.
// The input routes are specified as the `rx` field of `new_routes` which
// must be an array of `direction` values or a single `direction` value.
// Similarly, the output routes are specified as the `tx` field of
// `new_routes` which must be an array of `direction` values or a single
// `direction` value. In `WSE3`, if `rx` is an array, it must have a
// single element (i.e., it must hold a value of type [1]direction).
fn reset_routes(c: anytype, comptime new_routes: comptime_struct) void

These functions can be used as follows:

param red: color;
var blue: color;
const tile_config = @import_module("<tile_config>");
const color_config = tile_config.color_config;
const red_addr = color_config.get_color_config_addr(red);

task teardown() void {
  // Color `blue` is a `var` and therefore not known until runtime.
  const addr = color_config.get_color_config_addr(blue);

  // We can manipulate the I/O routing configuration for a
  // comptime-known color, a runtime color or an address to a
  // color configuration register.
  const offramp = color_config.fabric_io.TX_RAMP;
  const onramp = color_config.fabric_io.RX_RAMP;
  const rx_north = color_config.fabric_io.RX_NORTH;
  color_config.set_io_direction(red, offramp);
  color_config.set_io_direction(blue, onramp);
  color_config.set_io_direction(addr, rx_north);

  // Reset the routes for color `red` using multiple directions.
  color_config.reset_routes(red, .{.tx = [3]direction{NORTH, SOUTH, WEST},
                                   .rx = [2]direction{RAMP, EAST}
                                  });
  // Reset the routes for color `red` using single directions.
  color_config.reset_routes(red, .{.tx = NORTH, .rx = EAST});
}

control_transform

This submodule of tile_config contains a function for setting the mask for transforming the index part of control wavelets. This function is to be used together with the DSD property control_transform to XOR the first six bits of the index portion of a wavelet with the specified mask.

fn set_mask(mask: reg_type) void

This function can be used like:

const tile_config = @import_module("<tile_config>");
const ctrl_xform = tile_config.control_transform;

var in_dsd = @get_dsd(fabin_dsd, .{ .fabric_color = recv_channel,
                                    .extent = 100,
                                    .input_queue = @get_input_queue(0),
                                    .control_transform = true });
const out_dsd = @get_dsd(fabout_dsd, .{ .extent = 100,
                                        .fabric_color = send_channel,
                                        .output_queue = @get_output_queue(1),
                                        .control_transform = true });

var buf = @zeros([5]u32);
const fifo = @allocate_fifo(buf);

task buffer() void {
  @mov32(fifo, in_dsd, .{ .async = true });
  @mov32(out_dsd, fifo, .{ .async = true });
}

comptime {
  ctrl_xform.set_mask(2);
}

The set_mask function can be used either at comptime or runtime. Only the first six bits of the mask are taken into account.
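
As a minimal sketch of runtime use, the task below installs a different mask while the program is running. The task name update_mask and the mask value are illustrative assumptions, not part of the library, and the binding of the task to a task ID is omitted:

const tile_config = @import_module("<tile_config>");
const ctrl_xform = tile_config.control_transform;

// Hypothetical task that switches the mask at runtime.
task update_mask() void {
  // 42 is 0b101010, which fits in the six bits that are taken into account.
  ctrl_xform.set_mask(42);
}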

exceptions

This submodule of tile_config contains functions for setting values in the exception mask register. The exception mask register determines which exceptions cause the processor to stop. An unmasked exception causes the processor to immediately stop execution. A masked exception allows execution to continue. By default, all exceptions are masked. The functions in this submodule can be used to unmask them.

// Exceptions which can be unmasked with the below functions
const PERF_CNT_0_OVERFLOW = ...;
const PERF_CNT_1_OVERFLOW = ...;
const SW_EXCEPTION = ...;
const FP_UNDERFLOW = ...;
const FP_OVERFLOW = ...;
const FP_INEXACT = ...;
const FP_INVALID = ...;
const FP_DIV_BY_0 = ...;

// Unmask one of the exceptions above
fn set_exception_mask(exception_mask: reg_type) void;

This submodule can be used as follows:

const tile_config = @import_module("<tile_config>");
const exceptions = tile_config.exceptions;

fn fp_div_by_0() f32 {
  // Unmask FP_DIV_BY_0. When a floating-point divide by zero
  // occurs, the processor stops execution.
  exceptions.set_exception_mask(exceptions.FP_DIV_BY_0);

  var x : f32 = 42.0;
  var y : f32 = 0.0;

  // This operation is a divide by zero, so the processor should stop
  return x / y;
}

Each call to set_exception_mask overwrites the exception mask register, so any exception not named in the latest call is masked again. To unmask several exceptions at once, combine them with bitwise OR in a single call:

// FP_DIV_BY_0 is unmasked.
exceptions.set_exception_mask(exceptions.FP_DIV_BY_0);

// FP_DIV_BY_0 is masked again.
// FP_OVERFLOW and FP_UNDERFLOW are now unmasked.
exceptions.set_exception_mask(exceptions.FP_OVERFLOW
                            | exceptions.FP_UNDERFLOW);

filters

This submodule of tile_config contains APIs for configuring filters:

// The number of filters provided by the architecture.
const num_filters: comptime_int

// Set the active limit of a counter filter identified by `filter_id`
// to `limit`.
fn set_active_limit(filter_id: u16, limit: reg_type) void

// Set the maximum counter value of a counter filter identified by
// `filter_id` to `max_counter`.
fn set_max_counter(filter_id: u16, max_counter: reg_type) void

// Set the counter value of a counter filter identified by `filter_id`
// to `counter`.
fn set_counter(filter_id: u16, counter: reg_type) void

These functions can be used like:

const config = @import_module("<tile_config>");
// Set the counter of filter ID 1 to 0
config.filters.set_counter(1, 0);
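
A slightly fuller sketch, assuming a counter filter with ID 0 has already been attached to a color elsewhere. The helper name configure_counter_filter and all numeric values are hypothetical. This section does not state whether these calls are also valid at comptime, so the sketch calls them from a runtime function:

const config = @import_module("<tile_config>");
const filters = config.filters;

// Hypothetical filter ID, checked against the number of available filters.
const filter_id: u16 = 0;

comptime {
  @comptime_assert(filter_id < filters.num_filters);
}

fn configure_counter_filter() void {
  // Arbitrary example values for the three counter filter settings.
  filters.set_max_counter(filter_id, 7);
  filters.set_active_limit(filter_id, 3);
  filters.set_counter(filter_id, 0);
}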

main_thread_priority

This submodule of tile_config contains APIs for configuring main thread priority. The main thread is the thread that executes non-async operations. Operations tagged with async execute on a microthread, which is associated with a fabric input or output queue. Main thread priority and microthread priority determine the relative scheduling priority of the threads.

// Enum for main thread priorities. The meanings of main thread priority
// levels are relative to microthread priorities, as follows:
//
//   MEDIUM_LOW: Between low- and medium-priority microthreads.
//   MEDIUM: Same priority as medium-priority microthreads.
//   MEDIUM_HIGH: Between medium- and high-priority microthreads.
//   HIGH: Same priority as high-priority microthreads.
const level = enum(u16) {
  MEDIUM_LOW = ...,
  MEDIUM = ...,
  MEDIUM_HIGH = ...,
  HIGH = ...
};

// Updates the priority for the main thread to `priority`. Note that updates
// to main thread priority made at runtime may take a few clock cycles to
// take effect. This function may be used at comptime or at runtime.
fn update_main_thread_priority(priority: level) void;

This function can be used like:

const config = @import_module("<tile_config>");
const mt_priority = config.main_thread_priority;

comptime {
  mt_priority.update_main_thread_priority(mt_priority.level.MEDIUM);
}

task main() void {
  mt_priority.update_main_thread_priority(mt_priority.level.MEDIUM_HIGH);
}

switch_config

This submodule of tile_config contains APIs and enum types that can be used to change the switch configuration of a given color during a teardown phase.

The switch_config submodule defines the following enum types:

const pop_mode = enum(u16) {
  NO_POP,
  ALWAYS_POP,
  POP_ON_ADVANCE,

  CLEAR_MASK
};

const switch_status = enum(u16) {
  CLEAR_CURRENT_POS
};

Each enum type represents one category of switch settings (pop mode or switch status). Passing an enum type as the setting_ty argument of get_switch_config_addr selects the category of settings to address.

In addition, the switch_config submodule provides the following functions:

// Returns the word address of a given color `c` and switch setting type
// `setting_ty`.
fn get_switch_config_addr(c: color, comptime setting_ty: type) reg_type

// Returns the word address of a color configuration that corresponds to
// color `c`.
fn get_color_config_addr(c: color) reg_type

// For a given color `c` or a word address `c` of a color configuration
// register, clear its current switch position.
fn clear_current_position(c: anytype) void

// For a given color `c` or a word address `c` of a color configuration
// register, set the given ring mode to NO_RING_MODE or RING_MODE.
fn set_ring_mode(c: anytype, comptime mode: ring_mode) void

// For a given color `c` or a word address `c` of a color configuration
// register, set the given pop mode `mode` after clearing the previous one.
fn set_pop_mode(c: anytype, mode: pop_mode) void

// For a given color `c` or a word address `c` of a color configuration
// register, make all switch positions invalid.
fn set_invalid_for_all_switch_positions(c: anytype) void

// For a given color `c` or a word address `c` of a color configuration
// register, set RX direction of switch position 1 to `dir`.
fn set_rx_switch_pos1(c: anytype, comptime dir: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set TX direction of switch position 1 to `dir`.
fn set_tx_switch_pos1(c: anytype, comptime dir: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set RX direction of switch position 1 to `dir_rx`, and
// TX direction of switch position 1 to `dir_tx`.
// NOTE: This function is supported on wse3 and beyond only.
fn set_rxtx_switch_pos1(c: anytype, comptime dir_rx: direction,
                        comptime dir_tx: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set RX direction of switch position 2 to `dir`.
fn set_rx_switch_pos2(c: anytype, comptime dir: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set TX direction of switch position 2 to `dir`.
fn set_tx_switch_pos2(c: anytype, comptime dir: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set RX direction of switch position 2 to `dir_rx`, and
// TX direction of switch position 2 to `dir_tx`.
// NOTE: This function is supported on wse3 and beyond only.
fn set_rxtx_switch_pos2(c: anytype, comptime dir_rx: direction,
                        comptime dir_tx: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set RX direction of switch position 3 to `dir`.
fn set_rx_switch_pos3(c: anytype, comptime dir: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set TX direction of switch position 3 to `dir`.
fn set_tx_switch_pos3(c: anytype, comptime dir: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set RX direction of switch position 3 to `dir_rx`, and
// TX direction of switch position 3 to `dir_tx`.
// NOTE: This function is supported on wse3 and beyond only.
fn set_rxtx_switch_pos3(c: anytype, comptime dir_rx: direction,
                        comptime dir_tx: direction) void

// For a given color `c` or a word address `c` of a color configuration
// register, set the switch position to `posn`,
// where `posn` is 0, 1, 2, or 3.
fn set_current_switch_position(c: anytype, posn: u16) void

These functions can be used as follows:

param red: color;
var blue: color;
const tile_config = @import_module("<tile_config>");
const switch_config = tile_config.switch_config;
const switch_status_addr =
      switch_config.get_switch_config_addr(red,
                                          switch_config.switch_status);

task teardown() void {
  // Color `blue` is a `var` and therefore not known until runtime.
  const addr = switch_config.get_switch_config_addr(blue,
                                                    switch_config.pop_mode);

  // We can manipulate the switch configuration using a
  // comptime-known color, a runtime color or an address to
  // a color configuration register.
  switch_config.clear_current_position(red);
  switch_config.clear_current_position(blue);
  switch_config.clear_current_position(switch_status_addr);

  // Reset the pop mode to a new setting.
  switch_config.set_pop_mode(red, switch_config.pop_mode.ALWAYS_POP);
}
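
The per-position setters can be combined in the same way. The following is a sketch only: the task name, the chosen directions, and the order of the calls are assumptions made for illustration, and direction values are written bare as in the color_config example above:

param red: color;
const tile_config = @import_module("<tile_config>");
const switch_config = tile_config.switch_config;

// Hypothetical task that redefines switch position 1 for color `red`.
task reconfigure() void {
  // Invalidate every switch position before defining a new one.
  switch_config.set_invalid_for_all_switch_positions(red);

  // Switch position 1: receive from WEST, transmit to EAST.
  switch_config.set_rx_switch_pos1(red, WEST);
  switch_config.set_tx_switch_pos1(red, EAST);

  // Start from switch position 0.
  switch_config.set_current_switch_position(red, 0);
}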

task_priority

This submodule of tile_config contains APIs for configuring task priority:

// Enum for task priorities: either HIGH or LOW.
const level = enum(u16) {
  LOW = 0,
  HIGH = 1
};

// Updates the task priority associated with `task_id` to `priority`.
fn update_task_priority(task_id: anytype, priority: level) void

// Sets the task priority associated with `task_id` to high.
fn set_task_priority(task_id: anytype) void

// Sets the task priority associated with `task_id` to low.
fn clear_task_priority(task_id: anytype) void

The provided task_id can be a data_task_id or local_task_id to set the priority of the associated task.

In addition, the priority of tasks activated by wavelets, including tasks bound to a control_task_id, can be specified using the color on WSE-2, or the input_queue on WSE-3, that carries the wavelets.

Note that updates to task priority made at runtime may take a few clock cycles to take effect. These functions may be used at comptime or at runtime.

These functions can be used like:

const config = @import_module("<tile_config>");
const task_priority = config.task_priority;
const task_priority_level = task_priority.level;

param high_id: data_task_id;
param low_id: local_task_id;

comptime {
  // Equivalent to:
  //   task_priority.update_task_priority(
  //     high_id,
  //     task_priority_level.HIGH);
  task_priority.set_task_priority(high_id);
}

task main() void {
  // Equivalent to:
  //   task_priority.update_task_priority(
  //     low_id,
  //     task_priority_level.LOW);
  task_priority.clear_task_priority(low_id);
}

teardown

This submodule of tile_config contains teardown APIs:

// Returns the task ID that is reserved for the teardown handler.
fn get_task_id() local_task_id

// Returns the values of the "teardown-pending" registers combined into
// one value. Only the first invocation of this function per task is
// guaranteed to return the correct value. Any additional calls within
// the same task have undefined results.
fn get_pending() double_reg_type

// Given a value that represents the "teardown-pending" state, which has 1
// bit per routable color indicating the ones that are currently in
// teardown state, return `true` iff the input color `c` is in teardown
// state.
fn is_pending(value: double_reg_type, c: color) bool

// Exit the teardown state for a given color `c`.
fn exit(c: color) void

These functions can be used like:

const config = @import_module("<tile_config>");
// Check if teardown is pending on color 8 or 9
var pendings = config.teardown.get_pending();
const pending_8_or_9: bool =
    config.teardown.is_pending(pendings, @get_color(8)) or
    config.teardown.is_pending(pendings, @get_color(9));
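
Putting the four functions together, a teardown handler might look like the sketch below. The task name, the color parameter, and the assumption that get_task_id can be evaluated in a comptime block are illustrative and not confirmed by this section:

const config = @import_module("<tile_config>");
const teardown = config.teardown;

param red: color;

// Hypothetical handler for the reserved teardown task ID.
task teardown_handler() void {
  // Read the pending state exactly once per task invocation.
  const pendings = teardown.get_pending();

  if (teardown.is_pending(pendings, red)) {
    // Reconfigure the color here (for example with switch_config),
    // then leave the teardown state.
    teardown.exit(red);
  }
}

comptime {
  @bind_local_task(teardown_handler, teardown.get_task_id());
}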

<time>

The time library provides access to the 48-bit timestamp counter (TSC). A timestamp is returned as three 16-bit unsigned integers in little-endian order, so the first element holds the least significant 16 bits.

// enable tsc registers for capturing timestamps
fn enable_tsc() void;

// disable tsc registers for capturing timestamps
fn disable_tsc() void;

// write timestamp to array of three u16 values
fn get_timestamp(result : *[3]u16) void;

// reset tsc register to 0
fn reset_tsc_counter() void;
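
A minimal usage sketch that brackets a region of code with two timestamps. The function and buffer names are hypothetical:

const time = @import_module("<time>");

// Each buffer holds one 48-bit timestamp as three u16 words;
// index 0 is the least significant word.
var tsc_start = @zeros([3]u16);
var tsc_end = @zeros([3]u16);

// Hypothetical helper that times a region of work.
fn timed_region() void {
  time.enable_tsc();
  time.get_timestamp(&tsc_start);

  // ... work to be measured ...

  time.get_timestamp(&tsc_end);
  time.disable_tsc();
}

The elapsed cycle count is the difference between the two 48-bit values, which can be reconstructed from the word arrays on the host or on the PE.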

<kernels>

This library differs from all other libraries in that it provides kernels, as opposed to individual functions. The “tally” kernel implements a two-phase tally, used to coordinate the work done by multiple PEs. It is documented in the kernel code itself.

<tally>

The tally library implements a two-phase tally kernel that allows PEs within a rectangle to communicate progress/completion to the host.

The library consists of two modules:

  1. <kernels/tally/layout>: imported once and used in the layout block to parameterize each PE’s tally behavior.

  2. <kernels/tally/pe>: imported once by each PE, consuming the parameters generated by the layout module.

A minimal example of importing and using both modules, starting with the layout module:

// code.csl

const tally = @import_module("<kernels/tally/layout>", .{
  .kernel_height=8,
  .kernel_width=4,
  .phase2_tally=0,
  .colors=[3]color{@get_color(1), @get_color(2), @get_color(3)},
  .output_color=@get_color(0),
});

layout {
  @set_rectangle(4, 8);

  for (@range(u16, 4)) |i| {
    for (@range(u16, 8)) |j| {
      @set_tile_code(i, j, "pe.csl", .{
        .tally_params = tally.get_params(i, j),
      });
    }
  }
}

And the per-PE module:

// pe.csl

param tally_params: comptime_struct;

// On WSE-2, input_queues and output_queues can be the same.
// On WSE-3, they must be different.
const tally = @import_module("<kernels/tally/pe>",
  @concat_structs(tally_params, .{
    .input_queues=[2]u16{0, 1},
    .output_queues=[2]u16{0, 1},
  }));

task done() void {
  tally.signal_completion();
}

...

The tally kernel operates in two phases.

In the first phase, every PE must signal completion at least once. For kernels where each PE knows when it is finished, this is the only phase needed.

The first phase ends when every PE has signaled completion at least once. During the second phase, PEs can bump (increase) the global tally. When the global tally meets or exceeds the phase2_tally parameter, the kernel signals completion by sending the total to the North on output_color from the PE at (kernel_width - 1, 0).

The second phase is optional. If phase2_tally == 0, the second phase will be skipped and the output signal on output_color will be 0.
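
To enable the second phase, the layout module would be imported with a nonzero phase2_tally. The sketch below reuses the parameters from the layout example above; the threshold of 100 is an arbitrary illustrative value, and the PE-side call used to bump the tally is not shown because its name is not given in this section:

const tally = @import_module("<kernels/tally/layout>", .{
  .kernel_height=8,
  .kernel_width=4,
  // Phase 2 completes once the global tally meets or exceeds 100.
  .phase2_tally=100,
  .colors=[3]color{@get_color(1), @get_color(2), @get_color(3)},
  .output_color=@get_color(0),
});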

<collectives_2d>

This library implements collective communication directives that allow PEs to communicate data with one another.

The library consists of two modules:

  1. <collectives_2d/params>: Imported once to parameterize each PE in the layout block.

  2. <collectives_2d/pe>: Imported once per dimension per PE. Contains collective communication directives for a single axis.

<collectives_2d/params>

The parameter module exposes a compile-time helper function for configuring PEs to use <collectives_2d>:

fn get_params(Px: u16, Py: u16, ids: comptime_struct) comptime_struct
  • Px is the PE’s x-coordinate.

  • Py is the PE’s y-coordinate.

  • ids is a struct that is expected to have either the x-related fields, the y-related fields, or all four, of the following:

    • x_colors: a struct containing 2 distinct colors as anonymous fields

    • x_entrypoints: a struct containing 2 distinct local task IDs as anonymous fields

    • y_colors: a struct containing 2 distinct colors as anonymous fields

    • y_entrypoints: a struct containing 2 distinct local task IDs as anonymous fields

  • Returns a struct containing the parameters necessary to import library modules for the specified PE. This struct contains:

    • x: an opaque struct containing parameters needed to configure collective communications in the x-dimension.

    • y: an opaque struct containing parameters needed to configure collective communications in the y-dimension.
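
As a sketch of the case where only the x-related fields are supplied, the hypothetical single-row layout below configures collectives along the x-dimension only. The rectangle shape, color IDs, and task IDs are illustrative assumptions:

// code.csl (hypothetical single-row layout)

param width: u16;

const c2d = @import_module("<collectives_2d/params>");

layout {
  @set_rectangle(width, 1);

  var x: u16 = 0;
  while (x < width) : (x += 1) {
    // Only x_colors and x_entrypoints are given.
    const c2d_params = c2d.get_params(x, 0, .{
      .x_colors = .{ @get_color(0), @get_color(1) },
      .x_entrypoints = .{ @get_local_task_id(2), @get_local_task_id(3) },
    });
    @set_tile_code(x, 0, "pe.csl", .{ .c2d_params = c2d_params });
  }
}

In this configuration, pe.csl would import <collectives_2d/pe> once, using c2d_params.x.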

<collectives_2d/pe>

The following directives are currently supported:

fn init() void
fn broadcast(root: u16, buf: [*]u32, count: u16, callback: local_task_id) void
fn scatter(root: u16, send_buf: [*]u32, recv_buf: [*]u32, count: u16,
           callback: local_task_id) void
fn gather(root: u16, send_buf: [*]u32, recv_buf: [*]u32, count: u16,
          callback: local_task_id) void
fn reduce_fadds(root: u16, send_buf: [*]f32, recv_buf: [*]f32, count: u16,
                callback: local_task_id) void

init initializes the library. It must be invoked for each axis.

broadcast transmits the contents of buf from the root PE to the buf of other PEs in the row or column. count should be the length of buf. It is akin to MPI_Bcast.

scatter transmits count-many elements from send_buf from the root PE to the recv_buf of other PEs in the row/column. It is akin to MPI_Scatter.

gather accumulates count-many elements from send_buf of other PEs into the recv_buf of the root PE. It is akin to MPI_Gather.

When distributing or aggregating elements using scatter or gather for N PEs, the send_buf or recv_buf should have space for count * N elements, respectively.
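
The buffer sizing rule can be sketched as follows for scatter. The row length num_pes, the element count, the root index 0, and the callback task ID are assumptions chosen for illustration:

// pe.csl (hypothetical fragment)

param c2d_params: comptime_struct;

// Queue and DSR IDs are omitted here, so the default values apply.
const mpi_x = @import_module("<collectives_2d/pe>",
                             .{ .dim_params = c2d_params.x });

const num_pes = 4;  // N: assumed number of PEs in the row
const count = 10;   // elements delivered to each PE

// The root's send_buf needs room for count * N elements;
// each PE's recv_buf needs room for count elements.
var send_data = @zeros([count * num_pes]u32);
var recv_data = @zeros([count]u32);

// Hypothetical callback ID, assumed not to be used by the library.
const done_id = @get_local_task_id(15);

fn start_scatter() void {
  mpi_x.init();
  mpi_x.scatter(0,
                @ptrcast([*]u32, &send_data),
                @ptrcast([*]u32, &recv_data),
                count, done_id);
}

For gather the roles are reversed: each PE sends count elements and the root's recv_buf needs room for count * N.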

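reduce_fadds computes a sum reduction over buffers of f32, delivering the result to the recv_buf of the root PE. It is akin to MPI_Reduce with the MPI_SUM operation.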

In general, all PEs must call the same directive with the same root and count. The primitives have the following common parameters:

  • root is the root PE for network configuration,

  • send_buf is a buffer containing data to be transmitted,

  • recv_buf is a buffer for holding data received,

  • count is the number of elements to be transmitted,

  • callback is activated when the primitive completes.
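
As an illustration of these parameters, the sketch below has every PE in a column call reduce_fadds with the same root and count. The buffer names, the count, and the callback task ID are hypothetical, and init is assumed to have been called for this axis beforehand:

// pe.csl (hypothetical fragment)

param c2d_params: comptime_struct;
param root: u16;

const mpi_y = @import_module("<collectives_2d/pe>",
                             .{ .dim_params = c2d_params.y });

const count = 8;
var contribution = @zeros([count]f32);  // this PE's input values
var total = @zeros([count]f32);         // holds the sums on the root PE

// Hypothetical callback ID, assumed not to be used by the library.
const done_id = @get_local_task_id(15);

fn start_reduce() void {
  // Every PE in the column makes the same call with the same root and count.
  mpi_y.reduce_fadds(root,
                     @ptrcast([*]f32, &contribution),
                     @ptrcast([*]f32, &total),
                     count, done_id);
}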

The user can configure the resources of collectives_2d. Each imported module must be assigned queue IDs (queues) and DSR IDs (dest_dsr_ids, src0_dsr_ids, src1_dsr_ids). If the user does not specify these parameters explicitly, the default values apply. The following example shows the default values of queue IDs and DSR IDs of collectives_2d.

A minimal example that sets up PEs to broadcast 10 elements from the root PE to every other PE in the row/column consists of the following layout code:

// code.csl

param width: u16;
param height: u16;
param root: u16;

const c2d = @import_module("<collectives_2d/params>");

layout {
  @set_rectangle(width, height);

  var x: u16 = 0;
  while (x < width) : (x += 1) {
    var y: u16 = 0;
    while (y < height) : (y += 1) {
      const c2d_params = c2d.get_params(x, y, .{
        .x_colors = .{
          @get_color(0),
          @get_color(1)
        },
        .x_entrypoints = .{
          @get_local_task_id(2),
          @get_local_task_id(3)
        },
        .y_colors = .{
          @get_color(4),
          @get_color(5)
        },
        .y_entrypoints = .{
          @get_local_task_id(6),
          @get_local_task_id(7)
        },
      });
      @set_tile_code(
        x,
        y,
        "pe.csl",
        .{ .root = root, .c2d_params = c2d_params }
      );
    }
  }
}

And the per-PE module:

// pe.csl

param c2d_params: comptime_struct;
param root: u16;

const rect_height = @get_rectangle().height;
const rect_width = @get_rectangle().width;

// Pick two task IDs not used in the library for callbacks
const x_task_id = @get_local_task_id(15);
const y_task_id = @get_local_task_id(16);

const len = 10;
var x_data = @zeros([len]u32);
var y_data = @zeros([len]u32);

const mpi_x = @import_module(
  "<collectives_2d/pe>",
  .{ .dim_params = c2d_params.x,
     .queues = [2]u16{2,4},
     .dest_dsr_ids = [1]u16{1},
     .src0_dsr_ids = [1]u16{1},
     .src1_dsr_ids = [1]u16{1}
   }
);

const mpi_y = @import_module(
  "<collectives_2d/pe>",
  .{ .dim_params = c2d_params.y,
     .queues = [2]u16{3,5},
     .dest_dsr_ids = [1]u16{2},
     .src0_dsr_ids = [1]u16{2},
     .src1_dsr_ids = [1]u16{2}
   }
);

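// Receive buffer used by non-root PEs in the x-dimension
var x_recv = @zeros([len]u32);

task x_task() void {
  var send_buf = @ptrcast([*]u32, &x_data);
  var recv_buf = @ptrcast([*]u32, &x_recv);

  // The root sends from x_data; every other PE receives into x_recv.
  if (root == mpi_x.pe_id) {
    mpi_x.broadcast(root, send_buf, len, x_task_id);
  } else {
    mpi_x.broadcast(root, recv_buf, len, x_task_id);
  }
}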

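// Receive buffer used by non-root PEs in the y-dimension
var y_recv = @zeros([len]u32);

task y_task() void {
  var send_buf = @ptrcast([*]u32, &y_data);
  var recv_buf = @ptrcast([*]u32, &y_recv);

  // The root sends from y_data; every other PE receives into y_recv.
  if (root == mpi_y.pe_id) {
    mpi_y.broadcast(root, send_buf, len, y_task_id);
  } else {
    mpi_y.broadcast(root, recv_buf, len, y_task_id);
  }
}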