SDK Release Notes
The following are the release notes for the Cerebras SDK.
Version 1.4.0¶
Released 26 May 2025
Note
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.4 supports SDK 1.3. See here for SDK 1.3 documentation.
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.5 supports SDK 1.4, the current version of SDK software.
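The appliance notes in this document pair each Cerebras ML Software release with one SDK version. As a convenience, those pairings can be collected in a small lookup table. This is only a sketch: the helper name `sdk_for_ml_software` is hypothetical and not part of the SDK, and the table holds only the pairings stated in these release notes.

```python
# Pairings copied from the "Note" blocks in these release notes:
# Cerebras ML Software release -> SDK version supported by the appliance.
ML_SOFTWARE_TO_SDK = {
    "2.0": "0.9",
    "2.1": "1.0",
    "2.2": "1.1",
    "2.3": "1.2",
    "2.4": "1.3",
    "2.5": "1.4",
}


def sdk_for_ml_software(ml_version: str) -> str:
    """Return the SDK version paired with an ML Software release.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    try:
        return ML_SOFTWARE_TO_SDK[ml_version]
    except KeyError:
        raise ValueError(
            f"no SDK pairing listed for ML Software {ml_version}"
        ) from None


if __name__ == "__main__":
    print(sdk_for_ml_software("2.5"))
```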
New features and enhancements¶
(beta) New SdkLayout program layout specification API:
Introduces a new SdkLayout Python API for specifying program layout. This API allows the user to define rectangular code regions, define color routing and switching, automatically allocate colors, and automatically route between code regions.
Introduces several example programs demonstrating the use of the SdkLayout API. See the list of new example programs below.
Introduces new documentation for this API. See SdkLayout API Reference.
This API is in beta. The memcpy API for data transfers and remote kernel launches is not currently supported. CSL libraries with their own internal color routing are not currently supported.
CSL language and compiler enhancements:
@map now supports explicit DSR arguments. DSR input arguments must be dsr_src1 and DSR output arguments must be dsr_dest. All DSR arguments should be loaded with the single_step property set. For example:

    param inDSR: dsr_src1;
    param outDSR: dsr_dest;

    task foo() void {
      // Compute the square-root of each element of `memDSD` and
      // send it out to `faboutDSD`.
      @load_to_dsr(inDSR, memDSD, .{.single_step = true});
      @load_to_dsr(outDSR, faboutDSD, .{.single_step = true});
      @map(math_lib.sqrt_f16, inDSR, outDSR);
    }

Introduces support for cb16 (cbfloat16) and bfloat16 (bfloat) 16-bit floating point types, and the associated @fp16() builtin. See @fp16 and Type System in CSL. cbfloat16 is a Cerebras-specific 16-bit floating point format with a 6-bit exponent and 9-bit explicit mantissa.
On WSE-3, introduces support for microthread priority via the .priority field in @get_dsd for fabin_dsd and fabout_dsd, and in @allocate_fifo. See Data Structure Descriptors.
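To make the cbfloat16 bit budget above concrete, the following Python sketch splits a 16-bit pattern into raw sign, exponent, and mantissa fields. The 6-bit exponent and 9-bit mantissa widths come from the note above. The sign | exponent | mantissa ordering is an assumption borrowed from the usual IEEE-style layout, the format labels `f16`, `bf16`, and `cb16` are just dictionary keys chosen here, and no exponent bias or special-value handling is modeled, since these notes do not specify them.

```python
from typing import Dict, Tuple

# (exponent bits, explicit mantissa bits); each format also has 1 sign bit.
# The cb16 widths come from the release note. f16 and bf16 are the usual
# IEEE half-precision and bfloat16 widths, listed only for comparison.
FORMATS: Dict[str, Tuple[int, int]] = {
    "f16": (5, 10),
    "bf16": (8, 7),
    "cb16": (6, 9),
}


def split_fields(bits: int, fmt: str) -> Tuple[int, int, int]:
    """Split a 16-bit pattern into raw (sign, exponent, mantissa) fields.

    Assumption: fields are laid out sign | exponent | mantissa from the
    most significant bit down. No bias or special values are decoded.
    """
    if not 0 <= bits <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned value")
    exp_bits, mant_bits = FORMATS[fmt]
    mantissa = bits & ((1 << mant_bits) - 1)
    exponent = (bits >> mant_bits) & ((1 << exp_bits) - 1)
    sign = bits >> (mant_bits + exp_bits)
    return sign, exponent, mantissa


if __name__ == "__main__":
    for name in FORMATS:
        print(name, split_fields(0x7FFF, name))
```

Compared with the other two widths, the 6/9 split sits between half precision (more mantissa, less range) and bfloat16 (more range, less mantissa).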
CSL library enhancements:
Introduces 3D FFT kernel library. See <fft>.
Introduces tile_config.input_queue_status and tile_config.output_queue_status to query input and output queue full/empty status registers. See input_queue_status and output_queue_status.
SdkRuntime host runtime enhancements:
Introduces the SdkRuntime direct link API functions send and receive, which are used to stream data into or out of the wafer via program input and output ports. This API can be used with SdkLayout as demonstrated in SdkLayout 4: Host-to-device and device-to-host data streaming. See SdkRuntime API Reference.
Example programs:
Introduces a series of example programs demonstrating the new SdkLayout API:
SdkLayout 1: Introduction introduces the SdkLayout API with a single-PE program.
SdkLayout 2: Basic routing demonstrates color routing with the SdkLayout API and automatic color allocation.
SdkLayout 3: Ports and connections demonstrates automatic routing between code regions.
SdkLayout 4: Host-to-device and device-to-host data streaming demonstrates the use of the SdkRuntime direct link API with SdkLayout to create host-to-device and device-to-host streams.
SdkLayout 5: Generalized matrix-vector multiplication (GEMV) implements a full GEMV program with the SdkLayout API.
Introduces an example using the 3D FFT kernel library. See 3D FFT.
Resolved issues¶
Fixes incorrect parsing of CSL if statements whose body is an assignment without braces (e.g. if (cond) lhs = rhs;).
On WSE-2, fixes a bug in which @set_color_config did not support all 6 available filters. Previously, only the first four were available.
Fixes a potential stall caused by sending many small data transfers via SdkRuntime.
Appliance mode compilation via SdkCompiler no longer allocates a system while compiling.
Appliance mode SDK jobs launched via SdkCompiler, SdkLauncher, or SdkRuntime now exit gracefully.
Known issues¶
The 25-pt-stencil, histogram-torus, and spmv-hypersparse benchmark examples are not supported on WSE-3.
Instruction traces in the SDK GUI are not supported on WSE-3.
The bandwidth of memory transfers saturates at around 8 IO channels.
Deprecations¶
In CSL, calling a task is now an error. Only functions may be called. Tasks must be activated.
In CSL, dereference or access of pointers into config space is now illegal. The @get_config and @set_config builtins should be used instead.
WSE-1 is no longer supported.
Version 1.3.0¶
Released 13 December 2024
New features and enhancements¶
CSL language and compiler enhancements:
For DSD definitions, a tensor access expression is now shorthand for a comptime_struct with extent, stride, and base_address fields. DSDs can now also be specified using these fields directly, for example:

    // These two definitions are equivalent:
    var my_dsd = @get_dsd(mem1d_dsd, .{ .extent = 10, .stride = 2, .base_address = &my_arr });
    var my_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{10} -> my_arr[2*i] });

stride is an optional parameter with default value 1. See tensor_access for more information.
Memory DSD properties can now take runtime values when using the individual field specification format. However, mem4d_dsd extent and stride must still be comptime known.
Introduces inline functions, which are expanded during semantic analysis. See Syntax of CSL for more information.
Introduces labeled break and the ability to break values from blocks. See Syntax of CSL for more information.
Improves performance of CSL’s parser, potentially improving program compile times.
Improves DSR allocation diagnostics when using DSDs. Upon failure to allocate, diagnostics now contain information about operations that prevent a DSR from being allocated.
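The equivalence between the field form (extent, stride, base_address) and the tensor access form of a 1D memory DSD, described above, can be checked with a small host-side model. This Python sketch is not CSL and does not use any SDK API. `dsd_indices` is a hypothetical helper, and `my_arr` is an ordinary list standing in for the device array.

```python
from typing import List


def dsd_indices(extent: int, stride: int = 1, base: int = 0) -> List[int]:
    """Element offsets visited by a 1D access with the given fields.

    Hypothetical model: `base` is an element offset standing in for
    base_address, and stride defaults to 1 as in the release note.
    """
    return [base + stride * i for i in range(extent)]


# Stand-in for the device array; 20 elements so that 2*i stays in range.
my_arr = list(range(100, 120))

# Field form: .extent = 10, .stride = 2, .base_address = &my_arr
by_fields = [my_arr[k] for k in dsd_indices(extent=10, stride=2)]

# Tensor access form: |i|{10} -> my_arr[2*i]
by_tensor_access = [my_arr[2 * i] for i in range(10)]

if __name__ == "__main__":
    print(by_fields == by_tensor_access)
```

Both forms visit the same ten elements, at offsets 0, 2, ..., 18.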
CSL library enhancements:
Introduces a <dsd_ops> library which provides wrappers around DSD op builtins that select an appropriate builtin depending on the underlying data types, enabling more concise and flexible code when supporting multiple data types. See <dsd_ops> for more information.
SdkRuntime host runtime enhancements:
Introduces a strided version of memcpy_h2d for strided host-to-device data transfers. See memcpy_h2d_stride in SdkRuntime API Reference.
Introduces row and column broadcast variants of memcpy_h2d for host-to-device row and column broadcasts. See memcpy_h2d_colbcast and memcpy_h2d_rowbcast in SdkRuntime API Reference. Also see the example program Host-to-Device Broadcast Test.
Example programs:
Introduces a new example program Host-to-Device Broadcast Test to demonstrate row and column broadcasts for host-to-device data transfers.
Resolved issues¶
Fixes an issue in the <message_passing> library where messages were limited to only 16 wavelets. The maximum message size is 32 wavelets.
Fixes bugs in the <control> library in which encode_payload() could index out of bounds, and did not set the NOCE bit on unused commands.
Fixes a bug in which sequential @map operations within a function would not be able to reuse DSRs.
Known issues¶
The 25-pt-stencil, histogram-torus, and spmv-hypersparse benchmark examples are not yet supported on WSE-3.
Instruction traces in the SDK GUI are not yet supported on WSE-3.
The bandwidth of memory transfers saturates at around 8 IO channels.
Version 1.2.0¶
Released 28 June 2024
Note
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.2 supports SDK 1.1. See here for SDK 1.1 documentation.
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.3 supports SDK 1.2, the current version of SDK software.
New features and enhancements¶
CSL language and compiler enhancements:
Introduces inline for-loops, which are unrolled at compile time. The body of an inline for-loop may assign to a comptime variable. For example:

    fn length(comptime array: anytype) comptime_int {
      comptime var result = 0;
      // This loop will be inlined.
      inline for (array) |v| {
        result += 1;
      }
      return result;
    }

Introduces the @queue_flush and @set_empty_queue_handler builtins for WSE-3. See @queue_flush.
Runtime on_control values in DSD operations are now supported. For example:

    fn f(out: fabout_dsd, in: fabin_dsd, act_id: local_task_id) void {
      @fmovh(out, in, .{ .async = true, .on_control = .{ .activate = act_id }});
    }

Improves void type semantics, enabling optionally specified module parameters and function arguments.
Significantly improves compile times for large programs. Compilation time for full-wafer programs may be improved as much as 10x.
CSL library enhancements:
Introduces a <simprint> library for runtime debug printing to the simulator log. See <simprint>.
Introduces a <control> library for creating control wavelet payloads. See <control>.
Introduces a <message_passing> library for WSE-3 point-to-point communication. See <message_passing>.
Introduces the queue_flush module within the <tile_config> library for WSE-3, which can be used for querying when a queue is flushed and to exit the flushed state. See queue_flush.
Adds WSE-3 support to the collectives_2d library.
SdkRuntime host runtime enhancements:
Adds WSE-3 support for memcpy streaming mode.
Example programs:
Reorganizes and updates all tutorial example programs with WSE-3 support.
Introduces two new tutorial examples for switches, demonstrating use of the <control> library. See Topic 6: Switches and Topic 7: Switches and Control Entrypoints.
Introduces a new tutorial example to demonstrate the <simprint> library. See Topic 13: Simprint Library.
Introduces a new tutorial example to demonstrate color swapping on WSE-2. See Topic 14: Color Swap.
Adds WSE-3 support to the wide-multiplication, residual, mandelbrot, gemv-collectives_2d, gemv-checkerboard-pattern, gemm-collectives_2d, 7pt-stencil-spmv, bicgstab, conjugateGradient, preconditionedConjugateGradient, and powerMethod benchmark example programs.
Resolved issues¶
Adds memcpy streaming support for WSE-3.
Adds WSE-3 support for the <collectives_2d> library.
Fixes a potential bug in the <collectives_2d> library related to reconfiguring the library’s colors.
Fixes a potential bug in the <memcpy> library related to reconfiguring the library’s colors.
Known issues¶
The 25-pt-stencil, histogram-torus, and spmv-hypersparse benchmark examples are not yet supported on WSE-3.
The SDK GUI is not yet supported on WSE-3.
The bandwidth of memory transfers saturates at around 8 IO channels.
Deprecations¶
The deprecated @get_color_id builtin to get the numerical value of a color is now removed. Use @get_int instead.
Use of @get_color on any ID other than a routable color ID is no longer supported.
tile_config.reg_ptr has been removed. Use @get_config and @set_config for direct manipulation of config space addresses.
Version 1.1.0¶
Released 10 April 2024
This version of the Cerebras SDK is the first with experimental support for the WSE-3, the third generation Cerebras architecture. The WSE-3 is the wafer-scale processor powering the CS-3 Cerebras system.
Note
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.0 supports SDK 0.9. See here for SDK 0.9 documentation.
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.1 supports SDK 1.0. See here for SDK 1.0 documentation.
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.2 supports SDK 1.1, the current version of SDK software.
New features and enhancements¶
CSL language and compiler enhancements:
Introduces initial support for WSE-3.
Introduces ut_id type and @get_ut_id builtin for representing microthread IDs. This feature is WSE-3 only.
Introduces runtime @get_config and @set_config support.
Introduces i64 and u64 types, and support in <math>, <debug>, and <malloc> libraries. Like i8 and u8, these types are not allowed in memory DSD tensors or @map, nor as arguments to tasks.
CSL memcpy library enhancements:
memcpy/get_params no longer requires specifying a LAUNCH color for host kernel launch support.
The @rpc builtin is no longer necessary for host kernel launch support. The RPC server is now created internally.
Other CSL library enhancements:
Introduces reset_tsc_counter() function in <time> library to clear timestamp counter.
enable_tsc() function in <time> library now automatically clears timestamp counter.
Introduces color_config and switch_config modules within <tile_config> library for target-independent runtime manipulation of color and switch configurations.
The <tally> library has been updated to support WSE-3. The library API has been updated to require specification of both input and output queues. On WSE-2, the two input queues can be the same as the output queues, but on WSE-3, they must be different. See <tally>.
Example programs:
GEMV tutorials 1 through 8 have been updated to support WSE-2 and WSE-3.
cholesky, FFT, bandwidth-test, and single-tile-matvec programs have been updated to support WSE-2 and WSE-3.
Introduces an example program to demonstrate WSE-3 features for separation of queue IDs from microthread IDs for asynchronous operations. See Topic 13: WSE-3 Microthreads.
Documentation improvements:
Introduces documentation on WSE-3-specific builtins (see Builtins for WSE-3).
Introduces documentation on microthread semantics for WSE-3 (see Microthread IDs).
Appliance mode enhancements:
Introduces a new SdkLauncher class which allows users to stage data onto the appliance before running, and run with the same host code Python script used when running with the Singularity container. This class is particularly useful when transferring large amounts of data onto and off of the CS system. See Running SDK on a Wafer-Scale Cluster.
Separates SDK appliance mode functionality into a cerebras.sdk Python module.
Deprecations¶
Deprecated function teardown.get_color() in <tile_config> library has been removed. Use teardown.get_task_id() instead.
Deprecated @bind_task builtin has been removed. Use @bind_control_task, @bind_data_task, or @bind_local_task instead.
Deprecated use of color in @activate, .activate, on-control .activate, FIFO .activate_push, and FIFO .activate_pop is now an error. Use local_task_id instead.
Use of integers as queue IDs is now an error. Use input_queue_id and output_queue_id types instead.
Resolved issues¶
Fixed bug in which an if expression assigned to a variable where both branches’ values are comptime known, but the condition is not, would crash the compiler.
Fixed bug where <time> library would occasionally incorrectly read the timestamp counter.
Fixed bug where DSD operations in which the first operand is a 32-bit scalar could crash at runtime.
Fixed bug where runtime-determined color, input_queue, or output_queue in @get_dsd config would crash the compiler.
Fixed bug where .input_queue DSD config field would allow output_queue values and vice versa.
The 1D FFT example program now compiles for Nz >= 256.
Known issues¶
WSE-3 support is currently experimental. Users may encounter bugs while running WSE-3 programs.
memcpy streaming mode is not yet supported on WSE-3.
The <collectives_2d> library is not yet supported on WSE-3.
Only GEMV tutorials 1 through 8 are currently supported on WSE-3.
The SDK GUI is not yet supported on WSE-3.
The bandwidth of memory transfers saturates at around 8 IO channels.
Notes for future releases¶
Use of @get_color on any ID other than a routable color ID will be removed in a future release.
Version 1.0.0¶
Released 13 November 2023
Note
The Cerebras Wafer-Scale Cluster appliance running Cerebras ML Software 2.0 supports SDK 0.9. For SDK 0.9 documentation, see here.
New features and enhancements¶
CSL language and compiler enhancements:
Introduces the data_task_id, local_task_id, and control_task_id types, to explicitly differentiate the three types of tasks. Values of these types are created via the new @get_data_task_id, @get_local_task_id, and @get_control_task_id builtins, respectively. @get_data_task_id generates a task ID from a routable color, while @get_local_task_id and @get_control_task_id generate task IDs from an integer within the range of allowed IDs. See Task Identifiers and Task Execution for more information on the new task type system.
Introduces the @bind_data_task, @bind_local_task, and @bind_control_task builtins for binding tasks to the corresponding task ID type. Data tasks must take either one or two arguments (corresponding to the contents of a wavelet’s payload), and local tasks must take no arguments.
Colors which are used by a fabin_dsd to receive data and are not explicitly bound to a task no longer need to be blocked at compile time. The initial state of a data_task_id not explicitly bound to a task is now blocked.
Introduces the @get_int builtin to return the numerical value of values of type data_task_id, control_task_id, local_task_id, color, input_queue, and output_queue, as well as values of any enum or integer type. @get_color_id is now deprecated.
@activate builtin and .activate field of builtins on DSDs now take values of type local_task_id as an argument. Using @activate or the .activate field on a value of type color is now deprecated.
.activate_pop and .activate_push fields of FIFOs now take values of type local_task_id as an argument. Using these fields on a value of type color is now deprecated.
@block and @unblock builtins and .unblock field of builtins on DSDs now take values of type local_task_id or data_task_id as arguments.
The @rpc builtin now takes values of type data_task_id. It no longer accepts values of type color.
Introduces the cslc compiler flag --warnings-as-errors, to treat compiler warnings as errors.
cslc compiler script which launches container to run the compiler now reads CSL_IMPORT_PATH environment variable to search additional paths for @import_module.
CSL memcpy library enhancements:
The memcpy library has been rewritten to use the new task ID types.
Other CSL library enhancements:
collectives_2d library has been rewritten to use the new task ID types.
SdkRuntime host runtime enhancements:
Introduces new functionality in the sdk_utils module to simplify data type transformations for memcpy_h2d() and memcpy_d2h() calls.
Introduces new functionality in the sdk_utils module to process elapsed timestamp data.
Introduces suppress_simfab_trace option in the SdkRuntime constructor to suppress generation of simfab_traces files when running.
Example programs:
Example programs have been reorganized, renumbered, and updated.
Introduces three new example programs in the GEMV series, demonstrating more complex communication patterns.
Introduces a series of pipelining example programs to demonstrate the use of memcpy streaming mode to create a computation pipeline on the WSE.
Documentation improvements:
Introduces new documentation on debugging CSL programs. See Debugging Guide.
Expands installation documentation to include Apptainer for running the SDK container. See Installation and Setup.
Appliance mode enhancements:
For Cerebras Wafer-Scale Clusters running Cerebras ML Software 2.1, the SdkCompiler::compile function now expects an artifact output path, and the function returns a compile artifact path instead of an artifact ID. The compile artifacts are now by default copied back to the user node when compilation finishes.
Deprecations¶
Support for CSELFRunner has now been fully removed. All programs should use the SdkRuntime host runtime.
The call() function in the SdkRuntime Python host API has been deprecated. Use launch() instead, which includes argument type checking.
cslc no longer accepts --channels=0 when compiling, as this setting corresponded to CSELFRunner memcpy support.
The @get_color_id and @bind_task builtins have been deprecated.
Using values of type color with the @activate builtin or the .activate, .activate_pop, and .activate_push fields has been deprecated.
The @rpc builtin no longer accepts values of type color. Values of type data_task_id must be used instead.
Known issues¶
The bandwidth of memory transfers saturates at around 8 IO channels.
When a DSD operation uses an explicit fabin DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via @initialize_queue. See pe.csl in 3D 7-Point Stencil SpMV for an example.
The 1D FFT example program may fail to compile if Nz >= 256, triggering an internal compiler exception.
Notes for future releases¶
Using the @bind_task builtin to bind a task to a color is now deprecated. This builtin will be removed in a future release. Use @bind_data_task for wavelet-triggered data tasks, @bind_local_task for self-activated tasks, and @bind_control_task for control wavelet-triggered tasks.
Using the @get_color_id builtin to get the numerical value of a color is now deprecated. This builtin will be removed in a future release. Use @get_int instead.
Using the @activate builtin on a color is now deprecated. The ability to do this will be removed in a future release.
Version 0.9.0¶
Released 2 October 2023
New features and enhancements¶
CSL language and compiler enhancements:
@get_tensor_ptr is now legal in code that contains no exported symbols, and will compile. If @get_tensor_ptr is executed at runtime when no symbols have been exported, then an assert(false) will be hit.
Introduces @has_exported_tensors builtin, which evaluates to true at comptime if the program contains any exported tensors.
Introduces extern keyword. The extern storage class declares that a symbol for a variable or function is expected to be defined in an export declaration elsewhere. See Storage Classes.
Introduces export keyword. The export storage class defines a variable or function with a certain name and type, and makes that variable or function available to other object files that are linked with the object being compiled. See Storage Classes.
Introduces linkname keyword, which can be used to specify the name of the ELF symbol corresponding to the variable. See Syntax of CSL.
Introduces support for function pointers. See Syntax of CSL.
Introduces new FIFO DSR types dsr_fifo_dest and dsr_fifo_src, which allow FIFOs to be used with explicit DSRs. See Data Structure Registers.
The bool type is no longer allowed with the @zeros builtin. @constants should be used instead to initialize an array with false.
Bitwise not operator ~ is no longer allowed on the bool type.
Logical not operator ! is no longer allowed on integer types.
Compiler diagnostics for circular dependencies have been improved.
CSL memcpy library enhancements:
The memcpy framework reserves two DSRs, dsr_dest 0 and dsr_src1 0, to enable improved performance and reduce resource usage. The user should avoid using these explicit DSRs.
The .data_type field is no longer needed when importing memcpy to support copy mode.
Other CSL library enhancements:
The collectives_2d library has been rewritten to use explicit DSRs, enabling improved performance and reducing resource usage. By default, the library uses dsr_dest, dsr_src0, and dsr_src1 IDs 1 and 2, for the X and Y dimensions, respectively, but can be configured to use other IDs when imported.
The input and output queue IDs of collectives_2d are also now configurable when imported. By default, the X dimension uses queues 2 and 4, and the Y dimension uses queues 3 and 5.
The tile_config library contains a new exceptions submodule, which can be used to unmask exceptions. See <tile_config>.
SdkRuntime host runtime additions:
Introduces an sdk_utils library which includes utility functions to prepare data sent with memcpy_h2d and process data received from memcpy_d2h. See SdkRuntime API Reference.
Example programs additions:
Adds SdkRuntime versions of gemv-checkerboard-pattern and gemv-collectives, which implement two different approaches for computing GEMV. See GEMV with Checkerboard Pattern and GEMV with Collective Communications.
Adds SdkRuntime version of cholesky, which computes the Cholesky decomposition of a symmetric positive-definite matrix. See Cholesky.
Adds additional SdkRuntime tutorial example programs, including demos of sparse tensor operations, switches, filters, FIFOs, and the @map builtin.
See the csl-examples GitHub repository for more example programs, including a 1D and 2D FFT, histogram-torus, mandelbrot, and wide-multiplication.
Documentation improvements:
Introduces additional documentation on the SdkRuntime Python host API, including the new sdk_utils library. See SdkRuntime API Reference.
Resolved issues¶
Fixes crash when compiling pointer to array of non-scalars.
Fixes crash when compiling pointer coercion from multidimensional array to 1D pointer of unknown size.
Fixes LLVM backend bug which previously produced incorrect addresses in certain circumstances, resulting in “Invalid address” errors in the simulator. This in particular could cause issues with the collectives_2d library.
Fixes behavior of CSL math library’s isSignaling(x) for checking if x is a signaling NaN.
Fixes a bug where programs using collectives_2d stall if the width or height of the core rectangle is greater than 160 PEs.
The simulator can now support programs with height greater than 256 PEs.
csdb has been fixed to correctly read core dumps from SDK programs.
Known issues¶
The Singularity image may fail to work on Debian-based Linux distributions. The image works best with a Fedora-based distribution such as Red Hat or Rocky.
The bandwidth of memory transfers saturates at around 8 IO channels.
When a DSD operation uses an explicit fabin DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via @initialize_queue. See pe.csl in 3D 7-Point Stencil SpMV for an example.
Notes for future releases¶
The CSELFRunner host runtime has been deprecated. It will be completely removed in a future release.
Version 0.8.0¶
Released 21 June 2023
New features and enhancements¶
Introduces support for Cerebras Wafer-Scale Clusters running in appliance mode. This support is limited to Python host code using the SdkRuntime host runtime, and only one SDK compile or execute job can be launched at a time, using no more than one Cerebras system. See Running SDK on a Wafer-Scale Cluster.
CSL language and compiler enhancements:
Introduces @get_output_queue builtin for creating output queue types. Using integers for output queue IDs is now deprecated and produces a warning.
Introduces additional improvements and enhancements to internal builtins for supporting remote procedure calls (RPCs).
Introduces improved error handling for type casts using the @as builtin.
@load_to_dsr now allows runtime-determined colors in the @activate and @unblock fields.
The grammar of @initialize_queue has been updated. Previously, initializing a queue with ID queue_id on color color_id took the form @initialize_queue(queue_id, color_id);. The new syntax is @initialize_queue(queue_id, .{.color = color_id});.
CSL memcpy library enhancements:
The memcpy library can now support multiple types in the same kernel. The user still needs to import memcpy.csl with the .data_type = field. The semantic meaning of .data_type is to enable copy mode for the host runtime.
SdkRuntime host runtime enhancements:
Introduces a debug_utils library which includes get_symbol, get_symbol_rect, and read_trace, providing parity with CSELFRunner’s debug support. Note that this library is available for simulator runs only.
Introduces a launch function, which features type checking and a variable number of arguments for kernel launches with the RPC mechanism. The legacy memcpy_launch function has been deprecated, and users should use launch instead.
memcpy_d2h and memcpy_h2d now feature dimension and data type checking for the host tensor.
The bandwidth of D2H transfers is greatly improved for systems running in weight streaming mode.
Benchmark programs additions:
Adds spmv-hypersparse to demonstrate a hypersparse matrix-vector multiplication.
Adds 7pt-stencil-spmv to demonstrate a sparse matrix-vector product using a matrix generated by a finite difference seven-point stencil. See 3D 7-Point Stencil SpMV.
Adds bicgstab, powerMethod, conjugateGradient, and preconditionedConjugateGradient to demonstrate iterative methods on a seven-point stencil. See BiCGSTAB, Power Method, Conjugate Gradient, and Preconditioned Conjugate Gradient.
Adds single-tile-matvec, which benchmarks the performance of single-PE matrix-vector products in terms of aggregate wafer memory bandwidth and FLOPS. See Single Tile Matvec.
Documentation improvements:
Introduces new tutorials for SdkRuntime built around computing a GEMV.
Introduces additional documentation on the SdkRuntime Python host API. See SdkRuntime API Reference.
Resolved issues¶
When using SdkRuntime, a nonblocking memcpy_d2h before stop() no longer triggers a segmentation fault.
Programs using SdkRuntime now load correctly in the SDK GUI.
Known issues¶
The bandwidth of memory transfers saturates at around 8 IO channels.
When a DSD operation uses an explicit fabin DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via @initialize_queue.
Notes for future releases¶
The CSELFRunner host runtime has been deprecated. It will be completely removed in a future release.
Version 0.7.0¶
Released 17 April 2023
New features and enhancements¶
CSL language and compiler enhancements:
Introduces @set_teardown_handler builtin which virtualizes the teardown task and allows for separate definitions of teardown operations for different colors.
Introduces @rpc builtin which automatically generates an RPC interpreter for exported functions. Used with the call host function added to SdkRuntime. Note that exported symbols may not have struct or enum types, and exported functions may have at most 15 parameters.
Introduces @get_input_queue builtin for creating input queue types. Using integers for input queue IDs is now deprecated and produces a warning.
Variables now have a linksection attribute. With the --link-section-address-bytes flag, this allows global variables to be placed at a specific address.
Introduces control_transform field for DSDs to transform the index portion of control wavelets.
Introduces @dfilt builtin which instructs an input queue to drop all data wavelets until a certain number of control wavelets are encountered.
DSD .activate field now allows a runtime-determined color value.
Deprecated color config syntax has been removed.
Compiler task table packing optimization increases performance of small tasks.
CSL library enhancements:
tile_config library introduces control_transform submodule to set mask when transforming index portion of control wavelets.
collectives_2d library now uses the virtualized teardown task, allowing for interoperability with programs that use memcpy and the SdkRuntime host runtime.
SdkRuntime host runtime enhancements:
SdkRuntime introduces a call function to greatly simplify kernel launches with the RPC mechanism. Functions exported in device code with the @rpc builtin are now directly host-callable.
memcpy library now supports 16-bit data for copy mode.
memcpy library now reserves color 27 to deliver better performance.
Both copy and streaming modes now support 16-bit data. Note that in streaming mode, the MemcpyDataType parameter in memcpy_h2d and memcpy_d2h host calls has no effect, and the user must handle the data appropriately in the receiving wavelet-triggered task.
The memcpy_h2d and memcpy_d2h host functions take an argument to specify the packing of the 3D input/output tensor into a 1D array, either row-major or column-major. The column-major option improves bandwidth of data transfers when the host data is packed in that order.
The memcpy_h2d and memcpy_d2h host functions have new function signatures to better handle the increased number of transfer type arguments. These are passed in a struct in the C++ interface, or as required kwargs in the Python interface. This release supports the following options:
DataType: (new option) 16-bit or 32-bit
Order: (new option) row-major or column-major
streaming: true or false
nonblock: true or false
The runtime can seamlessly aggregate consecutive nonblocking memcpy_h2d calls, improving the bandwidth of bursts of small transfers.
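The row-major and column-major packing options mentioned above differ only in which tensor index varies fastest in the flattened 1D array. The Python sketch below illustrates the generic convention with nested lists. It does not call the SDK, `pack_3d` is a hypothetical helper, and how the three tensor axes map onto the PE rectangle and per-PE elements is defined in the SdkRuntime API Reference, not here.

```python
from typing import List, Sequence


def pack_3d(tensor: Sequence[Sequence[Sequence[int]]], order: str = "row") -> List[int]:
    """Flatten tensor[i][j][k] into a 1D list.

    "row": last index k varies fastest (generic row-major convention).
    "col": first index i varies fastest (generic column-major convention).
    """
    d0, d1, d2 = len(tensor), len(tensor[0]), len(tensor[0][0])
    if order == "row":
        return [tensor[i][j][k]
                for i in range(d0) for j in range(d1) for k in range(d2)]
    if order == "col":
        return [tensor[i][j][k]
                for k in range(d2) for j in range(d1) for i in range(d0)]
    raise ValueError("order must be 'row' or 'col'")


# 2 x 2 x 2 example whose values encode their own indices as 100*i + 10*j + k.
example = [[[100 * i + 10 * j + k for k in range(2)]
            for j in range(2)]
           for i in range(2)]

if __name__ == "__main__":
    print(pack_3d(example, "row"))  # [0, 1, 10, 11, 100, 101, 110, 111]
    print(pack_3d(example, "col"))  # [0, 100, 10, 110, 1, 101, 11, 111]
```

If host data is already laid out in one of these orders, choosing the matching option avoids a reshuffle on the host before the transfer.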
Benchmark programs additions and enhancements:
Adds bandwidth-test to benchmark data transfer performance between host and device. See Bandwidth Test.
Adds a version of gemm-collectives_2d using SdkRuntime, which showcases the interoperability of the collectives_2d library with memcpy. See GEMM with Collective Operations.
Benchmark programs written with SdkRuntime and using the RPC mechanism to launch device kernels have been rewritten to use call in the host code and the @rpc builtin in the device code, greatly reducing the complexity of the programs.
Documentation improvements:
Example programs have been reorganized into CSELFRunner and SdkRuntime sections, to clearly differentiate programs by their host runtime.
Adds appendix to describe SIMD operations on DSDs. See SIMD Mode.
Adds five tutorial example programs using SdkRuntime, mirroring those written to use CSELFRunner.
Adds improved documentation on SdkRuntime and its host API.
Resolved issues¶
Runtime expressions with comptime-only types in comparisons no longer crash the compiler.
comptime switch expressions can now switch on comptime_int.
Binding more than one task to the same color now produces a compiler error.
Compiler now checks that dimensionality of a tensor access expression does not exceed max dimensionality of type.
Known issues¶
Programs using the SdkRuntime host runtime may fail to load in the sdk-gui when invoked with sdk_debug_shell visualize.
The bandwidth of D2H (device to host) memory transfers using memcpy is about 7x to 8x lower than that of H2D (host to device) transfers.
The bandwidth of memory transfers saturates at around 8 IO channels.
When a DSD operation uses an explicit fabin DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via @initialize_queue.
When using SdkRuntime, if the last call before stop() is a nonblocking memcpy_d2h, then stop() may trigger a segmentation fault.
Notes for future releases¶
The CSELFRunner runtime will be deprecated in a future release. Code should be ported to the SdkRuntime runtime.
Using integers for input queue IDs is now deprecated and will be removed in a future release.
Version 0.6.0¶
Released 22 December 2022
New features and enhancements¶
Compile times are improved due to enhanced caching support.
Introduces a new host-side runtime, SdkRuntime, with greatly improved host-to-device and device-to-host data transfer performance.
Supports host-to-device (H2D) copy to a device CSL variable address (memcpy_h2d), device-to-host (D2H) copy from a device CSL variable address (memcpy_d2h), and launch of CSL device kernels (memcpy_launch).
See Host Runtime and Tensor Streaming for more details. For examples using the new API, see Residual and 25-Point Stencil.
The legacy runtime, CSELFRunner, now supports host-to-device and device-to-host copy using the memcpy API.
CSL language enhancements:
Support for normal-mode FIFOs.
Introduces explicit DSRs, providing a more efficient way to execute DSD operations.
Initial RPC (remote procedure call) support, with a mechanism for host-device communication using shared symbols.
Additional support for DSD-to-scalar operations.
Support for setting task and microthread priority at comptime and runtime.
Improved assertion failure messages in @comptime_assert.
The .unblock DSD field can now be used at runtime and comptime.
CSL library enhancements:
Introduces collectives_2d library, which implements MPI-like communication primitives over rows or columns of PEs.
New generic API for math libraries.
Introduces directions library, which provides utility functions for manipulating directions.
Adds efficient implementations of sin_f16 and cos_f16.
Adds issignaling_f16 and issignaling_f32, which check for signalling NaN.
A new version of the memcpy library supports copies to/from address, and updates to support new runtime. See Residual and 25-Point Stencil examples.
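As an illustration of what a signalling-NaN check involves, the following is a minimal host-side sketch operating on a raw 16-bit pattern. It assumes the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits) and the IEEE 754-2008 convention that the top mantissa bit is the quiet bit. It is not the implementation of issignaling_f16, and issignaling_f16_bits is a hypothetical name.

```python
def issignaling_f16_bits(bits):
    """Return True if a 16-bit pattern encodes a signalling NaN in binary16.

    Assumed convention: a NaN has all exponent bits set and a nonzero
    mantissa; it is signalling when the top mantissa bit (quiet bit) is 0.
    """
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    quiet_bit = (mantissa >> 9) & 1
    return exponent == 0x1F and mantissa != 0 and quiet_bit == 0


for pattern in (0x7D00, 0x7E00, 0x7C00, 0x3C00):
    print(hex(pattern), issignaling_f16_bits(pattern))
```

Here 0x7E00 is a quiet NaN, 0x7C00 is positive infinity, and 0x3C00 is 1.0, so only 0x7D00 is reported as signalling.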
cs_readelf improvements:
Adds --visualize command line option for drawing ASCII art representation of PE populations. See --help information for details.
All addresses (both command line option inputs and printed outputs) are now in byte (8-bit) units instead of word (16-bit) units.
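Scripts that stored addresses in word units for earlier versions need to convert them. A minimal sketch of the arithmetic follows; the helper names are hypothetical and are not part of cs_readelf.

```python
WORD_BYTES = 2  # one 16-bit word is two 8-bit bytes


def word_to_byte_addr(word_addr):
    """Convert an address in 16-bit word units to byte units."""
    return word_addr * WORD_BYTES


def byte_to_word_addr(byte_addr):
    """Convert a byte address back to word units; reject unaligned input."""
    if byte_addr % WORD_BYTES != 0:
        raise ValueError("byte address is not word-aligned")
    return byte_addr // WORD_BYTES


print(hex(word_to_byte_addr(0x400)))
```

A word address of 0x400 in an older script corresponds to byte address 0x800.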
New benchmark programs:
Dense Cholesky decomposition.
Hadamard product, demonstrating selective batched execution mode.
GEMV with collective communications, demonstrating the collectives_2d library.
Documentation improvements:
Adds a new introductory tutorials section to provide step-by-step instruction for learning CSL. See Tutorials.
Adds new example demonstrating the use of the debug library for tracing values at runtime.
Adds sections on generics and DSRs. See Generics and Data Structure Registers.
Resolved issues¶
Relative paths are now handled correctly when importing code files as modules.
Known issues¶
The copy mode of memcpy only supports 32-bit data. To copy 16-bit data to the device, streaming mode must be used instead.
If there are two device-to-host (D2H) memcpy calls in a non-blocking sequence, and the first D2H is non-blocking, then the run can stall, especially when using back-to-back D2H calls. To avoid this risk, the user must use blocking D2H calls instead.
Notes for future releases¶
The CSELFRunner runtime will be deprecated in a future release. Code should be ported to the SdkRuntime runtime.
Version 0.5.1¶
Released 27 September 2022
New features and enhancements¶
An optional new implementation for tensor streaming is available. The new implementation is described in Host Runtime and Tensor Streaming, along with instructions for porting kernels to use the new implementation. Two new CSL code examples, Residual and 25-Point Stencil, are provided for reference.
The SDK GUI has introduced new features, detailed in SDK GUI. Major new features include:
Updated display of routing.
Addition of instruction tracing in the timeline.
CSL language enhancements:
Runtime support for named struct types.
switch support.
comptime and anytype function argument support.
comptime_string support.
Either color or task can now be used for DSD config operations.
CSL library enhancements:
Initial complex number support.
Runtime support for finding the position of the running PE within the rectangle.
Version 0.4.0¶
Released 29 April 2022
New features and enhancements¶
New CLI tool csdb introduced. csdb currently supports debugging on hardware and will eventually support simulation debugging.
New CLI tool cs_readelf introduced.
As of 0.3.1, the numbers in the ELF binary names do NOT correspond to PE coordinates.
To access prior versions of SDK documentation, please email developer@cerebras.net.
Known issues¶
In the SDK GUI timeline view, clicking multiple PEs on the grid in quick succession may result in a JSON error. To avoid this error, please wait for the timeline to load before clicking the next PE. If you see this error for a PE, click a different PE, allow the timeline to load, and then click the original PE again.
If you launch csdb and type ctrl+x, the container will lock up and prevent further action. If this happens, you must exit and re-launch your terminal session.
cslc --help returns the options for cslc-driver. The two tools are very similar, but not identical, so some of the options listed may not be available in cslc.
Notes for future releases¶
csdb CLIs will replace sdk_debug_shell CLIs in a future release. sdk_debug_shell will be deprecated.
Content under CSL Code Examples will be moved to the csl-examples GitHub repository in a future release. Please let us know if you need access to this repository by emailing developer@cerebras.net.
Version 0.3.1¶
Released 25 February 2022
New features and enhancements¶
Compile time is faster now due to caching improvements.
Support for FIFOs is added. See Data Structure Descriptors for documentation and @allocate_fifo in Builtins.
See Topic 9: FIFOs for an example showing how to use @allocate_fifo.
Support for switching and filtering is added. With this feature, you can specify the routing configuration for a specific color at a specific processing element (PE). This can be done in a layout block (@set_color_config) or in a processing element’s top-level comptime block (@set_local_color_config). See Builtins for documentation.
See Topic 6: Switches and Topic 8: Filters for examples.
Support for microthreads is added. See Data Structure Descriptors for documentation.
Library support is added. See Libraries for a full list of supported library functions.
Added the following built-ins. See Builtins for a full list of supported built-ins.
@set_dsd_base_addr
@random16
@is_same_type
@is_comptime
Compile time floating point constants are now automatically cast to the required type. So, instead of @as(f32, 1.0) (see Builtins) or @as(f16, 1.0), simply write 1.0.
Runtime floating point constants no longer default to type f16 but to comptime_float. If you want a runtime variable, you now need to explicitly specify the desired type of that variable. For example, instead of var x = 0.0; (wrong), write var x: f16 = 0.0;.
Adds support for setting the state of the pseudo-random number generator (PRNG).
Adds support for using general purpose registers (GPRs) as destination for DSD operations:
    var result: f16 = 1.0;
    const buffer = [3]f16 {100.0, 250.0, 349.0};

    task fooTask() void {
      const dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{3} -> buffer[i] });
      @faddh(&result, result, dsd);
    }
Asynchronous DSD operations must have at least one fabric DSD operand. Non-compliant code will now trigger an error message.
Adds support for the dot operator to access members of structs. Implemented for compile time only.
Colors can now be compared using == and != operators.
DSD operations, for example, add16, now support unsigned integer operands.
A new --verbose compiler flag shows progress.
Requirements and unsupported features¶
The SDK requires that the overlay filesystem functionality is available on your Linux system.
This SDK is supported only on Linux systems.
There are no guarantees for forward- or backward-compatibility for this release.
The SDK does not support running external Python scripts in the Singularity container.
The SDK only supports running the versions of packages provided in the Singularity container.
Resolved issues¶
Fixes a bug that prevented unit innermost dimension loops in mem4d_dsds.
Fixes a bug so that mem4d_dsds are now allowed to set the wavelet_index_offset bit.
Compile time and runtime semantics of set_dsd_base_addr (see Builtins) were different. This is fixed and now they are the same.
Known issues¶
When using the SDK GUI via the sdk_debug_shell visualize --artifact_dir command, if the artifacts in the artifact directory change, the SDK GUI will continue to show the old artifact data from a cache. To view the new artifacts, restart the SDK GUI by running the command sdk_debug_shell visualize --artifact_dir.
When you run the command sdk_debug_shell visualize --artifact_dir to invoke the SDK GUI, you will see the following error messages. These messages can be safely ignored.
    $ sdk_debug_shell visualize --artifact_dir /cb/cold/user1/sandbox/sdk_tool_rel-0.3.1/residual
    WARNING:cerebras.common.decorators:Call to deprecated function EnumFiles
    WARNING:root: . is not a valid workdir.
    ERROR:root:plan.meta not found in current directory or subdirectories.
    ERROR:root:No entries will be displayed.
    Click this link to open URL: http://user1:8000/?session_id=12b77f285e
    Click this link to open URL: http://172.xx.51.216:8000/?session_id=12b77f285e
    Press Ctrl-C to exit
    ERROR:root:Error reading A_1_1.elf
    ERROR:root:Error reading A_0_1.elf
    ERROR:root:Error reading A_1_0.elf
    ERROR:root:Error reading A_0_0.elf
The SDK GUI currently displays the color values only in the range of 0-14 inclusive.
Version 0.2.1¶
Released 5 November 2021
This release adds usability improvements and fixes bugs encountered in the 0.2.0 debug tool CLIs. This release also adds compatibility with the Cerebras R0.9 Software Release, so the CS system hardware does not require re-imaging in order to use the SDK.
This SDK is supported only on Linux systems.
There are no guarantees for forward- or backward-compatibility for this release.
The SDK requires that the overlay filesystem functionality is available on your Linux system.
The SDK only supports running the versions of packages provided in the Singularity container.
If the CSL compiler aborts with the LLVM error message "PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script." then do not report to llvm.org but instead report the problem to Cerebras.
The visualizer tool will not display single-ended routes, i.e., routes where PE A transmits to PE B, but PE B is missing a receiving route, and vice-versa.
CSELFRunner supports single-node host only.
We are no longer actively supporting or maintaining the CASM and Spoke workflow of version 0.1.x. Migration of code to CSL is needed.
The following examples in the cslang/benchmarks directory of the SDK can be run only in simulation, and not on the CS system:
cslang/benchmarks/FFT
cslang/benchmarks/wide-multiplication
The cslang/benchmarks/FFT example incorrectly states “SUCCESS” on test completion.
To run the CSL examples on the CS-1, you must manually emit a wavelet to terminate the runtime.
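Because the visualizer does not display single-ended routes, it can help to check for them separately. The sketch below shows the idea on a purely hypothetical data model: each route is a (source PE, destination PE) pair, listed once in the set of configured transmit routes and once in the set of configured receive routes. Neither the data model nor find_single_ended_routes comes from the SDK.

```python
def find_single_ended_routes(tx_routes, rx_routes):
    """Return routes configured on only one end.

    tx_routes: set of (src, dst) pairs for which src is set up to transmit.
    rx_routes: set of (src, dst) pairs for which dst is set up to receive.
    A route present in exactly one of the two sets is single-ended.
    """
    return sorted(tx_routes ^ rx_routes)


# PE coordinates are (x, y) tuples in this hypothetical example.
tx_routes = {((0, 0), (1, 0)), ((1, 0), (2, 0))}
rx_routes = {((0, 0), (1, 0)), ((3, 0), (2, 0))}

single_ended = find_single_ended_routes(tx_routes, rx_routes)
print(single_ended)
```

In this example the route from (1, 0) to (2, 0) has no receiver, and the receive route from (3, 0) to (2, 0) has no transmitter, so both are reported.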
Version 0.2.0¶
Released 12 October 2021
This SDK is supported only on Linux systems.
There are no guarantees for forward- or backward-compatibility for this release.
The SDK 0.2.0 requires that the overlay filesystem functionality is available on your Linux system.
Hardware support for SDK 0.2.0 is limited to the CS-1.
The SDK does not support running external Python scripts in the Singularity container.
The SDK only supports running the versions of packages provided in the Singularity container.
The SDK 0.2.0 image for the CS-1 is incompatible with the Cerebras Graph Compiler (CGC) 0.9.0 image. Hence, the SDK system image must be loaded in order to run CSL programs on the CS-1 system.
The visualizer tool will not display single-ended routes, i.e., routes where PE A transmits to PE B, but PE B is missing a receiving route, and vice-versa.
CSELFRunner supports single-node host only.
The following examples in the cslang/benchmarks directory of the SDK can be run only in simulation, and not on the CS system:
cslang/benchmarks/FFT.
cslang/benchmarks/wide-multiplication.
Pre-release Version 0.2.0¶
Released 27 August 2021
Initial availability of the Pre-release 0.2.0 of the SDK documentation.