.. _sdkruntime-api-reference:

SdkRuntime API Reference
========================

This section presents the ``SdkRuntime`` Python host API reference and
associated utilities to develop kernels for the Cerebras Wafer Scale Engine.


SdkRuntime
----------

.. py:module:: cerebras.sdk.runtime.sdkruntimepybind

Python API for ``SdkRuntime`` functions.

.. py:class:: SdkRuntime(bindir: Union[pathlib.Path, str], **kwargs)
   :module: cerebras.sdk.runtime.sdkruntimepybind

   Bases: :class:`object`

   Manages the execution of SDK programs on the Cerebras Wafer Scale Engine
   (WSE) or simfabric. The constructor analyzes the WSE ELFs in the ``bindir``
   and prepares the WSE or simfabric for a run.
   Requires CM IP address and port for WSE runs.

   :param bindir: Path to ELF files which is compiled by ``cslc``.
                  The runtime collects the I/O and fabric parameters
                  automatically, including height, width, number of channels,
                  width of buffers,... etc.
   :type bindir: ``Union[pathlib.Path, str]``

   :Keyword Arguments:
       * **cmaddr** (``str``) --
         ``'IP_ADDRESS:PORT'`` string of CM. Omit this ``kwarg`` to run on
         simfabric.

       * **suppress_simfab_trace** (``bool``) --
         If ``True``, suppresses generation of ``simfab_traces`` when running.
         Default value is ``False``, i.e., ``simfab_traces`` are produced.

       * **simfab_numthreads** (``int``) --
         Number of threads to use if running on simfabric.
         Maximum value is ``64``. Default value is ``5``, i.e., the simulator
         uses 5 threads.

       * **msg_level** (``str``) --
         Message logging output level. Available output levels are ``DEBUG``,
         ``INFO``, ``WARNING``, and ``ERROR``. Default value is ``WARNING``.

   **Example**:

      In the following example, an ``SdkRuntime`` runner object is
      instantiated. If ``args.cmaddr`` is non-empty, then the kernel code will
      run on the WSE pointed to by that address; otherwise, the kernel code
      will run on simfabric. The compiled kernel code in the directory
      ``args.name`` has exported symbols ``A`` and ``B`` pointing to arrays on
      the device. After loading the code and starting the run with ``load()``
      and ``run()``, data on the host stored in ``data`` is copied to ``A``
      on the device, and then ``B`` on the device is copied back into ``data``
      on the host.

      .. code-block:: python

         runner = SdkRuntime(args.name, cmaddr=args.cmaddr)
         symbol_A = runner.get_id("A")
         symbol_B = runner.get_id("B")
         runner.load()
         runner.run()
         runner.memcpy_h2d(symbol_A, data, px, py, w, h, l,
                           streaming=False, data_type=memcpy_dtype,
                           order=memcpy_order, nonblock=False)
         runner.memcpy_d2h(data, symbol_B, px, py, w, h, l,
                           streaming=False, data_type=memcpy_dtype,
                           order=memcpy_order, nonblock=False)


   .. py:method:: coord_logical_to_physical(logical_coords: (int, int))
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Convert a logical coordinate to a physical coordinate.
      For a program with fabric offsets (``offset_x``, ``offset_y``),
      and program rectangle coordinate (``x``, ``y``), this function
      returns (``offset_x + x``, ``offset_y + y``).

      :param logical_coords: Tuple containing logical coordinates.
      :type symbol: ``(int, int)``

      :returns:

          **physical_coords** (``(int, int)``) --
          Tuple containing physical coordinates.


   .. py:method:: dump_core(corefile: str)
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Dump the core of a simulator run, to be used
      for debugging with ``csdb``. Note that the specified name of
      the corefile MUST be "corefile.cs1" to use with ``csdb``, and
      this method can only be called after calling ``stop()``.

      :param corefile: Name of corefile. Must be "corefile.cs1" to use with ``csdb``.
      :type symbol: ``str``


   .. py:method:: get_id(symbol: str)
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Retrieve the integer representation of an exported symbol which is
      exported in the kernel. Possible symbols include a data tensor or a
      host-callable function.

      :param symbol: The exported name of the symbol.
      :type symbol: ``str``


   .. py:method:: is_task_done(task_handle: Task) -> bool
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Query if task ``task_handle`` is complete

      :param task_handle: Handle to a task previously launched by
                          ``SdkRuntime``.
      :type task_handle: ``Task``

      :returns:

          **task_done** (``bool``) --
          ``True`` if task is done, and ``False`` otherwise.


   .. py:method:: launch(symbol: str, *args, **kwargs) -> Task
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Trigger a host-callable function defined in the kernel, with type
      checking for arguments.

      :param symbol: The exported name of the symbol corresponding to a
             host-callable function.
      :type symbol: ``str``

      :Positional Arguments:
          Matches the arguments of the host-callable function. ``launch``
          will perform type checking on the arguments.

      :Keyword Arguments:
          **nonblock** (``bool``) --
          Nonblocking if ``True``, blocking otherwise.

      :returns:

          **task_handle** (``Task``) --
          Handle to the task launched by ``launch``.

      **Example**:

         Consider a kernel which defines a host-callable function ``fn_foo``
         by:

         .. code-block:: csl

            comptime {
              @export_symbol(fn_foo);
            }

         The host calls ``fn_foo`` by
         ``runner.launch("fn_foo", nonblock=False)``.


   .. py:method:: load()
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Load the binaries to simfabric or WSE. It may takes 80+ seconds to load
      the binaries onto the WSE.


   .. py:method:: memcpy_d2h(dest: numpy.ndarray, src: int, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) -> Task
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Receive a host tensor to the device via either copy mode or streaming
      mode. The data is distributed into the region of interest (ROI) which is
      a bounding box starting at coordinate ``(px, py)`` with width ``w`` and
      height ``h``.

      :param dest: A 3-D host tensor ``A[h][w][l]``, wrapped in a 1-D array
             according to keyword argument ``order``.
      :type dest: ``numpy.ndarray``
      :param src: A user-defined color if keyword argument ``streaming=True``,
             symbol of a device tensor otherwise.
      :type src: ``int``
      :param px: ``x``-coordinate of start point of the ROI.
      :type px: ``int``
      :param py: ``y``-coordinate of start point of the ROI.
      :type py: ``int``
      :param w: Width of the ROI.
      :type w: ``int``
      :param h: Height of the ROI.
      :type h: ``int``
      :param elem_per_pe: Number of elements per PE.
             The data type of an element is 16-bit and 32-bit only.
             If the tensor has ``k`` elements per PE, ``elt_per_pe`` is ``k``
             even if the data type is 16-bit.
             If the data type is 16-bit, the user has to extend the tensor to
             a 32-bit one, with zero filled in the higher 16 bits.
      :type elem_per_pe: ``int``

      :Keyword Arguments:
          * **streaming** (``bool``) --
            Streaming mode if ``True``, copy mode otherwise.
          * **data_type** (``MemcpyDataType``) --
            32-bit if ``MemcpyDataType.MEMCPY_32BIT`` or 16-bit if
            ``MemcpyDataType.MEMCPY_16BIT``.
            Note that this argument has no effect if ``streaming`` is ``True``,
            and the user must handle the data appropriately in the receiving
            wavelet-triggered task.
            Additionally, the underlying type of the tensor ``dest`` must be
            32-bit. The tensor must be extended to a 32-bit one with zero
            filled in the higher 16 bits.
          * **order** (``MemcpyOrder``) --
            Row-major if ``MemcpyOrder.ROW_MAJOR`` or column-major if
            ``MemcpyOrder.COL_MAJOR``.
          * **nonblock** (``bool``) --
            Nonblocking if ``True``, blocking otherwise.

      :returns:

          **task_handle** (``Task``) --
          Handle to the task launched by ``memcpy_d2h``.


   .. py:method:: memcpy_h2d(dest: int, src: numpy.ndarray, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) -> Task
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Send a host tensor to the device via either copy mode or streaming mode.
      The data is distributed into the region of interest (ROI) which is a
      bounding box starting at coordinate ``(px, py)`` with width ``w`` and
      height ``h``.

      :param dest: A user-defined color if keyword argument ``streaming=True``,
             symbol of a device tensor otherwise.
      :type dest: ``int``
      :param src: A 3-D host tensor ``A[h][w][l]``, wrapped in a 1-D array
             according to parameter ``order``.
      :type src: ``numpy.ndarray``
      :param px: ``x``-coordinate of start point of the ROI.
      :type px: ``int``
      :param py: ``y``-coordinate of start point of the ROI.
      :type py: ``int``
      :param w: Width of the ROI.
      :type w: ``int``
      :param h: Height of the ROI.
      :type h: ``int``
      :param elem_per_pe: Number of elements per PE.
             The data type of an element is 16-bit and 32-bit only.
             If the tensor has ``k`` elements per PE, ``elt_per_pe`` is ``k``
             even if the data type is 16-bit.
             If the data type is 16-bit, the user has to extend the tensor to
             a 32-bit one, with zero filled in the higher 16 bits.
      :type elem_per_pe: ``int``

      :Keyword Arguments:
          * **streaming** (``bool``) --
            Streaming mode if ``True``, copy mode otherwise.
          * **data_type** (``MemcpyDataType``) --
            32-bit if ``MemcpyDataType.MEMCPY_32BIT`` or 16-bit if
            ``MemcpyDataType.MEMCPY_16BIT``.
            Note that this argument has no effect if ``streaming`` is ``True``,
            and the user must handle the data appropriately in the receiving
            wavelet-triggered task.
            Additionally, the underlying type of the tensor ``src`` must be
            32-bit. The tensor must be extended to a 32-bit one with zero
            filled in the higher 16 bits.
          * **order** (``MemcpyOrder``) --
            Row-major if ``MemcpyOrder.ROW_MAJOR`` or column-major if
            ``MemcpyOrder.COL_MAJOR``.
          * **nonblock** (``bool``) --
            Nonblocking if ``True``, blocking otherwise.

      :returns:

          * **task_handle** (``Task``) --
            Handle to the task launched by ``memcpy_h2d``.


   .. py:method:: run()
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Start the simfabric or WSE run and wait for commands from the host
      runtime.


   .. py:method:: stop()
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Wait for all pending commands (data transfers and kernel function calls)
      to complete and then stop simfabric or WSE. After this call is complete,
      no new commands will be accepted for this ``SdkRuntime`` object.

      ``stop`` must be called to end a program. Otherwise, the runtime will
      emit an error.


   .. py:method:: task_wait(task_handle: Task)
      :module: cerebras.sdk.runtime.sdkruntimepybind

      Wait for the task ``task_handle`` to complete.

      :param task_handle: Handle to a task previously launched by
                          ``SdkRuntime``.
      :type task_handle: ``Task``


.. py:class:: MemcpyDataType
   :module: cerebras.sdk.runtime.sdkruntimepybind

   Bases: :class:`Enum`

   Specifies the data size for transfers using ``memcpy_d2h`` and
   ``memcpy_h2d`` copy mode.

   :Values:
       * **MEMCPY_16BIT**
       * **MEMCPY_32BIT**

.. py:class:: MemcpyOrder
   :module: cerebras.sdk.runtime.sdkruntimepybind

   Bases: :class:`Enum`

   Specifies mapping of data for transfers using ``memcpy_d2h`` and
   ``memcpy_h2d``.

   :Values:
       * **ROW_MAJOR**
       * **COL_MAJOR**

.. py:class:: Task
   :module: cerebras.sdk.runtime.sdkruntimepybind

   Handle to a task launched by ``SdkRuntime``.


.. _sdkruntime-sdk-utils:

sdk_utils
---------

Utility functions for common operations with ``SdkRuntime``.

.. py:module:: cerebras.sdk.sdk_utils

.. py:function:: calculate_cycles(timestamp_buf: numpy.ndarray) -> numpy.int64:
   :module: cerebras.sdk.sdk_utils

   Converts values in ``timestamp_buf`` returned from device into a human-readable
   elapsed cycle count.

   :param timestamp_buf: array returned from device containing elapsed timestamp data
   :type timestamp_buf: ``numpy.ndarray``

   :returns:

       **elapsed_cycles** (``numpy.int64``) --
       Elapsed cycle count.

   **Example**:

      Consider the following CSL snippet which records timestamps and produces a single
      array to copy back to the host, to generate an elapsed cycle count:

      .. code-block:: csl

         // import time module and create timestamp buffers
         const timestamp = @import_module("<time>");
         var tsc_end_buf = @zeros([timestamp.tsc_size_words]u16);
         var tsc_start_buf = @zeros([timestamp.tsc_size_words]u16);

         // create elapsed timer buffer and advertise to host
         var timer_buf = @zeros([3]f32);
         var ptr_timer_buf: [*]f32 = &timer_buf;

         timestamp.enable_tsc();
         // record starting timestamp
         timestamp.get_timestamp(&tsc_start_buf);

         // perform some operation for which you want to calculate elapsed cycles

         // record ending timestamp
         timestamp.get_timestamp(&tsc_end_buf);
         timestamp.disable_tsc();

         var lo_: u16 = 0;
         var hi_: u16 = 0;
         var word: u32 = 0;

         lo_ = tsc_start_buf[0];
         hi_ = tsc_start_buf[1];
         timer_buf[0] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

         lo_ = tsc_start_buf[2];
         hi_ = tsc_end_buf[0];
         timer_buf[1] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

         lo_ = tsc_end_buf[1];
         hi_ = tsc_end_buf[2];
         timer_buf[2] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );

      Then the elapsed cycles can be calculated on the host with:

      .. code-block:: python

         # Get symbol for timer_buf on device
         symbol_timer_buf = runner.get_id("timer_buf")

         # Copy back timer_buf from all width x height PEs
         data = np.zeros((width*height*3, 1), dtype=np.uint32)
         runner.memcpy_d2h(data, symbol_timer_buf, 0, 0, width, height, 3, streaming=False,
           data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
         elapsed_time_hwl = data.view(np.float32).reshape((height, width, 3))

         # Print elapsed cycles for each PE
         for pe_x in range(width):
           for pe_y in range(height):
             cycle_cnt = sdk_utils.calculate_cycles(elapsed_time_hwl[pe_y, pe_x, :])
             print("Elapsed cycles on PE ", pe_x, ", ", pe_y, ": ", cycle_cnt)


.. py:function:: input_array_to_u32(arr: numpy.ndarray, sentinel: Optional[int], fast_dim_sz: int) -> numpy.ndarray
   :module: cerebras.sdk.sdk_utils

   Converts a 16-bit tensor to a 32-bit tensor of type ``u32`` for use with ``memcpy``.
   The parameter ``sentinel`` distiguishes two different extensions of 16-bit data.
   If ``sentinel`` is ``None``, zero-pad the upper 16 bits.
   If ``sentinel`` is not ``None``, pack the index of the innermost dimension of the array
   into the upper 16-bits.

   :param arr: A numpy array with 2 or 4 bytes per element.
   :type arr: ``numpy.ndarray``
   :param sentinel: For 16-bit input data, if this parameter is not ``None``,
                    pack the index of the innermost dimension into
                    the high bits of the 32-bit wavelet.
                    If sentinel is None, then the high bits are zeros.
   :type sentinel: ``Optional[int]``
   :param fast_dim_sz: If ``sentinel`` is not ``None``, specifies size of fastest-changing
                       dimension for generating the index.
   :type fast_dim_sz: ``int``

   :returns:

       **output_view** (``numpy.ndarray.view``) --
       Numpy view into ``arr`` with specified numpy data type.


.. py:function:: memcpy_view(arr: numpy.ndarray, datatype: numpy.dtype) -> numpy.ndarray.view
   :module: cerebras.sdk.sdk_utils

   Returns a 32, 16 or 8 bit view of a 32 bit numpy array
   (only the lower 16 or 8 bits of each 32 bit word in the
   last two cases).

   :param arr: A numpy array with 4 bytes per element on which the numpy view will be created.
   :type arr: ``numpy.ndarray``
   :param datatype: The numpy data type which should be used in the output view.
                    The itemsize must be 1, 2, or 4 bytes.
   :type datatype: ``numpy.dtype``

   :returns:

       **output_view** (``numpy.ndarray.view``) --
       Numpy view into ``arr`` with specified numpy data type.

   **Example**:

     ``memcpy_view`` simplifies the use of various precision data types when copying
     between host and device. Consider the following Python host code which creates
     a ``float16`` view into a numpy array. Note that this array *must* be 32-bit.
     The user can fill the array with ``float16`` data,
     and copy it to an array on the device with CSL data type ``f16``.

     .. code-block:: python

         x_symbol = runner.get_symbol('x')
         # This container array must be 32-bit
         x_container = np.zeros(N, dtype=np.uint32)

         x = sdk_utils.memcpy_view(x_container, np.float16)
         x.fill(0.5)

         runner.memcpy_h2d(x_symbol, x_container, 0, 0, 1, 1, N,
                     streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT,
                     order=MemcpyOrder.ROW_MAJOR, nonblock=False)


debug_util
----------

Utilities for parsing debug output and core files of a simulator run.

.. py:module:: cerebras.sdk.debug.debug_util

.. py:class:: debug_util(bindir: Union[pathlib.Path, str])
   :module: cerebras.sdk.debug.debug_util

   Bases: :class:`object`

   Loads ELF files  in ``bindir`` in order to dump symbols for debugging.

   The user does not need to export the symbols in the kernel. ``debug_util``
   dumps the core and looks for the symbols in the ELFs. If the symbol at
   ``Px.y`` is not found in the corresponding ELF, ``debug_util``
   emits an error.

   The most common errors are either: 1) a wrong coordinate passed in
   ``get_symbol()``, or 2) a correct coordinate, but the symbol has been
   removed due to compiler optimization. One can use ``readelf`` to check if
   the symbol exists or not. If not, the user can export the symbol in the
   kernel to keep the symbol in the ELF.

   The functionality of this class is only supported in the simulator.

   **Example**:

      .. code-block:: python

          from cerebras.sdk.debug.debug_util import debug_util

          # run the app
          # dirname is the path to ELFs
          simulator = SdkRuntime(dirname)
          simulator.load()
          simulator.run()
          ...
          simulator.stop()

          # retrieve symbols after the run
          debug_mod = debug_util(dirname)
          # assume the core rectangle starts at P4.1, the dimension is
          # width-by-height and we want to retrieve the symbol y for every PE
          core_offset_x = 4
          core_offset_y = 1
          for py in range(height):
            for px in range(width):
              t = debug_mod.get_symbol(core_offset_x+px, core_offset_y+py, 'y', np.float32)
              print(f"At (py, px) = {py, px}, symbol y = {t}")


   .. py:method:: get_symbol(col: int, row: int, symbol: str, dtype: numpy.dtype) -> numpy.ndarray

      Read the value of ``symbol`` of given type at given PE coordinates.
      Note that each call to this function scans the whole fabric, so prefer
      ``get_symbol_rect`` over calling this in a loop.

      :param px: ``x``-coordinate of the PE, indexed from the northwest corner
                 of the entire fabric (NOT the program rectangle)
      :type px: ``int``
      :param py: ``y``-coordinate of the PE, indexed from the northwest corner
                 of the entire fabric (NOT the program rectangle)
      :type py: ``int``
      :param symbol: Name of the symbol to be read.
      :type symbol: ``str``
      :param dtype: Numpy data type of values contained by symbol.
      :type dtype: ``numpy.dtype``

      :returns:

          **output_arr** (``numpy.ndarray``) --
          Numpy array of output values read at symbol.


   .. py:method:: get_symbol_rect(rectangle: Rectangle, symbol: str, dtype: numpy.dtype) -> numpy.ndarray

      Read the value of ``symbol`` of given type for a rectangle of PEs.

     :param rectangle: Rectangle specified as ``((col, row), (width, height))``,
                       indexed from the northwest corner of the entire fabric
                       (NOT the program rectangle)
     :type rectangle: ``Rectangle``
     :param symbol: Name of the symbol to be read.
     :type symbol: ``str``
     :param dtype: Numpy data type of values contained by symbol.
     :type dtype: ``numpy.dtype``

     :returns:

         **output_arr** (``numpy.ndarray``) --
         Numpy array of output values read at symbol.
         The first two dimensions of the returned array are PE coordinates
         ``(column, row)`` relative to the rectangle.


   .. py:method:: read_trace(px: int, py: int, name: str) -> list

      Parse a CSL trace buffer with name ``name`` at the given PE coordinates.

      :param px: ``x``-coordinate of the PE, indexed from the northwest corner
                 of the entire fabric (NOT the program rectangle)
      :type px: ``int``
      :param py: ``y``-coordinate of the PE, indexed from the northwest corner
                 of the entire fabric (NOT the program rectangle)
      :type py: ``int``
      :param name: Name of the trace buffer to be read.
      :type name: ``str``

      :returns:

          **trace_output** (``list``) --
          Heterogenous list of trace values.

      **Example**:

         Consider a device kernel which initializes a trace buffer with the CSL
         ``debug`` library and uses it to record values:

         .. code-block:: csl

            const debug_mod = @import_module("<debug>", .{.key = "my_trace", .buffer_size = 100});

            fn foo() void {
              debug_mod.trace_timestamp();
              debug_mod.trace_string("Bar");
              debug_mod.trace_i16(1);
            }

         Then the trace can be read in the host code with:

         .. code-block:: python

            trace_output = debug_mod.read_trace(4, 1, 'my_trace')
            print(trace_output)

         If ``foo`` was executed only once, then ``trace_output`` will be
         a heterogenous list containing a timestamp, the string "Bar",
         and the number 1.