Hypersparse SpMV

This example evaluates the performance of sparse matrix-vector multiplication (spmv). The kernel records the start and end of spmv with the timestamp counter (tsc). The tsc counters of the PEs are not synchronized at startup, so raw timestamps from different PEs cannot be compared directly. To remove this timing variation among the PEs, f_sync() synchronizes all PEs and samples the reference clock.

The kernel kernel.csl defines several host-callable functions, including f_sync(), f_tic() and f_toc(), which synchronize the PEs and record the timing of spmv.

The kernel allreduce2R1E/pe.csl performs a reduction over the whole rectangle to synchronize the PEs; the bottom-right PE then sends a signal to the other PEs to sample the reference clock. allreduce2R1E is a variant of the allreduce in stencil-3d-7pts: the former uses 2 routable colors and 1 entrypoint, while the latter uses 1 routable color and 4 entrypoints. allreduce2R1E is designed for the spmv kernel, which has only three unused colors.
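The host-side use of that signal can be sketched in plain Python. Because the bottom-right PE signals the others, run.py subtracts the propagation delay (h-1) - py + (w-1) - px from the reference clock sampled by the PE at column px, row py. The function name and the sample values below are hypothetical; only the delay formula comes from run.py.

```python
def adjust_reference_clock(time_ref, height, width):
    """Subtract the signal's hop count from each PE's sampled reference clock.

    time_ref[py][px] is the raw tsc value sampled by the PE at column px, row py.
    """
    adjusted = []
    for py in range(height):
        row = []
        for px in range(width):
            # distance from the bottom-right PE, which sends the signal
            hops = (height - 1 - py) + (width - 1 - px)
            row.append(time_ref[py][px] - hops)
        adjusted.append(row)
    return adjusted


# Hypothetical 2-by-3 rectangle: PEs farther from the bottom-right corner
# sample later, so their raw values are larger by the hop count.
raw = [[1003, 1002, 1001],
       [1002, 1001, 1000]]
adjusted = adjust_reference_clock(raw, height=2, width=3)
print(adjusted)
```

After the adjustment, every entry refers to the same instant, which is what allows time_start and time_end from different PEs to be compared.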

The kernel hypersparse_spmv/pe.csl performs a matrix-vector product (spmv) where the matrix A is hypersparse and partitioned into a 2D grid. The input vector x and the output vector y are also distributed over the 2D grid.
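A small pure-Python sketch of how x is laid out on the grid, following the comments in run.py: x is first split across PE columns, then each column's share is split across the PEs of that column, and gaps are padded with ones. The function names are hypothetical, and this sketch uses lists rather than the NumPy arrays that run.py uses.

```python
import math


def local_sizes(nrows, ncols, np_rows, np_cols):
    # x: split across PE columns first, then across the PEs of a column
    local_vec_sz = math.ceil(math.ceil(ncols / np_cols) / np_rows)
    # y: split across PE rows first, then across the PEs of a row
    local_out_vec_sz = math.ceil(math.ceil(nrows / np_rows) / np_cols)
    return local_vec_sz, local_out_vec_sz


def distribute_x(x, np_rows, np_cols, pad=1.0):
    """Return {(row, col): local slice of x}, padding short slices with `pad`."""
    per_col = math.ceil(len(x) / np_cols)
    per_pe = math.ceil(per_col / np_rows)
    grid = {}
    for col in range(np_cols):
        chunk = list(x[col * per_col:(col + 1) * per_col])
        # pad the column's share so that it divides evenly among its PEs
        chunk += [pad] * (per_pe * np_rows - len(chunk))
        for row in range(np_rows):
            grid[(row, col)] = chunk[row * per_pe:(row + 1) * per_pe]
    return grid


sizes = local_sizes(nrows=13, ncols=13, np_rows=2, np_cols=2)
grid = distribute_x(list(range(1, 14)), np_rows=2, np_cols=2)
print(sizes)
print(grid)
```

For a 13-element vector on a 2-by-2 rectangle, each PE holds 4 elements, and the PEs in the bottom row hold one and two padding entries respectively.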

The user has to provide the matrix A in Matrix Market File format with 1-based indices. To obtain the best performance, the user may need to reorder the matrix such that the variation of the number of nonzeros among the partitions is small. One option is util/analyze.cpp, which provides a load balancing algorithm.
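One rough way to check the balance before running is to count the nonzeros that fall into each partition. The sketch below is not the algorithm in util/analyze.cpp or preprocess.py. It assumes a general (non-symmetric) coordinate-format file and a contiguous block partition with ceil(nrows / np_rows) rows and ceil(ncols / np_cols) columns per block. The function name and the sample matrix are hypothetical.

```python
import io
import math


def nnz_per_partition(mtx_text, np_rows, np_cols):
    """Count nonzeros per block for a coordinate-format Matrix Market string."""
    # skip the banner and comment lines, which start with '%'
    lines = (ln for ln in io.StringIO(mtx_text) if not ln.startswith("%"))
    nrows, ncols, _nnz = (int(tok) for tok in next(lines).split())
    rows_per_pe = math.ceil(nrows / np_rows)
    cols_per_pe = math.ceil(ncols / np_cols)
    counts = [[0] * np_cols for _ in range(np_rows)]
    for ln in lines:
        if not ln.strip():
            continue
        i, j = (int(tok) for tok in ln.split()[:2])
        # Matrix Market indices are 1-based
        counts[(i - 1) // rows_per_pe][(j - 1) // cols_per_pe] += 1
    return counts


MTX = """%%MatrixMarket matrix coordinate real general
4 4 5
1 1 1.0
2 2 1.0
3 3 1.0
4 4 1.0
1 4 2.0
"""
counts = nnz_per_partition(MTX, np_rows=2, np_cols=2)
flat = [c for row in counts for c in row]
print(counts, "max:", max(flat), "min:", min(flat))
```

A large gap between the maximum and minimum counts suggests that reordering the matrix may help.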

The script run.py has the following parameters:

--infile_mtx=<path to mtx file> contains the sparse matrix A
--num_pe_rows=<int> specifies the height of the core rectangle
--num_pe_cols=<int> specifies the width of the core rectangle
--channels=<int> specifies the number of I/O channels, no bigger than 16.

tic() samples "time_start" and toc() samples "time_end". sync() samples "time_ref", which is used to shift "time_start" and "time_end" so that timestamps from different PEs are comparable. The elapsed time (unit: cycles) is measured by cycles_send = max(time_end) - min(time_start)
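As a minimal sketch of this measurement: each timestamp arrives on the host as three 16-bit words, which run.py combines into a 48-bit value with make_u48 before shifting by the reference clock. The helper elapsed_cycles and the sample values are hypothetical, and the reference values are assumed to be already adjusted for propagation delay.

```python
def make_u48(words):
    # combine three 16-bit words (least significant first) into a 48-bit integer
    return words[0] + (words[1] << 16) + (words[2] << 32)


def elapsed_cycles(start_words, end_words, refs):
    """cycles_send = max(time_end) - min(time_start) after shifting by time_ref."""
    starts = [make_u48(s) - r for s, r in zip(start_words, refs)]
    ends = [make_u48(e) - r for e, r in zip(end_words, refs)]
    return max(ends) - min(starts)


# Two hypothetical PEs whose counters are far apart in absolute value
start_words = [[100, 0, 0], [0, 1, 0]]    # 100 and 65536
end_words = [[600, 0, 0], [700, 1, 0]]    # 600 and 66236
refs = [50, 65500]
cycles_send = elapsed_cycles(start_words, end_words, refs)
print(cycles_send)
```

Without the shift by time_ref, the difference between the two PEs' raw counters would dominate the result.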

The overall runtime (us) is computed via the formula time_send = (cycles_send / 0.85) * 1.e-3 us, where a PE runs with a clock speed of 850 MHz

The bandwidth is calculated by bandwidth = ((2*nnz+m)*4)/time_send, where A is m-by-n with nnz nonzeros and each f32 memory access moves 4 bytes
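A worked instance of the two formulas, as a sketch with made-up numbers. The function name and the clock_ghz parameter are hypothetical; 0.85 corresponds to the 850 MHz clock stated in run.py. Since the result is bytes per microsecond, it is numerically equal to MB/s.

```python
def spmv_bandwidth(cycles_send, nnz, m, clock_ghz=0.85):
    # one cycle lasts (1 / clock_ghz) ns, which is (1 / clock_ghz) * 1e-3 us
    time_send_us = (cycles_send / clock_ghz) * 1.0e-3
    # read Aij (nnz) + read xj (nnz) + write y[i] (m), 4 bytes per f32
    bytes_moved = (2 * nnz + m) * 4
    return time_send_us, bytes_moved / time_send_us


time_us, bandwidth = spmv_bandwidth(cycles_send=8500, nnz=1000, m=100)
print(f"time_send = {time_us:.3f} us, bandwidth = {bandwidth:.1f} MB/s")
```

With 8500 cycles, 1000 nonzeros and 100 rows, the kernel moves 8400 bytes in about 10 us, giving roughly 840 MB/s.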

See the SDK examples repository or the release tarball for additional supporting data preparation scripts.

layout.csl

// color var           color  var           color  var                color  var
//   0                 10  init             20   tx_east              30 reserved (memcpy)
//   1  c0             11  compute_north    21   reserved (memcpy)    31 reserved
//   2  c1             12  compute_south    22   reserved (memcpy)    32
//   3  c2             13  tx_north         23   reserved (memcpy)    33 reserved (memcpy)
//   4  c3             14  tx_south         24   compute_local        34 reserved (memcpy)
//   5  c4             15  rx_north         25   curr_rx_north_done   35 reserved (memcpy)
//   6  c5             16  rx_south         26   curr_rx_south_done   36 reserved (memcpy)
//   7  allreduce_c0   17  rx_east          27   reserved (memcpy)    37 reserved (memcpy)
//   8  allreduce_c1   18  rx_west          28   reserved (memcpy)
//   9  allreduce_EN1  19  tx_west          29   reserved (memcpy)

// routable colors for spmv
param c0 = @get_color(1);
param c1 = @get_color(2);
param c2 = @get_color(3);
param c3 = @get_color(4);
param c4 = @get_color(5);
param c5 = @get_color(6);

// routable colors for allreduce
param allreduce_c0 = @get_color(7);
param allreduce_c1 = @get_color(8);
// entrypoint for allreduce
param allreduce_EN1: local_task_id = @get_local_task_id(9);

// entrypoints for spmv
param EN1: local_task_id = @get_local_task_id(10);
param EN2: local_task_id = @get_local_task_id(11);
param EN3: local_task_id = @get_local_task_id(12);
param EN4: local_task_id = @get_local_task_id(13);
param EN5: local_task_id = @get_local_task_id(14);
param EN6: local_task_id = @get_local_task_id(15);
param EN7: local_task_id = @get_local_task_id(16);
param EN8: local_task_id = @get_local_task_id(17);
param EN9: local_task_id = @get_local_task_id(18);
param EN10: local_task_id = @get_local_task_id(19);
param EN11: local_task_id = @get_local_task_id(20);
param EN12: local_task_id = @get_local_task_id(24);
param EN13: local_task_id = @get_local_task_id(25);
param EN14: local_task_id = @get_local_task_id(26);


// parameters of spmv layout
//          pcols
//       +----------+
// prows |  core    |
//       +----------+
//
param prows: u16;   // number of PE rows (height of the core rectangle)
param pcols: u16;   // number of PE cols (width of the core rectangle)

// structure of the matrix
param nrows: u32;   // total number of matrix rows
param ncols: u32;   // total number of matrix cols
param max_local_nnz: u16;       // max of the local number of nonzeros (among all PEs)
param max_local_nnz_cols: u16;  // max of the local nnz cols
param max_local_nnz_rows: u16;  // max of the local nnz rows
param local_vec_sz: u16;        // size of local vector
param local_out_vec_sz: u16;    // size of local output vector
param y_pad_start_row_idx: u16; // local row index where padding starts


const spmv = @import_module( "hypersparse_spmv/layout.csl", .{
    .colors = [6]color{c0, c1, c2, c3, c4, c5},
    .entrypoints = [14]local_task_id{EN1, EN2, EN3, EN4, EN5, EN6, EN7, EN8, EN9, EN10, EN11, EN12, EN13, EN14},
    .width = pcols,
    .height = prows
    });

const reduce = @import_module( "allreduce2R1E/layout.csl", .{
    .colors = [2]color{allreduce_c0, allreduce_c1},
    .entrypoints = [1]local_task_id{allreduce_EN1},
    .width = pcols,
    .height = prows
    });

const memcpy = @import_module( "<memcpy/get_params>", .{
    .width = pcols,
    .height = prows,
    });

layout {
    // NOTE: This scheme assumes prows >= 4
    @comptime_assert(prows >= 4);

    //         --> px = pcol_id
    //          pcols
    //       +----------+
    // prows |  core    |  | py = prow_id
    //       |          |  V
    //       +----------+
    @set_rectangle(pcols, prows);

    var pcol_id: u16 = 0;
    while (pcol_id < pcols) : (pcol_id += 1) {

        var prow_id: u16 = 0;
        while (prow_id < prows) : (prow_id += 1) {

            const memcpyParams = memcpy.get_params(pcol_id);
            const spmvParams = spmv.get_params(pcol_id, prow_id);
            const reduceParams = reduce.get_params(pcol_id, prow_id);
            var params: comptime_struct = .{
                .memcpyParams = memcpyParams,
                .spmvParams = spmvParams,
                .reduceParams = reduceParams,
                .nrows = nrows,
                .ncols = ncols,
                .local_vec_sz = local_vec_sz,
                .max_local_nnz = max_local_nnz,
                .max_local_nnz_cols = max_local_nnz_cols,
                .max_local_nnz_rows = max_local_nnz_rows,
                .local_out_vec_sz = local_out_vec_sz,
                .y_pad_start_row_idx = y_pad_start_row_idx,
            };
            @set_tile_code(pcol_id, prow_id, "kernel.csl", params);

        } // while prow_id
    } // while pcol_id

    @export_name("mat_vals_buf", [*]f32, true);
    @export_name("x_tx_buf", [*]f32, true);
    @export_name("y_local_buf", [*]f32, true);

    @export_name("mat_rows_buf", [*]u16, true);
    @export_name("mat_col_idx_buf", [*]u16, true);
    @export_name("mat_col_loc_buf", [*]u16, true);
    @export_name("mat_col_len_buf", [*]u16, true);
    @export_name("y_rows_init_buf", [*]u16, true);

    @export_name("local_nnz", [*]u16, true);
    @export_name("local_nnz_cols", [*]u16, true);
    @export_name("local_nnz_rows", [*]u16, true);

    @export_name("time_buf_u16", [*]u16, true);

    @export_name("time_ref_u16", [*]u16, true);

    @export_name("f_enable_tsc", fn()void);
    @export_name("f_tic", fn()void);
    @export_name("f_toc", fn()void);
    @export_name("f_spmv", fn()void);
    @export_name("f_memcpy_timestamps", fn()void);
    @export_name("f_sync", fn(i16)void);
    @export_name("f_reference_timestamps", fn()void);
}

kernel.csl

param memcpyParams: comptime_struct;

param spmvParams: comptime_struct;

param reduceParams: comptime_struct;

// parameters
param nrows: u32;   // total number of matrix rows
param ncols: u32;   // total number of matrix cols (= nrows)
param max_local_nnz: u16;       // max of the local number of nonzeros (among all PEs)
param max_local_nnz_cols: u16;  // max of the local nnz cols
param max_local_nnz_rows: u16;  // max of the local nnz rows
param local_vec_sz: u16;    // size of local vector
param local_out_vec_sz: u16;    // size of local output vector
param y_pad_start_row_idx: u16;   // local row index where padding starts

// data buffers
// input matrix
var mat_vals_buf = @zeros([max_local_nnz]f32);      // in matrix values (sparse): 4B
// input vector: for north-going and south-going trains
// buffer storing data for tx
var x_tx_buf = @zeros([local_vec_sz]f32);       // in vector values (dense): 4B

var mat_rows_buf = @zeros([max_local_nnz]u16);      // in matrix relative row offsets: 2B
                                                // need this in preprocessing: 2B
var mat_col_idx_buf = @zeros([max_local_nnz_cols]u16);   // column idx of nnz cols (max possible size is nnz)
var mat_col_loc_buf = @zeros([max_local_nnz_cols]u16);   // col location in mat_vals_buf and mat_rows_buf (max nnz)
var mat_col_len_buf = @zeros([max_local_nnz_cols]u16);   // col length (nnz rows in a col)
// precomputed output vector (sparse format) local rows index information
var y_rows_init_buf = @zeros([max_local_nnz_rows]u16);       // init -- this should not be modified

var local_nnz = @zeros([1]u16);         // actual local number of nonzeros
var local_nnz_cols = @zeros([1]u16);    // actual local number of nnz cols
var local_nnz_rows = @zeros([1]u16);    // actual local number of nnz rows

// final reduced local output vector (dense)
var y_local_buf = @zeros([local_out_vec_sz]f32);

// temporary buffer for allreduce
var dot = @zeros([1]f32);

const timestamp = @import_module("<time>");

const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);

// input_queues cannot overlap with output_queues
const spmv_mod = @import_module( "hypersparse_spmv/pe.csl", @concat_structs(spmvParams, .{
     .f_callback = sys_mod.unblock_cmd_stream,

     .nrows = nrows,
     .ncols = ncols,
     .local_vec_sz = local_vec_sz,
     .max_local_nnz = max_local_nnz,
     .max_local_nnz_cols = max_local_nnz_cols,
     .max_local_nnz_rows = max_local_nnz_rows,
     .local_out_vec_sz = local_out_vec_sz,
     .y_pad_start_row_idx = y_pad_start_row_idx,

     .mat_vals_buf = &mat_vals_buf,
     .mat_rows_buf = &mat_rows_buf,
     .mat_col_idx_buf = &mat_col_idx_buf,
     .mat_col_loc_buf = &mat_col_loc_buf,
     .mat_col_len_buf = &mat_col_len_buf,
     .y_rows_init_buf = &y_rows_init_buf,
     .local_nnz = &local_nnz,
     .local_nnz_cols = &local_nnz_cols,
     .local_nnz_rows = &local_nnz_rows,

     .input_queues=[4]u16{4, 1, 6, 7},
     .output_queues=[2]u16{2,3},
     .dest_dsr_ids = [6]u16{1, 4, 5, 6, 2, 3},
     .src1_dsr_ids = [6]u16{4, 1, 6, 7, 2, 3},
     }));

// allreduce uses input queue/output queue 5
// dest_dsr and src0_dsr must be a valid pair, for example (7,1) is invalid
const reduce_mod = @import_module( "allreduce2R1E/pe.csl", @concat_structs(reduceParams, .{
     .f_callback = sys_mod.unblock_cmd_stream,
     .MAX_ZDIM = 1,
     .queues = [1]u16{5},
     .dest_dsr_ids = [1]u16{7},
     .src0_dsr_ids = [1]u16{7},
     .src1_dsr_ids = [1]u16{5}
     }));

// tsc library
var tsc_start_buffer = @zeros([timestamp.tsc_size_words]u16);
var tsc_end_buffer = @zeros([timestamp.tsc_size_words]u16);

// time_buf_u16[0:5] = {tsc_start_buffer, tsc_end_buffer}
var time_buf_u16 = @zeros([timestamp.tsc_size_words*2]u16);
var ptr_time_buf_u16: [*]u16 = &time_buf_u16;

// reference clock inside allreduce module
var time_ref_u16 = @zeros([timestamp.tsc_size_words]u16);
var ptr_time_ref_u16: [*]u16 = &time_ref_u16;

var ptr_mat_vals_buf: [*]f32 = &mat_vals_buf;
var ptr_x_tx_buf: [*]f32 = &x_tx_buf;
var ptr_y_local_buf: [*]f32 = &y_local_buf;
var ptr_mat_rows_buf: [*]u16 = &mat_rows_buf;
var ptr_mat_col_idx_buf: [*]u16 = &mat_col_idx_buf;
var ptr_mat_col_loc_buf: [*]u16 = &mat_col_loc_buf;
var ptr_mat_col_len_buf: [*]u16 = &mat_col_len_buf;
var ptr_y_rows_init_buf: [*]u16 = &y_rows_init_buf;
var ptr_local_nnz: [*]u16 = &local_nnz;
var ptr_local_nnz_cols: [*]u16 = &local_nnz_cols;
var ptr_local_nnz_rows: [*]u16 = &local_nnz_rows;


fn f_enable_tsc() void {
    timestamp.enable_tsc();

    // the user must unblock cmd color for every PE
    sys_mod.unblock_cmd_stream();
}

fn f_tic() void {
    timestamp.get_timestamp(&tsc_start_buffer);

    // the user must unblock cmd color for every PE
    sys_mod.unblock_cmd_stream();
}

fn f_toc() void {
    timestamp.get_timestamp(&tsc_end_buffer);

    // the user must unblock cmd color for every PE
    sys_mod.unblock_cmd_stream();
}

// compute y = A*x
//
// To ping-pong the spmv by
//    spmv(x, y) // y = A*x
//    spmv(y, x) // x = A*y
// we need to make sure local_vec_sz = local_out_vec_sz, otherwise compilation fails
// because of mismatch of the dimensions
//
fn f_spmv() void {
    spmv_mod.spmv(&x_tx_buf, &y_local_buf);
}

fn f_memcpy_timestamps() void {

    time_buf_u16[0] = tsc_start_buffer[0];
    time_buf_u16[1] = tsc_start_buffer[1];
    time_buf_u16[2] = tsc_start_buffer[2];
    time_buf_u16[3] = tsc_end_buffer[0];
    time_buf_u16[4] = tsc_end_buffer[1];
    time_buf_u16[5] = tsc_end_buffer[2];

    // the user must unblock cmd color for every PE
    sys_mod.unblock_cmd_stream();
}

fn f_sync( n: i16 ) void {
   reduce_mod.allreduce(n, &dot);
}

fn f_reference_timestamps() void {

    time_ref_u16[0] = reduce_mod.tscRefBuffer[0];
    time_ref_u16[1] = reduce_mod.tscRefBuffer[1];
    time_ref_u16[2] = reduce_mod.tscRefBuffer[2];

    // the user must unblock cmd color for every PE
    sys_mod.unblock_cmd_stream();
}

comptime{

    @export_symbol(ptr_mat_vals_buf, "mat_vals_buf");
    @export_symbol(ptr_x_tx_buf, "x_tx_buf");
    @export_symbol(ptr_y_local_buf, "y_local_buf");

    @export_symbol(ptr_mat_rows_buf, "mat_rows_buf");
    @export_symbol(ptr_mat_col_idx_buf, "mat_col_idx_buf");
    @export_symbol(ptr_mat_col_loc_buf, "mat_col_loc_buf");
    @export_symbol(ptr_mat_col_len_buf, "mat_col_len_buf");
    @export_symbol(ptr_y_rows_init_buf, "y_rows_init_buf");

    @export_symbol(ptr_local_nnz, "local_nnz");
    @export_symbol(ptr_local_nnz_cols, "local_nnz_cols");
    @export_symbol(ptr_local_nnz_rows, "local_nnz_rows");

    @export_symbol(ptr_time_buf_u16, "time_buf_u16");

    @export_symbol(ptr_time_ref_u16, "time_ref_u16");
}


comptime{
    @export_symbol(f_enable_tsc);
    @export_symbol(f_tic);
    @export_symbol(f_toc);
    @export_symbol(f_memcpy_timestamps);
    @export_symbol(f_spmv);
    @export_symbol(f_sync);
    @export_symbol(f_reference_timestamps);
}

run.py

#!/usr/bin/env cs_python
# pylint: disable=too-many-function-args
""" test sparse matrix-vector multiplication

  This example aims at a hypersparse matrix with almost uniform distribution.
  The algorithm partitions the sparse matrix into 2D grids. The algorithm may
  fail if there exists one partition which has too many nonzeros to fit the
  memory capacity (48KB) of the PE.

  To obtain the best performance, the user may need to reorder the matrix such
  that the variation of the nonzeros of each partition is small.

  To run this example, the user has to provide a file of Matrix Market File
  format with 1-based index. For example, the user can reorder the matrix A by
  the permutation matrices P and Q, and writes P*A*Q^T to a file. One option is
  "util/analyze.cpp" which provides a load balancing algorithm.

  This example reads a MTX file, generates the vector x, partitions the matrix,
  and computes y = A*x.

  The framework is
  ---
       sync()  // synchronize all PEs to sample the reference clock
       tic()   // record start time
       spmv()  // compute y = A*x
       toc()   // record end time
  ---

  The tic() samples "time_start" and toc() samples "time_end". The sync() samples
  "time_ref" which is used to shift "time_start" and "time_end".
  The elapsed time is measured by
       cycles_send = max(time_end) - min(time_start)

  The overall runtime is computed via the following formula
       time_send = (cycles_send / 0.85) *1.e-3 us
  where a PE runs with clock speed 850MHz

  The spmv kernel performs y = A * x
  where A is m-by-n with nnz nonzeros

  The standard measurement counts the number of memory accesses of
       y[i] = sum{ Aij * xj : Aij is nonzero }
  - read Aij: nnz
  - read xj: nnz
  - write y[i]: m
  Total number of memory accesses: (2*nnz + m) f32

  Here is the list of parameters:
    --infile_mtx=<path to mtx file> contains the sparse matrix A
    --num_pe_rows=<int> specifies the height of the core rectangle
    --num_pe_cols=<int> specifies the width of the core rectangle
    --channels=<int> specifies the number of I/O channels, no bigger than 16

  How to compile and run
     To build a 5-by-4 core rectangle, we need to pass --num_pe_cols=5 --num_pe_rows=4
     Use the following command to compile
        python run.py --arch=wse2 --num_pe_cols=5 --num_pe_rows=4 --channels=1
           --driver=<path to cslc> --compile-only --infile_mtx=<path to mtx file>
     Use the following command to run
        python run.py --arch=wse2 --num_pe_cols=5 --num_pe_rows=4 --channels=1
           --is_weight_one --run-only --infile_mtx=<path to mtx file>
"""

import math
import shutil
import subprocess
import time
from pathlib import Path
from typing import Optional

import pandas as pd
import numpy as np
from cmd_parser import parse_args
from memory_usage import memory_per_pe
from preprocess import preprocess
from scipy import sparse
from scipy.io import mmread

from cerebras.sdk.runtime.sdkruntimepybind import (  # pylint: disable=no-name-in-module
    MemcpyDataType, MemcpyOrder, SdkRuntime,
)

# from cerebras.sdk.debug.debug_util import debug_util


def make_u48(words):
  return words[0] + (words[1] << 16) + (words[2] << 32)


def hwl_to_oned_colmajor(height: int, width: int, pe_length: int, A_hwl: np.ndarray, dtype):
  """
    Given a 3-D tensor A[height][width][pe_length], transform it to
    1D array by column-major
    """
  if A_hwl.dtype == np.float32:
    A_1d = np.zeros(height * width * pe_length, dtype)
    idx = 0
    for l in range(pe_length):
      for w in range(width):
        for h in range(height):
          A_1d[idx] = A_hwl[(h, w, l)]
          idx = idx + 1
  elif A_hwl.dtype == np.uint16:
    assert dtype == np.uint32, "only support dtype = u32 if A is u16"
    A_1d = np.zeros(height * width * pe_length, dtype)
    idx = 0
    for l in range(pe_length):
      for w in range(width):
        for h in range(height):
          x = A_hwl[(h, w, l)]
          # x can be (np.float16, np.int16, np.uint16)
          # convert x to u16
          z = x.view(np.uint16)
          # zero extension of u16
          A_1d[idx] = np.uint32(z)
          idx = idx + 1
  else:
    raise RuntimeError(f"{A_hwl.dtype} is not supported")

  return A_1d


def oned_to_hwl_colmajor(height: int, width: int, pe_length: int, A_1d: np.ndarray, dtype):
  """
    Given a 1-D tensor A_1d[height*width*pe_length], transform it to
    3-D tensor A[height][width][pe_length] by column-major
    """
  if dtype == np.float32:
    # only support f32 to f32
    assert A_1d.dtype == np.float32, "only support f32 to f32"
    A_hwl = np.reshape(A_1d, (height, width, pe_length), order="F")

  elif dtype == np.uint16:
    # only support u32 to u16 by dropping upper 16-bit
    assert A_1d.dtype == np.uint32, "only support u32 to u16"
    A_hwl = np.zeros((height, width, pe_length), dtype)
    idx = 0
    for l in range(pe_length):
      for w in range(width):
        for h in range(height):
          x = A_1d[idx]
          x = x & 0x0000FFFF  # drop upper 16-bit
          A_hwl[(h, w, l)] = np.uint16(x)
          idx = idx + 1
  else:
    raise RuntimeError(f"{dtype} is not supported")

  return A_hwl


def read_input_vector(IS_INVEC_1, vec_len):
  if IS_INVEC_1:
    return np.ones(vec_len).astype(np.float32)

  np.random.seed(0)
  return np.random.rand(vec_len).astype(np.float32)


# x is distributed into the core rectangle by the following steps
# step 1: distribute x into columns
#    vec_len_per_pe_col = ceil(vec_len / np_cols)
# step 2: distribute the column into PEs
#    vec_len_per_pe = ceil(vec_len_per_pe_col / np_rows)
#
# For example, if core rectangle is 2-by-2 and vec_len is 13
#    Each column has vec_len_per_pe_col = ceil(13/2) = 7
#    The size of result is 7*2 = 14 which is bigger than vec_len due to padding
#    Each PE has vec_len_per_pe = ceil(7/2) = 4
#
# If x is {1,2,3,4,5,6,7,8,9,10,11,12,13}, the core has
#          PE.x=0      PE.x=1
#    +-------------+-------------+
#    | {1,2,3,4}   | {8,9,10,11} | PE.y=0
#    +-------------+-------------+
#    | {5,6,7,x}   | {12,13,x,x} | PE.y=1
#    +-------------+-------------+
# column 0 has 7 elements, {1,2,3,4,5,6,7}
# column 1 has 6 elements, {8,9,10,11,12,13}
#
# The symbol x is DON'T CARE
#
def dist_x_to_hwl(ncols, x, local_vec_sz, np_cols, np_rows):
  # core rectangle is np_cols-by-np_rows
  #            np_cols
  #         +----------+
  # np_rows |  core    |
  #         +----------+
  # input vector is distributed into columns, then distributed into rows

  vec_len = ncols
  vec_len_per_pe_col = math.ceil(vec_len / np_cols)
  vec_len_per_pe = math.ceil(vec_len_per_pe_col / np_rows)
  assert vec_len_per_pe == local_vec_sz

  pad_len_per_pe_col = (vec_len_per_pe * np_rows) - vec_len_per_pe_col

  pad_len = (vec_len_per_pe_col * np_cols) - vec_len
  # invec = [x, ones(pad_len)]
  invec = np.copy(x)
  ## BIG NOTE: Since this is input vector, padding needs to be 1s
  if pad_len > 0:
    invec = np.append(invec, np.ones(pad_len))

  x_hwl = np.zeros((np_rows, np_cols, vec_len_per_pe), x.dtype)
  ## now this is equally divided into np_cols
  for col in range(np_cols):
    ## get the slice for this col and append padding
    invec_col = invec[col * vec_len_per_pe_col:(col + 1) * vec_len_per_pe_col]
    if pad_len_per_pe_col > 0:
      invec_col = np.append(invec_col, np.ones(pad_len_per_pe_col)).astype(x.dtype)
    ## now this is equally divided into np_rows
    for row in range(np_rows):
      ## get the slice for this row
      data = invec_col[row * vec_len_per_pe:(row + 1) * vec_len_per_pe]
      x_hwl[(row, col)] = data

  return x_hwl


# The dimension of out_vec is h-by-w-by-l
# h = np_rows is the height of the core
# w = np_cols is the width of the core
# l = local_out_vec_sz is the size of local vector
#
# The out_vec_sz is the length of y = A*x
#
# y is distributed into the core rectangle by the following steps
# step 1: distribute y into rows
#    vec_len_per_pe_row = math.ceil(out_vec_sz / np_rows)
# step 2: distribute the row into PEs
#    vec_len_per_pe = math.ceil(vec_len_per_pe_row / np_cols)
#
# If out_vec_sz is smaller than (vec_len_per_pe_row*np_rows), padding is added
#
# The function unpad_3d_to_1d returns a result of size (vec_len_per_pe_row*np_rows)
#
# For example, if core rectangle is 2-by-2 and out_vec_sz is 13
#    Each row has vec_len_per_pe_row = ceil(13/2) = 7
#    The size of result is 7*2 = 14 which is bigger than out_vec_sz due to padding
#    Each PE has vec_len_per_pe = ceil(7/2) = 4
#
# If y is {1,2,3,4,5,6,7,8,9,10,11,12,13}, the core has
#          PE.x=0      PE.x=1
#    +-------------+-------------+
#    | {1,2,3,4}   | {5,6,7,x}   | PE.y=0
#    +-------------+-------------+
#    | {8,9,10,11} | {12,13,x,x} | PE.y=1
#    +-------------+-------------+
# row 0 has 7 elements, {1,2,3,4,5,6,7}
# row 1 has 6 elements, {8,9,10,11,12,13}
#
# The symbol x is DON'T CARE
#
def unpad_3d_to_1d(out_vec_sz, out_vec):
  assert out_vec.ndim == 3, "y must be a 3-d tensor of the form h-by-w-by-l"
  (height, width, local_out_vec_sz) = out_vec.shape
  # core rectangle is np_cols-by-np_rows
  #            np_cols
  #         +----------+
  # np_rows |  core    |
  #         +----------+
  np_rows = height
  np_cols = width

  vec_len_per_pe_row = math.ceil(out_vec_sz / np_rows)
  vec_len_per_pe = math.ceil(vec_len_per_pe_row / np_cols)
  # check if local_out_vec_sz = math.ceil(math.ceil(out_vec_sz / np_rows) / np_cols)
  assert vec_len_per_pe == local_out_vec_sz

  # result includes the padding
  #    y = result[0:out_vec_sz]
  # clear result to avoid bogus value outside the range [0, out_vec_sz)
  result = np.zeros(vec_len_per_pe_row * np_rows, dtype=np.float32)
  # tmp_buf gathers the data of one whole row of PEs, including the padding
  tmp_buf = np.empty(vec_len_per_pe * np_cols, dtype=np.float32)
  for row in range(np_rows):
    low_idx = row * vec_len_per_pe_row
    high_idx = low_idx + vec_len_per_pe_row
    # gather data into tmp_buf
    for col in range(np_cols):
      start = col * vec_len_per_pe
      end = start + vec_len_per_pe
      tmp_buf[start:end] = out_vec[(row, col)]
    result[low_idx:high_idx] = tmp_buf[0:vec_len_per_pe_row]
  return result


def verify_result(ref, res):
  print("Comparing result with reference...")
  abs_diff = np.sum(abs(ref - res))
  abs_rel = abs_diff / len(ref)
  print(f"reference[{len(ref)}]: \n{ref}")
  print(f"result   [{len(res)}]: \n{res}")
  print(f"[[ Absolute diff: {abs_diff} ]]")
  print(f"[[ Average diff : {abs_rel} ]]")
  atol = 1e-8
  rtol = 1e-5
  is_correct = np.allclose(ref, res, rtol, atol)
  result = "PASS" if is_correct else "FAIL"
  print(f"[[ Result within tolerance (atol={atol}, rtol={rtol}): {result} ]]")
  if not is_correct:
    unequal = ~np.isclose(ref, res)
    unequal_idx = list(np.where(unequal))
    mismatches = list(zip(ref[tuple(unequal_idx)], res[tuple(unequal_idx)]))
    df = pd.DataFrame(mismatches, columns=["reference", "result"], index=unequal_idx)
    print(f"{df}")


# y = A*x
# where A is nrows-by-ncols, represented by a CSR triplet
def generate_reference(nrows, ncols, csrRowPtr, csrColInd, csrVal, x):
  assert ncols == len(x), "the dimension of x does not match the dimension of A"
  mat = sparse.csr_matrix((csrVal, csrColInd, csrRowPtr), shape=(nrows, ncols))
  y = mat.dot(np.array(x).transpose())
  return y


def timing_analysis(height, width, nnz, time_memcpy_hwl, time_ref_hwl):
  time_start = np.zeros((height, width)).astype(int)
  time_end = np.zeros((height, width)).astype(int)
  word = np.zeros(3).astype(np.uint16)
  for w in range(width):
    for h in range(height):
      word[0] = time_memcpy_hwl[(h, w, 0)]
      word[1] = time_memcpy_hwl[(h, w, 1)]
      word[2] = time_memcpy_hwl[(h, w, 2)]
      time_start[(h, w)] = make_u48(word)
      word[0] = time_memcpy_hwl[(h, w, 3)]
      word[1] = time_memcpy_hwl[(h, w, 4)]
      word[2] = time_memcpy_hwl[(h, w, 5)]
      time_end[(h, w)] = make_u48(word)

  # time_ref = reference clock
  time_ref = np.zeros((height, width)).astype(int)
  word = np.zeros(3).astype(np.uint16)
  for w in range(width):
    for h in range(height):
      word[0] = time_ref_hwl[(h, w, 0)]
      word[1] = time_ref_hwl[(h, w, 1)]
      word[2] = time_ref_hwl[(h, w, 2)]
      time_ref[(h, w)] = make_u48(word)

  # adjust the reference clock by the propagation delay
  # the right-bottom PE signals other PEs, the propagation delay is
  #     (h-1) - py + (w-1) - px
  for py in range(height):
    for px in range(width):
      time_ref[(py, px)] = time_ref[(py, px)] - ((width + height - 2) - (px + py))

  # shift time_start and time_end by time_ref
  time_start = time_start - time_ref
  time_end = time_end - time_ref

  # cycles_send = time_end[(h,w)] - time_start[(h,w)]
  # 850MHz --> 1 cycle = (1/0.85) ns = (1/0.85)*1.e-3 us
  # time_send = (cycles_send / 0.85) *1.e-3 us
  #
  # The spmv kernel performs y = A * x
  #   y[i] = sum{ Aij * xj : Aij is nonzero }
  # where A is m-by-n with nnz nonzeros
  #
  # We use the following standard measurement
  # - read Aij: nnz
  # - read xj: nnz
  # - write y[i]: m
  # Total number of wavelets: (2*nnz + m)
  #
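  # NOTE: the formula above uses m (the number of matrix rows), but this
  # function only receives the PE-grid height, which is what is added here
  wvlts = 2 * nnz + height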
  min_time_start = time_start.min()
  max_time_end = time_end.max()
  cycles_send = max_time_end - min_time_start
  time_send = (cycles_send / 0.85) * 1.0e-3
  bandwidth = (wvlts * 4) / time_send
  print(f"cycles_send = {cycles_send} cycles")
  print(f"time_send = {time_send} us")
  print(f"bandwidth = {bandwidth} MB/S ")


def csl_compile_core(
    cslc: str,
    file_config: str,
    elf_dir: str,
    fabric_width: int,
    fabric_height: int,
    core_fabric_offset_x: int,  # fabric-offsets of the core
    core_fabric_offset_y: int,
    use_precompile: bool,
    arch: Optional[str],
    ncols: int,
    nrows: int,
    np_cols: int,
    np_rows: int,
    max_local_nnz: int,
    max_local_nnz_cols: int,
    max_local_nnz_rows: int,
    local_vec_sz: int,
    local_out_vec_sz: int,
    out_pad_start_idx: int,
    channels: int,
    width_west_buf: int,
    width_east_buf: int,
):
  comp_dir = elf_dir

  if not use_precompile:
    args = []
    args.append(cslc)  # command
    args.append(file_config)  # options
    args.append(f"--fabric-dims={fabric_width},{fabric_height}")  # options
    args.append(f"--fabric-offsets={core_fabric_offset_x},{core_fabric_offset_y}")  # options
    args.append(f"--params=ncols:{ncols}")  # options
    args.append(f"--params=nrows:{nrows}")  # options
    args.append(f"--params=pcols:{np_cols}")  # options
    args.append(f"--params=prows:{np_rows}")  # options
    args.append(f"--params=max_local_nnz:{max_local_nnz}")  # options
    args.append(f"--params=max_local_nnz_cols:{max_local_nnz_cols}")  # options
    args.append(f"--params=max_local_nnz_rows:{max_local_nnz_rows}")  # options
    args.append(f"--params=local_vec_sz:{local_vec_sz}")  # options
    args.append(f"--params=local_out_vec_sz:{local_out_vec_sz}")  # options
    args.append(f"--params=y_pad_start_row_idx:{out_pad_start_idx}")  # options

    args.append(f"-o={comp_dir}")
    if arch is not None:
      args.append(f"--arch={arch}")
    args.append("--memcpy")
    args.append(f"--channels={channels}")
    args.append(f"--width-west-buf={width_west_buf}")
    args.append(f"--width-east-buf={width_east_buf}")

    print(f"subprocess.check_call(args = {args})")
    subprocess.check_call(args)
  else:
    print("[csl_compile_core] use pre-compile ELFs")


def main():
  """Main method to run the example code."""

  args = parse_args()

  cslc = "cslc"
  if args.driver is not None:
    cslc = args.driver
  print(f"cslc = {cslc}")

  width_west_buf = args.width_west_buf
  width_east_buf = args.width_east_buf
  channels = args.channels
  assert channels <= 16, "only support up to 16 I/O channels"
  assert channels >= 1, "number of I/O channels must be at least 1"

  print(f"width_west_buf = {width_west_buf}")
  print(f"width_east_buf = {width_east_buf}")
  print(f"channels = {channels}")

  dirname = args.latestlink

  # core rectangle is np_cols-by-np_rows
  np_cols = args.num_pe_cols
  np_rows = args.num_pe_rows
  IS_INVEC_1 = args.is_invec_one

  width = np_cols
  height = np_rows
  print(f"width = {width}, height = {height}")

  start = time.time()
  infile_mtx = args.infile_mtx
  print(f"infile_mtx = {infile_mtx}")

  A_coo = mmread(infile_mtx)
  # the CSR format is 0-based
  A_csr = A_coo.tocsr(copy=True)
  # sort column indices
  A_csr = A_csr.sorted_indices().astype(np.float32)
  assert A_csr.has_sorted_indices == 1, "Error: A is not sorted"

  [nrows, ncols] = A_csr.shape
  nnz = A_csr.nnz

  print(f"Load matrix A, {nrows}-by-{ncols} with {nnz} nonzeros")

  if not args.is_weight_one:
    print("WARNING: reset the matrix with random values")
    np.random.seed(123)
    (A_csr.data)[0:nnz] = np.random.rand(nnz).astype(np.float32)

  csrRowPtr = A_csr.indptr
  csrColInd = A_csr.indices
  csrVal = A_csr.data

  A_csc = A_csr.tocsc(copy=True)
  # sort row indices
  A_csc = A_csc.sorted_indices().astype(np.float32)
  assert A_csc.has_sorted_indices == 1, "Error: A is not sorted"

  cscColPtr = A_csc.indptr
  cscRowInd = A_csc.indices
  cscVal = A_csc.data

  matrix_info = preprocess(
      # A is nrows-by-ncols with nnz nonzeros
      nrows,
      ncols,
      nnz,
      # core rectangle is fabx-by-faby
      np_cols,
      np_rows,
      # (csrRowPtr, csrColInd, csrVal) is the CSR representation
      csrRowPtr,
      csrColInd,
      # (cscColPtr, cscRowInd, cscVal) is the CSC representation
      cscColPtr,
      cscRowInd,
      cscVal,
  )

  end = time.time()
  print(f"prepare the structure for spmv kernel: {end-start}s", flush=True)

  max_local_nnz = matrix_info["max_local_nnz"]
  max_local_nnz_cols = matrix_info["max_local_nnz_cols"]
  max_local_nnz_rows = matrix_info["max_local_nnz_rows"]
  mat_vals_buf = matrix_info["mat_vals_buf"]
  mat_rows_buf = matrix_info["mat_rows_buf"]
  mat_col_idx_buf = matrix_info["mat_col_idx_buf"]
  mat_col_loc_buf = matrix_info["mat_col_loc_buf"]
  mat_col_len_buf = matrix_info["mat_col_len_buf"]
  y_rows_init_buf = matrix_info["y_rows_init_buf"]
  local_nnz = matrix_info["local_nnz"]
  local_nnz_cols = matrix_info["local_nnz_cols"]
  local_nnz_rows = matrix_info["local_nnz_rows"]

  x_ref = read_input_vector(IS_INVEC_1, ncols)

  # core rectangle is np_cols-by-np_rows
  #            np_cols
  #         +----------+
  # np_rows |  core    |
  #         +----------+
  # input vector is distributed into columns, then distributed into rows
  # output vector is distributed into rows, then distributed into columns
  local_vec_sz = math.ceil(math.ceil(ncols / np_cols) / np_rows)
  local_out_vec_sz = math.ceil(math.ceil(nrows / np_rows) / np_cols)

  x_tx_buf = dist_x_to_hwl(ncols, x_ref, local_vec_sz, np_cols, np_rows)

  print("Generating reference y = A*x ...")
  y_ref = generate_reference(nrows, ncols, csrRowPtr, csrColInd, csrVal, x_ref)

  mem_use_per_pe = memory_per_pe(
      max_local_nnz,
      max_local_nnz_cols,
      max_local_nnz_rows,
      local_vec_sz,
      local_out_vec_sz,
  )
  print(
      f"Total memory use per PE = {mem_use_per_pe} bytes = {mem_use_per_pe / 1024} KB",
      flush=True,
  )
  assert (mem_use_per_pe < 46 * 1024), "exceed maximum memory capacity, increase the core rectangle"

  # fabric-offsets = 1,1
  fabric_offset_x = 1
  fabric_offset_y = 1
  # starting point of the core rectangle = (core_fabric_offset_x, core_fabric_offset_y)
  # memcpy framework requires 3 columns at the west of the core rectangle
  # memcpy framework requires 2 columns at the east of the core rectangle
  core_fabric_offset_x = fabric_offset_x + 3 + width_west_buf
  core_fabric_offset_y = fabric_offset_y
  # (min_fabric_width, min_fabric_height) is the minimal dimension to run the app
  min_fabric_width = core_fabric_offset_x + width + 2 + 1 + width_east_buf
  min_fabric_height = core_fabric_offset_y + height + 1

  fabric_width = 0
  fabric_height = 0
  if args.fabric_dims:
    w_str, h_str = args.fabric_dims.split(",")
    fabric_width = int(w_str)
    fabric_height = int(h_str)

  if fabric_width == 0 or fabric_height == 0:
    fabric_width = min_fabric_width
    fabric_height = min_fabric_height

  assert fabric_width >= min_fabric_width
  assert fabric_height >= min_fabric_height

  print(f"fabric_width = {fabric_width}, fabric_height = {fabric_height}")
  print(
      f"core_fabric_offset_x = {core_fabric_offset_x}, "
      f"core_fabric_offset_y = {core_fabric_offset_y}"
  )

  # prepare the simulation
  print("store ELFs and log files in the folder ", dirname)

  # layout of a rectangle
  code_csl = "src/layout.csl"

  ## calculate the output vector padding info
  out_vec_len_per_pe_row = math.ceil(nrows / np_rows)
  out_pad_start_idx = out_vec_len_per_pe_row

  start = time.time()
  csl_compile_core(
      cslc,
      code_csl,
      dirname,
      fabric_width,
      fabric_height,
      core_fabric_offset_x,  # fabric-offsets of the core
      core_fabric_offset_y,
      args.run_only,
      args.arch,
      ncols,  # number of columns of the matrix
      nrows,  # number of rows of the matrix
      np_cols,  # width
      np_rows,  # height
      max_local_nnz,
      max_local_nnz_cols,
      max_local_nnz_rows,
      local_vec_sz,
      local_out_vec_sz,
      out_pad_start_idx,
      channels,
      width_west_buf,
      width_east_buf,
  )
  end = time.time()
  print(f"Compilation done in {end-start}s", flush=True)

  if args.compile_only:
    print("COMPILE ONLY: EXIT")
    return

  runner = SdkRuntime(dirname, cmaddr=args.cmaddr)

  sym_mat_vals_buf = runner.get_id("mat_vals_buf")
  sym_x_tx_buf = runner.get_id("x_tx_buf")
  sym_y_local_buf = runner.get_id("y_local_buf")

  sym_mat_rows_buf = runner.get_id("mat_rows_buf")
  sym_mat_col_idx_buf = runner.get_id("mat_col_idx_buf")
  sym_mat_col_loc_buf = runner.get_id("mat_col_loc_buf")
  sym_mat_col_len_buf = runner.get_id("mat_col_len_buf")
  sym_y_rows_init_buf = runner.get_id("y_rows_init_buf")
  sym_local_nnz = runner.get_id("local_nnz")
  sym_local_nnz_cols = runner.get_id("local_nnz_cols")
  sym_local_nnz_rows = runner.get_id("local_nnz_rows")
  sym_time_buf_u16 = runner.get_id("time_buf_u16")
  sym_time_ref_u16 = runner.get_id("time_ref_u16")

  start = time.time()
  runner.load()
  end = time.time()
  print(f"*** Load done in {end-start}s")

  start = time.time()
  runner.run()

  print("step 1: enable tsc counter to sample the clock")
  runner.launch("f_enable_tsc", nonblock=True)

  print("step 2: copy the structure of A and vector x to the device")
  # 1: mat_vals_buf[max_local_nnz], type = f32
  mat_vals_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz, mat_vals_buf, np.float32)
  runner.memcpy_h2d(
      sym_mat_vals_buf,
      mat_vals_buf_1d,
      0,
      0,
      width,
      height,
      max_local_nnz,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_32BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 2: x_tx_buf[local_vec_sz], type = f32
  x_tx_buf_1d = hwl_to_oned_colmajor(height, width, local_vec_sz, x_tx_buf, np.float32)
  runner.memcpy_h2d(
      sym_x_tx_buf,
      x_tx_buf_1d,
      0,
      0,
      width,
      height,
      local_vec_sz,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_32BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 3: mat_rows_buf[max_local_nnz], type = u16
  mat_rows_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz, mat_rows_buf, np.uint32)
  runner.memcpy_h2d(
      sym_mat_rows_buf,
      mat_rows_buf_1d,
      0,
      0,
      width,
      height,
      max_local_nnz,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 4: mat_col_idx_buf[max_local_nnz_cols], type = u16
  mat_col_idx_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_idx_buf,
                                            np.uint32)
  runner.memcpy_h2d(
      sym_mat_col_idx_buf,
      mat_col_idx_buf_1d,
      0,
      0,
      width,
      height,
      max_local_nnz_cols,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 5: mat_col_loc_buf[max_local_nnz_cols], type = u16
  mat_col_loc_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_loc_buf,
                                            np.uint32)
  runner.memcpy_h2d(
      sym_mat_col_loc_buf,
      mat_col_loc_buf_1d,
      0,
      0,
      width,
      height,
      max_local_nnz_cols,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 6: mat_col_len_buf[max_local_nnz_cols], type = u16
  mat_col_len_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_len_buf,
                                            np.uint32)
  runner.memcpy_h2d(
      sym_mat_col_len_buf,
      mat_col_len_buf_1d,
      0,
      0,
      width,
      height,
      max_local_nnz_cols,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 7: y_rows_init_buf[max_local_nnz_rows], type = u16
  y_rows_init_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_rows, y_rows_init_buf,
                                            np.uint32)
  runner.memcpy_h2d(
      sym_y_rows_init_buf,
      y_rows_init_buf_1d,
      0,
      0,
      width,
      height,
      max_local_nnz_rows,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 8: local_nnz, type = u16
  local_nnz_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz, np.uint32)
  runner.memcpy_h2d(
      sym_local_nnz,
      local_nnz_1d,
      0,
      0,
      width,
      height,
      1,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 9: local_nnz_cols, type = u16
  local_nnz_cols_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz_cols, np.uint32)
  runner.memcpy_h2d(
      sym_local_nnz_cols,
      local_nnz_cols_1d,
      0,
      0,
      width,
      height,
      1,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  # 10: local_nnz_rows, type = u16
  local_nnz_rows_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz_rows, np.uint32)
  runner.memcpy_h2d(
      sym_local_nnz_rows,
      local_nnz_rows_1d,
      0,
      0,
      width,
      height,
      1,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=True,
  )

  print("step 3: sync all PEs to sample the reference clock")
  runner.launch("f_sync", np.int16(1), nonblock=False)

  print("step 4: tic() records time_start")
  runner.launch("f_tic", nonblock=True)

  print("step 5: spmv")
  runner.launch("f_spmv", nonblock=False)

  print("step 6: toc() records time_end")
  runner.launch("f_toc", nonblock=False)

  print("step 7: prepare (time_start, time_end)")
  runner.launch("f_memcpy_timestamps", nonblock=False)

  print("step 8: fetch the timing time_buf_u16[6] = (time_start, time_end), type = u16")
  time_memcpy_hwl_1d = np.zeros(height * width * 6, np.uint32)
  runner.memcpy_d2h(
      time_memcpy_hwl_1d,
      sym_time_buf_u16,
      0,
      0,
      width,
      height,
      6,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=False,
  )
  time_memcpy_hwl = oned_to_hwl_colmajor(height, width, 6, time_memcpy_hwl_1d, np.uint16)

  print("step 9: fetch the output vector y of type f32")
  y_1d = np.zeros(height * width * local_out_vec_sz, np.float32)
  runner.memcpy_d2h(
      y_1d,
      sym_y_local_buf,
      0,
      0,
      width,
      height,
      local_out_vec_sz,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_32BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=False,
  )

  print("step 10: prepare reference clock")
  runner.launch("f_reference_timestamps", nonblock=False)

  print("step 11: D2H reference clock")
  time_ref_1d = np.zeros(height * width * 3, np.uint32)
  runner.memcpy_d2h(
      time_ref_1d,
      sym_time_ref_u16,
      0,
      0,
      width,
      height,
      3,
      streaming=False,
      data_type=MemcpyDataType.MEMCPY_16BIT,
      order=MemcpyOrder.COL_MAJOR,
      nonblock=False,
  )
  time_ref_hwl = oned_to_hwl_colmajor(height, width, 3, time_ref_1d, np.uint16)

  runner.stop()

  end = time.time()
  print(f"*** Run done in {end-start}s")

  timing_analysis(height, width, nnz, time_memcpy_hwl, time_ref_hwl)

  # The output y_wse is distributed over height-by-width (np_rows-by-np_cols) PEs
  y_wse = np.reshape(y_1d, (height, width, local_out_vec_sz), order="F")
  # y_wse is packed into 1d vector with zero padding
  y_wse = unpad_3d_to_1d(nrows, y_wse)
  # remove padding of y_wse because y_ref has no padding
  verify_result(y_ref, y_wse[0:nrows])

  if args.simulator:
    # move simulation log and core dump to the given folder
    dst_log = Path(f"{dirname}/sim.log")
    src_log = Path("sim.log")
    if src_log.exists():
      shutil.move(src_log, dst_log)

    dst_trace = Path(f"{dirname}/simfab_traces")
    src_trace = Path("simfab_traces")
    if dst_trace.exists():
      shutil.rmtree(dst_trace)
    if src_trace.exists():
      shutil.move(src_trace, dst_trace)

  # dump the device memory via debug tool
  if args.simulator:
    print(f"time_ref_hwl = \n{time_ref_hwl}")
    #debug_mod = debug_util(dirname, cmaddr=args.cmaddr)
    #for py in range(height):
    #  for px in range(width):
    #    t = debug_mod.get_symbol(
    #        core_fabric_offset_x + px,
    #        core_fabric_offset_y + py,
    #        "time_ref_u16",
    #        np.uint16,
    #    )
    #    print(f"(py, px) = {py, px}, time_ref_u16_ij = {t}")


if __name__ == "__main__":
  main()
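
The sizing arithmetic in main() can be checked on the host before compiling anything. The following is a minimal sketch that restates the two formulas used above: the per-PE lengths of x and y, and the minimal fabric dimensions. The helper names local_vector_sizes and min_fabric_dims are hypothetical and not part of run.py; the constants (3 columns to the west, 2 to the east, fabric offset 1,1) are copied from the comments in main().

```python
import math


def local_vector_sizes(nrows, ncols, np_rows, np_cols):
    """Per-PE lengths of the x and y pieces (hypothetical helper)."""
    # x is distributed over PE columns first, then over PE rows
    local_vec_sz = math.ceil(math.ceil(ncols / np_cols) / np_rows)
    # y is distributed over PE rows first, then over PE columns
    local_out_vec_sz = math.ceil(math.ceil(nrows / np_rows) / np_cols)
    return local_vec_sz, local_out_vec_sz


def min_fabric_dims(width, height, width_west_buf=0, width_east_buf=0):
    """Core offset and minimal fabric size (hypothetical helper)."""
    fabric_offset_x, fabric_offset_y = 1, 1
    # memcpy framework: 3 columns west of the core, 2 columns east of it
    core_x = fabric_offset_x + 3 + width_west_buf
    core_y = fabric_offset_y
    min_w = core_x + width + 2 + 1 + width_east_buf
    min_h = core_y + height + 1
    return core_x, core_y, min_w, min_h


sizes = local_vector_sizes(nrows=1000, ncols=1000, np_rows=3, np_cols=4)
dims = min_fabric_dims(width=4, height=3)
print(f"local_vec_sz, local_out_vec_sz = {sizes}")
print(f"core_x, core_y, min_fabric_width, min_fabric_height = {dims}")
```

For a 1000-by-1000 matrix on a 4-by-3 core rectangle this gives 84 elements of x and 84 elements of y per PE, a core offset of (4, 1), and a minimal fabric of 11-by-5. A value passed through --fabric-dims has to be at least that large, otherwise the assertions in main() fail.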

cmd_parser.py¶

import argparse


def parse_args():
  parser = argparse.ArgumentParser()
  parser.add_argument("--infile_mtx", help="the sparse matrix in MTX format", required=True)
  parser.add_argument("--simulator", action="store_true", help="Runs on simulator")
  parser.add_argument(
      "--num_pe_cols",
      type=int,
      help="width of the core rectangle",
      required=True,
  )
  parser.add_argument(
      "--num_pe_rows",
      type=int,
      help="height of the core rectangle",
      required=True,
  )
  parser.add_argument("--fabric-dims", help="Fabric dimension, i.e. <W>,<H>")
  parser.add_argument("--compile-only", help="Compile only", action="store_true")
  parser.add_argument("--run-only", help="Run only", action="store_true")
  parser.add_argument("--width-west-buf", default=0, type=int, help="width of west buffer")
  parser.add_argument("--width-east-buf", default=0, type=int, help="width of east buffer")
  parser.add_argument(
      "--channels",
      default=1,
      type=int,
      help="number of I/O channels, between 1 and 16",
  )
  parser.add_argument(
      "-d",
      "--driver",
      help="The path to the CSL compiler",
  )
  parser.add_argument("--cmaddr", help="CM address and port, i.e. <IP>:<port>")
  parser.add_argument("--arch", help="wse2 or wse3. Default is wse2 when not supplied.")
  parser.add_argument(
      "--is_invec_one",
      help="input vector x is all ones",
      action="store_true",
      default=False,
  )
  parser.add_argument(
      "--is_weight_one",
      help="keep the values of A read from the MTX file; if not set, run.py overwrites them with random values",
      action="store_true",
      default=False,
  )
  parser.add_argument(
      "--latestlink",
      help="folder to contain the log files (default: latest)",
      default="latest",
  )

  args = parser.parse_args()

  if args.cmaddr is None:
    args.simulator = False

  return args
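
To see how these flags turn into attributes, the sketch below rebuilds a reduced parser with a subset of the options above and parses a sample argument list. It is an illustration only: build_parser and the path data/example.mtx are hypothetical, and the real parse_args() reads sys.argv and defines more options.

```python
import argparse


def build_parser():
    """Reduced copy of the parser above (hypothetical helper, subset of flags)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--infile_mtx", required=True)
    parser.add_argument("--num_pe_cols", type=int, required=True)
    parser.add_argument("--num_pe_rows", type=int, required=True)
    parser.add_argument("--channels", default=1, type=int)
    parser.add_argument("--fabric-dims")
    parser.add_argument("--compile-only", action="store_true")
    return parser


# hypothetical command line; the matrix path is a placeholder
argv = [
    "--infile_mtx=data/example.mtx",
    "--num_pe_cols=4",
    "--num_pe_rows=3",
    "--channels=2",
    "--fabric-dims=11,5",
]
args = build_parser().parse_args(argv)

# argparse maps "--fabric-dims" to the attribute "fabric_dims"
fabric_width, fabric_height = (int(v) for v in args.fabric_dims.split(","))
print(args.num_pe_cols, args.num_pe_rows, args.channels, args.compile_only)
print(fabric_width, fabric_height)
```

Options written with dashes, such as --fabric-dims and --compile-only, are read back with underscores (args.fabric_dims, args.compile_only), which is how main() accesses them.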

memory_usage.py¶

import numpy as np


def memory_per_pe(
    max_local_nnz,
    max_local_nnz_cols,
    max_local_nnz_rows,
    local_in_vec_sz,
    local_out_vec_sz,
):
  """
    // input matrix
    var mat_vals_buf = @zeros([max_local_nnz]f32);      // in matrix values (sparse): 4B
    var mat_rows_buf = @zeros([max_local_nnz]u16);      // in matrix relative row offsets: 2B
                                                        // need this in preprocessing: 2B
    // column idx of nnz cols (max possible size is nnz)
    var mat_col_idx_buf = @zeros([max_local_nnz_cols]u16);
    // col location in mat_vals_buf and mat_rows_buf (max nnz)
    var mat_col_loc_buf = @zeros([max_local_nnz_cols]u16);
    // col length (nnz rows in a col)
    var mat_col_len_buf = @zeros([max_local_nnz_cols]u16);

    // input vector: for north-going and south-going trains
    // buffer storing data for tx
    var x_tx_buf = @zeros([local_vec_sz]f32);       // in vector values (dense): 4B
    // double buffers storing rx data
    var x_north_buf0 = @zeros([local_vec_sz]f32);   // in vector values (dense): 4B
    var x_south_buf0 = @zeros([local_vec_sz]f32);   // in vector values (dense): 4B
    var x_north_buf1 = @zeros([local_vec_sz]f32);   // in vector values (dense): 4B
    var x_south_buf1 = @zeros([local_vec_sz]f32);   // in vector values (dense): 4B

    // precomputed output vector (sparse format) local rows index information
    var y_rows_init_buf = @zeros([max_local_nnz_rows]u16);        // init -- should not be modified

    // output vector (sparse): to store partial computed output vectors for north and south trains
    var y_vals_north_buf = @zeros([max_local_nnz_rows]f32);       // 4B
    var y_rows_north_buf = @zeros([max_local_nnz_rows]u16);       // 2B
    var y_vals_south_buf = @zeros([max_local_nnz_rows]f32);       // 4B
    var y_rows_south_buf = @zeros([max_local_nnz_rows]u16);       // 2B

    // buffers for east and west trains
    // rx/tx vals on west-train during reduction (sparse): 4B
    var y_vals_west_buf = @zeros([max_local_nnz_rows]f32);
    // rx/tx rows on west-train during reduction (sparse): 2B
    var y_rows_west_buf = @zeros([max_local_nnz_rows]u16);
    // rx/tx vals on east-train during reduction (sparse): 4B
    var y_vals_east_buf = @zeros([max_local_nnz_rows]f32);
    // rx/tx rows on east-train during reduction (sparse): 2B
    var y_rows_east_buf = @zeros([max_local_nnz_rows]u16);

    // final reduced local output vector (dense)
    var y_local_buf = @zeros([local_out_vec_sz]f32);    // 4B
    """

  dtsz_u16 = np.dtype(np.uint16).itemsize  ## 2 bytes
  dtsz_f32 = np.dtype(np.float32).itemsize  ## 4 bytes

  ## input matrix in sparse format
  in_mat_mem = (dtsz_f32 + dtsz_u16) * max_local_nnz + 3 * dtsz_u16 * max_local_nnz_cols
  ## input vector in dense format
  in_vec_mem = 5 * dtsz_f32 * local_in_vec_sz  ## 4 buffers + 1 tx
  ## partial output vector in sparse format
  sp_vec_init_mem = dtsz_u16 * max_local_nnz_rows  ## init/precomputed rows data
  sp_vec_mem = 4 * ((dtsz_f32 + dtsz_u16) * max_local_nnz_rows)  ## 4 sets of buffers
  ## output vector in dense format
  out_vec_mem = dtsz_f32 * local_out_vec_sz

  return in_mat_mem + in_vec_mem + sp_vec_init_mem + sp_vec_mem + out_vec_mem
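
As a worked example of the formula above, the sketch below recomputes the same five terms without NumPy and prints a breakdown. The function memory_breakdown and the input sizes are made up for illustration; the 46 KB limit is the threshold asserted in run.py.

```python
U16_BYTES = 2
F32_BYTES = 4
PE_MEMORY_LIMIT = 46 * 1024  # threshold asserted in run.py


def memory_breakdown(max_local_nnz, max_local_nnz_cols, max_local_nnz_rows,
                     local_in_vec_sz, local_out_vec_sz):
    """Same terms as memory_per_pe, returned per category (hypothetical helper)."""
    return {
        # values (f32) + row positions (u16), plus three u16 arrays per nonzero column
        "in_mat_mem": (F32_BYTES + U16_BYTES) * max_local_nnz
                      + 3 * U16_BYTES * max_local_nnz_cols,
        # 1 tx buffer + 4 rx buffers, all f32
        "in_vec_mem": 5 * F32_BYTES * local_in_vec_sz,
        # precomputed row indices (u16)
        "sp_vec_init_mem": U16_BYTES * max_local_nnz_rows,
        # north, south, west, east: one f32 and one u16 array each
        "sp_vec_mem": 4 * (F32_BYTES + U16_BYTES) * max_local_nnz_rows,
        # dense local output (f32)
        "out_vec_mem": F32_BYTES * local_out_vec_sz,
    }


# illustrative sizes, not taken from a real matrix
parts = memory_breakdown(1000, 200, 300, 64, 64)
total = sum(parts.values())
for name, nbytes in parts.items():
    print(f"{name:>16}: {nbytes} bytes")
print(f"{'total':>16}: {total} bytes, fits = {total < PE_MEMORY_LIMIT}")
```

With these sizes the total is 16536 bytes, well under the limit. The matrix term and the four sparse output buffers dominate, so reducing max_local_nnz and max_local_nnz_rows (a larger core rectangle or a better-balanced partition) is what lowers the footprint most.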

preprocess.py¶

import numpy as np


# name mapping between the spmv kernel and this preprocessing code
#   this code        spmv kernel
# ----------------------------------
#  local_nzcols     local_nnz_cols
#  local_nzrows     local_nnz_rows
#  local_nnz        local_nnz
#  y_rows           y_rows_init_buf
#  A_colloc         mat_col_loc_buf
#  A_collen         mat_col_len_buf
#  A_colidx         mat_col_idx_buf
#  A_rows           mat_rows_buf
#  A_vals           mat_vals_buf
#
def preprocess(
    # A is nrows-by-ncols with nnz nonzeros
    nrows: int,
    ncols: int,
    nnz: int,
    # core rectangle of spmv is fabx-by-faby
    fabx: int,
    faby: int,
    # (csrRowPtr, csrColInd, csrVal) is the CSR representation
    csrRowPtr: np.ndarray,
    csrColInd: np.ndarray,
    # (cscColPtr, cscRowInd, cscVal) is the CSC representation
    cscColPtr: np.ndarray,
    cscRowInd: np.ndarray,
    cscVal: np.ndarray,
):
  """
    Given a sparse matrix A of dimension nrows-by-ncols with nnz nonzeros
    and the dimension of core rectangle fabx-by-faby, partition the matrix
    A such that PE(px=j, py=i) contains the submatrix Aij with the
    following quantities:

    local_nzrows: number of nonzero rows
    local_nzcols: number of nonzero columns
    local_nnz: number of nonzero elements
    y_rows[local_nzrows]: nonzero row index
    y_vals[local_nzrows]: not used
    A_colloc[local_nzcols]: prefix sum of A_collen, used to point to A_rows
    A_collen[local_nzcols]: A_collen[j] is the number of nonzeros in the j-th nonzero column
    A_colidx[local_nzcols]: column index of nonzero columns
    A_rows[local_nnz]: position of row index of nonzeros in y_rows
    A_vals[local_nnz]: value of nonzeros

    """
  assert csrRowPtr[0] == 0, "CSR must be base-0"
  assert cscColPtr[0] == 0, "CSC must be base-0"
  assert csrRowPtr[nrows] == nnz, "CSR has wrong nnz"
  assert cscColPtr[ncols] == nnz, "CSC has wrong nnz"

  bx = (ncols + fabx - 1) // fabx  # number of columns of a block
  by = (nrows + faby - 1) // faby  # number of rows of a block

  local_nzrows = np.zeros((faby, fabx, 1), dtype=np.int32)
  local_nzcols = np.zeros((faby, fabx, 1), dtype=np.int32)
  local_nnz = np.zeros((faby, fabx, 1), dtype=np.int32)

  max_grid_dim = max(faby, fabx)
  counted = np.zeros(max_grid_dim, dtype=np.int32)

  # step 1: compute local_nzcols and local_nnz
  counted[0:max_grid_dim] = -1  # invalid token
  for col in range(ncols):
    check_token = col
    # col = col_b * bx + col_l
    # where col_b is the column block index
    #       col_l is local column index
    col_b = int(col / bx)
    col_l = col - col_b * bx
    start = cscColPtr[col]
    end = cscColPtr[col + 1]
    for colidx in range(start, end):
      row = cscRowInd[colidx]
      # row = row_b * by + row_l
      # where row_b is the row block index
      #       row_l is local row index
      row_b = int(row / by)
      row_l = row - row_b * by
      local_nnz[(row_b, col_b)] += 1
      # Suppose Aij is block (row_b, col_b)
      # if |{Aij(i, col_l) != 0}| > 0, col_l is a nonzero column in Aij
      # we use counted[row_b] to count only once
      # if Aij(i1, col_l) and Aij(i2, col_l) are nonzero and i1 < i2,
      # only Aij(i1, col_l) adds local_nzcols[(row_b, col_b)]
      if counted[row_b] != check_token:
        # Aij(row_l,col_l) is nonzero
        local_nzcols[(row_b, col_b)] += 1
        counted[row_b] = check_token

  # step 2: compute local_nzrows
  counted[0:max_grid_dim] = -1  # invalid token
  for row in range(nrows):
    check_token = row
    # row = row_b * by + row_l
    row_b = int(row / by)
    row_l = row - row_b * by
    start = csrRowPtr[row]
    end = csrRowPtr[row + 1]
    for colidx in range(start, end):
      col = csrColInd[colidx]
      # col = col_b * bx + col_l
      col_b = int(col / bx)
      col_l = col - col_b * bx
      # Suppose Aij is block (row_b, col_b)
      # if |{Aij(row_l, j) != 0}| > 0, row_l is a nonzero row in Aij
      # we use counted[col_b] to count only once
      # if Aij(row_l, j1) and Aij(row_l, j2) are nonzero and j1 < j2,
      # only Aij(row_l, j1) adds local_nzrows[(row_b, col_b)]
      if counted[col_b] != check_token:
        # Aij(row_l,col_l) is nonzero
        local_nzrows[(row_b, col_b)] += 1
        counted[col_b] = check_token

  # step 3: compute maximum dimension of Aij
  max_local_nnz = max(local_nnz.ravel())
  max_local_nnz_cols = max(local_nzcols.ravel())
  max_local_nnz_rows = max(local_nzrows.ravel())

  assert (max_local_nnz < np.iinfo(
      np.uint16).max), "LOCAL NUMBER OF NONZEROS WILL OVERFLOW, TRY USING A LARGER FABRIC"
  assert (max_local_nnz_cols < np.iinfo(
      np.uint16).max), "LOCAL NUMBER OF NZCOLS WILL OVERFLOW, TRY USING A LARGER FABRIC"
  assert (max_local_nnz_rows < np.iinfo(
      np.uint16).max), "LOCAL NUMBER OF NZROWS WILL OVERFLOW, TRY USING A LARGER FABRIC"
  # no data overflows u16, we can convert the data to u16
  local_nnz = local_nnz.astype(np.uint16)
  local_nzrows = local_nzrows.astype(np.uint16)
  local_nzcols = local_nzcols.astype(np.uint16)

  #     spmv kernel                      actual storage in preprocess
  # ------------------------------------------------------------------
  # mat_vals_buf[max_local_nnz]           A_vals[local_nnz]
  # mat_rows_buf[max_local_nnz]           A_rows[local_nnz]
  # mat_col_loc_buf[max_local_nnz_cols]   A_colloc[local_nzcols]
  # mat_col_len_buf[max_local_nnz_cols]   A_collen[local_nzcols]
  # mat_col_idx_buf[max_local_nnz_cols]   A_colidx[local_nzcols]
  # y_rows_init_buf[max_local_nnz_rows]   y_rows[local_nzrows]
  #
  # To prepare the data for spmv, each PE allocates the maximum dimension
  # max_local_nnz, max_local_nnz_cols or max_local_nnz_rows
  A_vals = np.zeros((faby, fabx, max_local_nnz), dtype=np.float32)
  A_rows = np.zeros((faby, fabx, max_local_nnz), dtype=np.uint16)
  A_colloc = np.zeros((faby, fabx, max_local_nnz_cols), dtype=np.uint16)
  A_collen = np.zeros((faby, fabx, max_local_nnz_cols), dtype=np.uint16)
  A_colidx = np.zeros((faby, fabx, max_local_nnz_cols), dtype=np.uint16)
  y_rows = np.zeros((faby, fabx, max_local_nnz_rows), dtype=np.uint16)

  # step 4: compute y_rows
  local_pos = np.zeros((faby, fabx), dtype=np.int32)
  counted[0:max_grid_dim] = -1  # invalid token
  for row in range(nrows):
    check_token = row
    # row = row_b * by + row_l
    row_b = int(row / by)
    row_l = row - row_b * by
    start = csrRowPtr[row]
    end = csrRowPtr[row + 1]
    for colidx in range(start, end):
      col = csrColInd[colidx]
      # col = col_b * bx + col_l
      col_b = int(col / bx)
      col_l = col - col_b * bx
      # Suppose Aij is block (row_b, col_b)
      # if |{Aij(row_l, j) != 0}| > 0, row_l is a nonzero row in Aij
      # we use counted[col_b] to count only once
      if counted[col_b] != check_token:
        # Aij(row_l,col_l) is nonzero
        pos = local_pos[(row_b, col_b)]
        y_rows[(row_b, col_b, pos)] = row_l
        local_pos[(row_b, col_b)] = (pos + 1)  # advance to next nonzero row in Aij
        counted[col_b] = check_token

  # step 5: compute A_colloc, A_colidx, A_collen, A_rows and A_vals
  #  y_rows is computed in step 4 because A_rows must be constructed by using y_rows

  # "local_pos" keeps track of the position of nonzero column in A_colidx
  local_pos = np.zeros((faby, fabx), dtype=np.int32)
  counted[0:max_grid_dim] = -1  # invalid token
  for col in range(ncols):
    check_token = col
    # col = col_b * bx + col_l
    # where col_b is the column block index
    #       col_l is local column index
    col_b = int(col / bx)
    col_l = col - col_b * bx
    start = cscColPtr[col]
    end = cscColPtr[col + 1]
    for colidx in range(start, end):
      row = cscRowInd[colidx]
      val = cscVal[colidx]
      # row = row_b * by + row_l
      # where row_b is the row block index
      #       row_l is local row index
      row_b = int(row / by)
      row_l = row - row_b * by
      # Suppose Aij is block (row_b, col_b)
      # Aij(row_l,col_l) is nonzero
      if counted[row_b] != check_token:
        # pos = position of nonzero column index in A_colidx and A_collen
        # A_collen[pos] accumulates the number of nonzero rows in this column
        # A_colidx[pos] is the nonzero local column index
        pos = local_pos[(row_b, col_b)]
        # only record nonzero local column index once
        A_colidx[(row_b, col_b, pos)] = col_l
        # update A_colloc such that
        # A_colloc[0] = 0
        # A_colloc[j] = A_colloc[j-1] + A_collen[j-1]
        if pos > 0:
          A_colloc[(row_b, col_b,
                    pos)] = (A_colloc[(row_b, col_b, pos - 1)] + A_collen[(row_b, col_b, pos - 1)])
        local_pos[(row_b, col_b)] = (pos + 1)  # advance to next nonzero column in Aij
        counted[row_b] = check_token
      # else:
      #   "pos" is still the current position of the nonzero column index in A_collen

      # Remark: "pos" is well-defined because CSC is sorted in ascending order
      #   if col_l changes, then the previous nonzero col_l is done
      #   When the loop enters the 1st row_l of Aij(:, col_l), it defines "pos";
      #   the subsequent row_l in the same Aij(:, col_l) keep the same "pos"
      #   When the loop exits Aij, A_collen and A_rows for Aij(:, col_l) are done
      #   When the loop enters Aij again, it restarts the process for the next nonzero
      #   col_l in Aij
      pos_start = A_colloc[(row_b, col_b,
                            pos)]  # position of 1st row index of Aij(:, col_l) in A_rows
      pos_rel_rowidx = A_collen[(row_b, col_b, pos)]  # position of nonzero row index in A_rows
      # corresponding to Aij(:, col_l)
      pos_rowidx = pos_rel_rowidx + pos_start
      # A_rows records distance(y_rows.begin(), find(y_rows.begin(), y_rows.end(), row_l))
      # spmv uses y_rows to store the result of outer-product of A*x
      y_rows_list = list(y_rows[(row_b, col_b)])
      A_rows[(row_b, col_b, pos_rowidx)] = y_rows_list.index(row_l)
      A_vals[(row_b, col_b, pos_rowidx)] = val
      A_collen[(row_b, col_b, pos)] = (pos_rel_rowidx + 1)  # move to next nonzero Aij(row_l, col_l)

  matrix_info = {}
  matrix_info["nrows"] = nrows  # number of rows of the matrix
  matrix_info["ncols"] = ncols  # number of columns of the matrix
  matrix_info["nnz"] = nnz  # number of nonzeros of the matrix
  matrix_info["max_local_nnz"] = max_local_nnz
  matrix_info["max_local_nnz_cols"] = max_local_nnz_cols
  matrix_info["max_local_nnz_rows"] = max_local_nnz_rows
  matrix_info["mat_vals_buf"] = A_vals
  matrix_info["mat_rows_buf"] = A_rows
  matrix_info["mat_col_loc_buf"] = A_colloc
  matrix_info["mat_col_len_buf"] = A_collen
  matrix_info["mat_col_idx_buf"] = A_colidx
  matrix_info["y_rows_init_buf"] = y_rows
  matrix_info["local_nnz"] = local_nnz
  matrix_info["local_nnz_cols"] = local_nzcols
  matrix_info["local_nnz_rows"] = local_nzrows

  return matrix_info
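
The per-PE structure described in the docstring of preprocess() is easier to follow on a single small block. The sketch below builds y_rows, A_colidx, A_collen, A_colloc, A_rows and A_vals for one dense block and then uses them for a column-by-column product on the host. The helpers compress_block and block_spmv are hypothetical and only mirror the layout; the device kernel is not shown here and may process the data differently.

```python
import numpy as np


def compress_block(block):
    """Build the per-PE structure for one dense block Aij (hypothetical helper)."""
    block = np.asarray(block, dtype=np.float32)
    y_rows = np.flatnonzero(np.any(block != 0, axis=1))    # nonzero local rows
    col_idx = np.flatnonzero(np.any(block != 0, axis=0))   # nonzero local columns
    row_pos = {int(r): p for p, r in enumerate(y_rows)}    # local row -> position in y_rows

    col_len, rows, vals = [], [], []
    for c in col_idx:
        nz = np.flatnonzero(block[:, c])
        col_len.append(len(nz))
        rows.extend(row_pos[int(r)] for r in nz)  # A_rows stores positions in y_rows
        vals.extend(block[nz, c])
    col_len = np.array(col_len, dtype=np.int64)
    return {
        "y_rows": y_rows,
        "A_colidx": col_idx,
        "A_collen": col_len,
        "A_colloc": np.cumsum(col_len) - col_len,  # exclusive prefix sum of A_collen
        "A_rows": np.array(rows, dtype=np.int64),
        "A_vals": np.array(vals, dtype=np.float32),
    }


def block_spmv(s, x_local, nrows_local):
    """y = Aij * x_local using only the compressed structure (host-side sketch)."""
    y_vals = np.zeros(len(s["y_rows"]), dtype=np.float32)
    for j, c in enumerate(s["A_colidx"]):
        lo = s["A_colloc"][j]
        hi = lo + s["A_collen"][j]
        # accumulate column c scaled by x[c] into the compact output
        y_vals[s["A_rows"][lo:hi]] += s["A_vals"][lo:hi] * x_local[c]
    y = np.zeros(nrows_local, dtype=np.float32)
    y[s["y_rows"]] = y_vals  # scatter the compact result back to local rows
    return y


block = np.array([[0, 0, 2, 0],
                  [0, 0, 0, 0],
                  [1, 0, 3, 0]], dtype=np.float32)
x_local = np.array([1, 2, 3, 4], dtype=np.float32)

s = compress_block(block)
y = block_spmv(s, x_local, nrows_local=block.shape[0])
print({k: v.tolist() for k, v in s.items()})
print(y.tolist(), (block @ x_local).tolist())
```

In this block, rows 0 and 2 and columns 0 and 2 are nonzero, so y_rows = [0, 2], A_colidx = [0, 2], A_collen = [1, 2] and A_colloc = [0, 1]. A_rows = [1, 0, 1] holds positions in y_rows rather than row indices, which lets the partial result be kept in a compact array of length local_nzrows.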