# GEMV Tutorial 2: Memory DSDs

## Contents

# GEMV Tutorial 2: Memory DSDs¶

Now that we’ve written a complete program, let’s introduce a central concept in CSL: memory Data Structure Descriptors (DSDs). Memory DSDs provide an efficient mechanism for performing operations on entire tensors.

## Learning objectives¶

After completing this tutorial, you should know how to:

Define memory DSDs for tensor accesses

Use memory DSDs in builtin operations on tensors

Use builtins to initialize tensors

## Example overview¶

Our program will run on a single processing element (PE). Like the previous tutorial, we will demonstrate the program with a simulated fabric consisting of an 8 x 3 block of PEs.

Our problem steps are identical to the previous tutorial.
Our layout file, host code, and compile and run commands
are also identical.
We only need to modify `pe_program.csl`

, and we’ll take
a closer look at changes to this file.

## Modifying the CSL¶

In the previous tutorial, we created a complete CSL program using a
single PE to initialize and compute `y = Ax + b`

.
What do we need to do in `pe_program.csl`

to take advantage of
memory DSDs and builtin operations on tensors?

We need to define DSDs for accessing our tensors

We need to rewrite the

`gemv`

function to operate on these DSDs

We previously walked through `layout.csl`

, which is the same for
this tutorial as the previous one.
We include the new `pe_program.csl`

below, and highlight the
changes in this code.

```
param memcpy_params: comptime_struct;
// memcpy module provides infrastructure for copying data
// and launching functions from the host
const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);
// Constants definining dimensions of our matrix
const M: i16 = 4;
const N: i16 = 6;
// 48 kB of global memory contain A, x, b, y
var A: [M*N]f32; // A is stored row major
// Initialize x, b, y using builtins
var x = @constants([N]f32, 1.0);
var b = @constants([M]f32, 2.0);
var y = @zeros([M]f32);
// DSDs for accessing A, b, y
var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
// A_dsd accesses column of A
var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i*N] });
// ptr to y will be advertised as symbol to host
const y_ptr: [*]f32 = &y;
// Initialize A matrix
fn initialize() void {
// for loop with range syntax
for (@range(i16, M*N)) |idx| {
A[idx] = @as(f32, idx);
}
}
// Compute gemv
fn gemv() void {
// Loop over all columns of A
for (@range(u16, N)) |i| {
// Calculate contribution to A*x from ith column of A, ith elem of x
@fmacs(y_dsd, y_dsd, A_dsd, x[i]);
A_dsd = @increment_dsd_offset(A_dsd, 1, f32);
}
// Add b to A*x
@fadds(y_dsd, y_dsd, b_dsd);
}
// Call initialize and gemv functions
fn init_and_compute() void {
initialize();
gemv();
sys_mod.unblock_cmd_stream();
}
comptime {
@export_symbol(y_ptr, "y");
@export_symbol(init_and_compute);
}
```

### Defining our memory DSDs¶

First, let’s take a look at the DSDs we define for accessing `b`

and `y`

:

```
var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
```

`b_dsd`

and `y_dsd`

are the memory DSDs for
accessing `b`

, and `y`

, respectively.

The `tensor_access`

field defines the access pattern of these DSDs.
`|i|`

specifies the induction variable, and `{M}`

specifies
the loop bound; i.e., these DSDs will access `M`

elements.
After `->`

, an expression is given for accessing a memory location
using the induction variable.
This expression must be **affine**, or linear plus a constant.

The access pattern for these DSDs is straightforward: these DSDs
loop over all `M`

elements, in order, of their respective tensors.

Now let’s take a look at the DSD for accessing `A`

:

```
var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i*N] });
```

This DSD accesses `M`

elements of `A`

, but strided by `N`

elements;
i.e., `A_dsd`

accesses elements `0, N, 2*N, ... (M-1)*N`

.
Because `A`

is stored in row major format, this means that `A_dsd`

as defined here accesses the 0th column of `A`

.

Note

These memory DSDs are of type `mem1d_dsd`

, which are one-dimensional
memory DSDs. CSL also provides `mem4d_dsd`

, multidimensional memory
DSDs for up to four dimensions.

You can learn more about memory DSDs in our language reference guide Data Structure Descriptors.

### Using our DSDs to compute GEMV¶

Now that we’ve defined our DSDs, let’s take a look at how to use them to compute GEMV.

Recall that our previous `gemv()`

function was defined as follows:

```
fn gemv() void {
for (@range(i16, M)) |i| {
var tmp: f32 = 0.0;
for (@range(i16, N)) |j| {
tmp += A[i*N + j] * x[j];
}
y[i] = tmp + b[i];
}
}
```

Now, our `gemv()`

looks like this:

```
fn gemv() void {
for (@range(u16, N)) |i| {
@fmacs(y_dsd, y_dsd, A_dsd, x[i]);
A_dsd = @increment_dsd_offset(A_dsd, 1, f32);
}
@fadds(y_dsd, y_dsd, b_dsd);
}
```

Notice that we now only have one explicit loop over `N`

,
instead of two explicit loops.
At each iteration, this `@fmacs`

operation does the following:

performs a vector-scalar multiplication between the column of

`A`

referenced by`A_dsd`

and the scalar`x[i]`

,performs an elementwise vector addition between this result and the vector

`y`

,and stores this final result into

`y`

.

Thus, each `@fmacs`

operation increments the `M`

elements of `y`

by the vector-scalar product of column `i`

of `A`

and element `i`

of `x`

.

The `@increment_dsd_offset`

operation at each loop iteration increments
`A_dsd`

to reference the next column of `A`

.
This builtin operation takes `A_dsd`

and creates a new DSD by offseting
its access by 1 `f32`

element.

For instance, the first time this operation occurs, `A_dsd`

will now
access elements `1, N+1, 2*N+1, ... (M-1)*N+1`

of `A`

.
Again, because `A`

is stored row major,
this will access the 1st column of `A`

.

Once this loop over the `N`

columns of `A`

is complete,
`y`

contains the result of `A*x`

.
The `@fadds`

operation performs an elementwise vector addition between
`y`

and `b`

, storing the result back in `y`

.
Now `y`

contains the result of `A*x + b`

.

### Using builtins to initialize tensors¶

You may have noticed one other slight change to this code.
Instead of initializing `x`

, `b`

, and `y`

, in the `initialize`

function,
we make use of builtins to provide values for them at declaration:

```
var x = @constants([N]f32, 1.0);
var b = @constants([M]f32, 2.0);
var y = @zeros([M]f32);
```

The `@constants`

builtin returns a tensor of the specified type,
with all elements initialized to the specified value.
Thus, `x`

is initialized as an `N`

element tensor of all ones,
and `b`

is initialized as an `M`

element tensor of all twos.

The `@zeros`

builtin is rather obvious. `y`

is initialized as an `M`

element tensor of all zeros.

## Compiling and running the program¶

As with the previous tutorial, we compile and run this code using:

```
$ cslc layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out
$ cs_python run.py --name out
```

You should see a `SUCCESS!`

message at the end of execution.

## Exercises¶

`A`

is stored row-major in the above code.
How would you rewrite `A_dsd`

and the `gemv`

function
if `A`

were stored column major instead?

## Next¶

In the next tutorial, we’ll introduce host-to-device `memcpy`

,
and copy host-initialized values for `A`

, `x`

, and `b`

onto
the device.