
ASIC design for a 2x2 systolic matrix multiplier supporting multiply-accumulate operations on int8 data, alongside design-for-test infrastructure to help debug usage and diagnose design issues in silicon.
This MAC accelerator operates at up to 50 MHz and is capable of reaching up to 100 MMAC/s or 200 MIOPS.
The goal of the MAC accelerator is to perform a matrix-matrix multiplication between the input data matrix $I$ and the weight matrix $W$.
\begin{gather}
I \times W = R \\
\begin{pmatrix}
i_{0,0} & i_{1,0} \\
i_{0,1} & i_{1,1}
\end{pmatrix}
\times
\begin{pmatrix}
w_{0,0} & w_{1,0} \\
w_{0,1} & w_{1,1}
\end{pmatrix} =
\begin{pmatrix}
i_{0,0}w_{0,0}+i_{1,0}w_{0,1} & i_{0,0}w_{1,0}+i_{1,0}w_{1,1}\\
i_{0,1}w_{0,0}+i_{1,1}w_{0,1} & i_{0,1}w_{1,0}+i_{1,1}w_{1,1}\end{pmatrix}
\end{gather}
This MAC accelerator has 4 units; from this point on, we will refer to each MAC unit by its unique $(x,y)$ coordinates.
Each MAC unit calculates the MAC operation $c_{(t,x,y)}$, where:
\begin{gather}
c_{(t,x,y)} = i_{(t,y)} \times w_{(x,y)} + c_{(t-1,x,y-1)}
\end{gather}
This accelerator was designed to operate on signed 8-bit integers, but the successive application of the 8-bit multiplication and addition pushes the resulting value up to 17 bits. To prevent the size of the base datatype from growing with each successive MAC operation, we need to clamp the result back into the 8-bit range.
As such, the MAC unit applies an additional clamping function $clamp_{i8}$ that remaps:
\begin{gather}
clamp_{i8}(c_{(t,x,y)}) = \begin{cases}
127 &\text{if } c_{(t,x,y)} > 127\\
c_{(t,x,y)} &\text{if } c_{(t,x,y)} \in [-128,127] \\
-128 &\text{if } c_{(t,x,y)} < -128\\
\end{cases}
\end{gather}
Our final full MAC operation is as follows:
\begin{gather}
c_{(t,x,y)} = clamp_{i8}(i_{(t,y)} \times w_{(x,y)} + c_{(t-1,x,y-1)})
\end{gather}
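As a reference, here is a minimal C sketch of this per-unit operation; the widened 32-bit intermediate stands in for the 17-bit hardware accumulator, and the function names are illustrative rather than taken from the RTL:

```c
#include <stdint.h>

// Saturate a widened intermediate value back into the signed 8-bit range.
static int8_t clamp_i8(int32_t v) {
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

// One MAC step: multiply the incoming input by the unit's weight, add the
// partial result passed in from the neighboring unit, and clamp to int8.
static int8_t mac_step(int8_t i, int8_t w, int8_t c_prev) {
    return clamp_i8((int32_t)i * (int32_t)w + (int32_t)c_prev);
}
```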
At each MAC timestep $t+1$, the partial results $c$ computed at timestep $t$ move one unit down the $y$ axis of the array, while fresh input data streams into the units.
This data streaming allows such designs to make efficient use of data, re-using it multiple times as it circulates through the array. Each value contributes to several final results without spending time on expensive data accesses, allowing us to dedicate more of our silicon area and cycles to compute.
Assuming a pre-configured $W$ weight matrix is being reused and the accelerator is receiving a gapless stream of multiple $I$ input matrices, this MAC accelerator is capable of computing up to 100 MMAC/s or 200 MIOPS.
Accelerator operations stall if a MAC operation has a data dependency on data that has yet to arrive. For example, calculating $r_{(0,0)}$ depends on both $i_{(0,0)}$ and $i_{(1,0)}$. In practice, each operation depends on two pieces of input data, yet our input interface is only 8 bits wide and can transfer only a single $i_{(x,y)}$ per cycle.
This IO bottleneck means our accelerator actually operates at half its maximum capacity. If the IO interface were either (a) at least 16 bits wide, or (b) 8 bits wide but operating at 100 MHz, our maximum throughput would be 200 MMAC/s or 400 MIOPS.
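To make the arithmetic explicit: each input matrix occupies 4 bus cycles at 8 bits per cycle and yields 8 MAC operations (4 result elements, 2 MACs each), and each MAC counts as two integer operations. Doubling the IO bandwidth halves the cycles per matrix:
\begin{gather}
\frac{8 \text{ MAC}}{4 \text{ cycles}} \times 50 \text{ MHz} = 100 \text{ MMAC/s} = 200 \text{ MIOPS} \\
\frac{8 \text{ MAC}}{2 \text{ cycles}} \times 50 \text{ MHz} = 200 \text{ MMAC/s} = 400 \text{ MIOPS}
\end{gather}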
The typical sequence to offload matrix operations to the accelerator goes as follows:
1. Reset the weight and data streaming indexes (if their current state is unknown).
2. Configure the weight matrix $W$.
3. Stream in the input matrix $I$.
4. Receive the result matrix $R$.
This design doesn't feature on-chip SRAM and has only limited on-chip memory. Given that weights have high spatial and temporal locality, the design allows each weight to be configured once per MAC unit, and that configuration can be reused across multiple input matrices. The input matrix, on the other hand, is expected to be provided anew on each usage.
Given our input and output data buses are only 8 bits wide, the matrices are flattened for data transfers to and from the chip in the following row-by-row order: $m_{0,0}, m_{1,0}, m_{0,1}, m_{1,1}$.
Notes:
- Cycles below are clocked according to the clk pin.
- Gaps, during which data_v_i goes low in the middle of a transfer, are supported for both the input matrix and the weights.
- Given we are not sending an index alongside each data transfer to indicate which weight/data coordinate each transfer corresponds to, the MAC accelerator keeps track of the next index internally. As such, if a partial transfer occurs due to external reasons, it becomes necessary to reset this index using the reset sequence described below.
The weight streaming index and the data streaming index can be reset independently; each reset requires a single data transfer cycle during which:
- data_v_i is set to 1
- data_mode_i is set to 0 if we are resetting the weight index, 1 to reset the data index
- data_i[7:0] is ignored
- data_rst_addr_i is set to 1

In this example we are resetting both the data streaming index and the weight index back to back.
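As an illustrative sketch (not the project firmware), a single transfer cycle and the back-to-back resets could be driven from the controlling MCU as follows; gpio_write and clock_pulse are hypothetical stand-ins for the MCU's actual pin access API:

```c
#include <stdint.h>

// Hypothetical pin helpers; substitute your MCU's GPIO API.
extern void gpio_write(const char *pin, uint8_t value);
extern void clock_pulse(void); // drive one rising edge on clk

// Perform one data transfer cycle with the given control/data values.
void mac_transfer(uint8_t mode, uint8_t data, uint8_t rst_addr) {
    gpio_write("data_v_i", 1);
    gpio_write("data_mode_i", mode);
    gpio_write("data_rst_addr_i", rst_addr);
    gpio_write("data_i", data); // 8-bit data_i[7:0] bus
    clock_pulse();
    gpio_write("data_v_i", 0);
}

// Reset the weight index, then the data index, back to back.
void reset_indexes(void) {
    mac_transfer(0 /*weights*/, 0 /*ignored*/, 1 /*rst*/);
    mac_transfer(1 /*data*/,    0 /*ignored*/, 1 /*rst*/);
}
```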

Configuring the weights takes 4 data transfer cycles, during which:
- data_v_i is set to 1
- data_mode_i is set to 0, indicating we are sending weights
- data_i[7:0] contains the weights
- data_rst_addr_i is set to 0

In this example we are configuring the weight matrix $W$ to:
W =
\begin{pmatrix}
0 & 1 \\
2 & 3
\end{pmatrix}
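Streaming this $W$ then takes four back-to-back transfers in the flattened row order. A sketch, reusing the hypothetical mac_transfer helper from the reset example above:

```c
#include <stdint.h>

// Hypothetical helper sketched in the reset example above.
extern void mac_transfer(uint8_t mode, uint8_t data, uint8_t rst_addr);

// Stream the flattened weight matrix W = {{0, 1}, {2, 3}}, row by row.
void configure_weights(void) {
    const int8_t w[4] = {0, 1, 2, 3};
    for (int k = 0; k < 4; k++)
        mac_transfer(0 /*weights*/, (uint8_t)w[k], 0 /*no reset*/);
}
```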

The implemented JTAG TAP can be used to easily debug the weight matrix configuration sequence, as its USER_REG instruction allows the user to read back the currently configured weight of each MAC unit.
Using the existing OpenOCD helper script located at jtag/openocd.cfg, the read_user_reg helper can be used to read the weights as follows:
```tcl
set r 0
for {set u 0} {$u <= $USER_REG_UNIT_MAX} {incr u} {
    puts "read internal register $u : 0x[read_user_reg $_CHIPNAME $u $r] - [print_reg_id $r]"
}
```
For the $W$ weight matrix configured in the example above, the expected output should be:
```
read internal register 0 : 0x00 - weight
read internal register 1 : 0x01 - weight
read internal register 2 : 0x02 - weight
read internal register 3 : 0x03 - weight
```
Sending the input matrix takes 4 data transfer cycles, during which:
- data_v_i is set to 1
- data_mode_i is set to 1, indicating we are sending the input matrix
- data_i[7:0] contains the input data
- data_rst_addr_i is set to 0

In this example we are sending the input data matrix $I$:
I =
\begin{pmatrix}
4 & 5 \\
6 & 7
\end{pmatrix}

When receiving a result, the ASIC will drive the following pins during 4 data transfer cycles:
- res_v_o is set to 1
- res_o[7:0] contains the result of the MAC operation for a single matrix coordinate
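As a sketch of the full data path (again using the hypothetical mac_transfer helper, plus a stand-in gpio_read), the firmware can poll res_v_o on every cycle while streaming the input, since results may begin streaming out before the last input byte arrives:

```c
#include <stdint.h>

extern void mac_transfer(uint8_t mode, uint8_t data, uint8_t rst_addr);
extern uint8_t gpio_read(const char *pin); // hypothetical pin read
extern void clock_pulse(void);             // idle cycle on clk

// Stream the flattened input matrix i[4] and collect the 4 result bytes.
void run_matrix(const int8_t i[4], int8_t r[4]) {
    int sent = 0, got = 0;
    while (got < 4) {
        if (sent < 4)
            mac_transfer(1 /*data*/, (uint8_t)i[sent++], 0);
        else
            clock_pulse(); // keep clocking while waiting for results
        if (gpio_read("res_v_o"))
            r[got++] = (int8_t)gpio_read("res_o");
    }
}
```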
If the user sends a gapless, uninterrupted stream of input data and re-uses the same weights, this accelerator is capable of up to 100 MMAC/s or 200 MIOPS.
In this example the $W$ weight matrix is configured and the $I$ data is streamed in, following which the $R$ result starts being sent out.
R = I \times W =
\begin{pmatrix}
4 & 5 \\
6 & 7
\end{pmatrix}
\times
\begin{pmatrix}
0 & 1 \\
2 & 3
\end{pmatrix}
=
\begin{pmatrix}
10 & 19 \\
14 & 27
\end{pmatrix}
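A small host-side golden model is useful for checking the silicon output. Below is a minimal sketch that mirrors the accelerator's per-step int8 clamping and reproduces the matrix above:

```c
#include <stdint.h>
#include <stdio.h>

// Saturate into the signed 8-bit range (same as the hardware clamp).
static int8_t clamp_i8(int32_t v) {
    return v > 127 ? 127 : (v < -128 ? -128 : (int8_t)v);
}

// R = I x W with clamping applied after every MAC, as in the accelerator.
static void matmul_i8(const int8_t I[2][2], const int8_t W[2][2],
                      int8_t R[2][2]) {
    for (int y = 0; y < 2; y++)       // result row
        for (int x = 0; x < 2; x++) { // result column
            int8_t c = 0;
            for (int k = 0; k < 2; k++)
                c = clamp_i8((int32_t)I[y][k] * W[k][x] + c);
            R[y][x] = c;
        }
}

int main(void) {
    const int8_t I[2][2] = {{4, 5}, {6, 7}};
    const int8_t W[2][2] = {{0, 1}, {2, 3}};
    int8_t R[2][2];
    matmul_i8(I, W, R);
    printf("%d %d\n%d %d\n", R[0][0], R[0][1], R[1][0], R[1][1]); // 10 19 / 14 27
    return 0;
}
```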

Internally, the accelerator takes at most 4 cycles to produce a result from incoming data. This accounts for incoming data latching, circulating the data through the entire systolic array, and output streaming. The accelerator moves incoming data through the array as soon as it is available. Because of this, and since this accelerator supports gaps in the incoming data stream, if the last data transfer of $i_{(1,1)}$ is delayed by at least 2 cycles, the result will start streaming out before all of the input matrix has finished streaming in.
This is why, in the firmware (firmware/main.c), we set up the DMA stream to receive the data before we start sending the input matrix, as the gap between sending and getting the result is too small for the controlling MCU to perform any type of compute.

| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | tck | result_o[0] | data_i[7] |
| 1 | data_i[0] | result_o[1] | data_valid_i |
| 2 | data_i[1] | result_o[2] | data_mode_i |
| 3 | data_i[2] | result_o[3] | data_rst_addr_i |
| 4 | data_i[3] | result_o[4] | tdi |
| 5 | data_i[4] | result_o[5] | tms |
| 6 | data_i[5] | result_o[6] | tdo |
| 7 | data_i[6] | result_o[7] | result_v_o |
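For firmware convenience, the pin map can be mirrored as bit-position constants. These defines are illustrative and simply restate the table above:

```c
// Dedicated input port: bit 0 is tck, bits 1..7 carry data_i[0..6].
#define IN_TCK           0
#define IN_DATA_I(n)     ((n) + 1) // valid for n = 0..6; data_i[7] is on the bidir port

// Bidirectional port bit positions.
#define IO_DATA_I7       0
#define IO_DATA_VALID_I  1
#define IO_DATA_MODE_I   2
#define IO_DATA_RST_ADDR 3
#define IO_TDI           4
#define IO_TMS           5
#define IO_TDO           6
#define IO_RESULT_V_O    7

// The dedicated output port carries result_o[7:0] directly.
```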