### SNACKNOC: PROCESSING IN THE COMMUNICATION LAYER

### Karthik Sangaiah, Michael Lui, Ragh Kuttappa, Baris Taskin, and Mark Hempstead



Feb 25<sup>th</sup> 2020

**VLSI** and Architecture Lab

### **Opportunistic Resources for Graduate Students**





### Steak dinner





Opportunistically collecting snacks towards a meal.

### **Opportunistic Resources in the CMP**



Intel Skylake 8180 HCC<sup>[1]</sup>



Opportunistically collecting "snacks" to make a "meal".








# What is the performance gain we add by opportunistically "snacking" on CMP resources?

- NoC designed to minimize latency during **heavy** traffic
  - NoC implementation can account for 60% to 75% of the miss latency<sup>[2]</sup>
- Study of NoC resource utilization in recent NoC designs
  - Three selected best-paper-nominated NoCs have similar performance:
    - DAPPER<sup>[3]</sup>, AxNoC<sup>[4]</sup>, BiNoCHS<sup>[5]</sup>
  - Reducing resources substantially reduced performance
    - Further details of the study are in our paper

```
[2] Sanchez et al., ACM TACO, 2010.
[3] Raparti et al., IEEE/ACM NOCS, 2018.
[4] Ahmed et al., IEEE/ACM NOCS, 2018.
[5] Mirhosseini et al., IEEE/ACM NOCS, 2017.
```

- Opportunities in Network-on-Chip Slack
  - Crossbar
  - Network Links
  - Internal Buffers





#### **Crossbar Utilization**

Figure: utilization plotted per router, R0 through R15.

- Simulated 16-core CMP with 4 benchmarks representing "low", "medium", "medium-high", and "high" traffic
- Crossbar utilization:
  - Peak utilization (Graph500): 42%
  - Highest median utilization (Graph500): 13.3%
- Link utilization:
  - Peak link utilization (Graph500): 18%
  - Highest median link utilization (LULESH): 3.3%
- Buffer utilization:
  - Raytrace: 4% of cycles have localized contention
  - 10% utilization during contention
  - For 3M of the 2.4T flits forwarded, buffer utilization reaches 30-55% of total capacity
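
Peak and median figures like those above can be derived from per-window samples. The sketch below is a minimal illustration, assuming utilization is measured as busy cycles over total cycles in fixed sampling windows. The function names and the windowing scheme are hypothetical and are not taken from the study.

```c
#include <assert.h>
#include <stdlib.h>

/* Utilization of one sampling window in percent: busy cycles / window length. */
double window_utilization(unsigned long busy_cycles, unsigned long window_cycles)
{
    assert(window_cycles > 0);
    return 100.0 * (double)busy_cycles / (double)window_cycles;
}

/* Comparison callback for qsort over doubles. */
int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Peak (maximum) utilization over n windows. */
double peak_utilization(const double *util, size_t n)
{
    assert(n > 0);
    double peak = util[0];
    for (size_t i = 1; i < n; i++)
        if (util[i] > peak)
            peak = util[i];
    return peak;
}

/* Median utilization over n windows; sorts a private copy of the samples. */
double median_utilization(const double *util, size_t n)
{
    assert(n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    for (size_t i = 0; i < n; i++)
        tmp[i] = util[i];
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double med = (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return med;
}
```

A large gap between peak and median, as reported for the crossbar, indicates that high utilization is confined to a small share of the sampling windows.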




The SnackNoC platform improves **efficiency** and **performance** of the CMP by offloading data-parallel workloads and "snacking" on network resources.

Outline:

- "Slack" of the Communication Fabric
- The SnackNoC Platform
- Experimental Results
- Conclusion and Future Considerations

### The SnackNoC Platform

#### Goals:

- Opportunistically "Snack" on existing network resources for additional performance
- Limited additional overhead to uncore
- Minimal or zero interference to CMP traffic
- Opportunistic NoC-based compute platform
  - Limited dataflow engine
  - Applications:
    - Data-parallel workloads used in scientific computing, graph analytics, and machine learning

Figures: Celerity RISC-V SoC<sup>[6]</sup>, Google Cloud TPU<sup>[7]</sup>, and Intel Skylake 8180 HCC<sup>[1]</sup>.

[1] Intel Skylake SP HCC, WikiChip.

[6] S. Davidson et al., IEEE Micro, 2018.

[7] Jouppi et al., IEEE/ACM ISCA, 2017.

- Added components to a traditional NoC:
  - Central Packet Manager (CPM)
    - Assembles and issues instruction packets
    - Manages the execution state of kernels
    - Located at the memory controller
  - Router Compute Units (RCUs)
    - Lightweight accumulator-based processing element (PE)
      - Instruction buffering
      - ALU
    - Located in the router pipeline
- Added features to a traditional NoC:
  - CPU traffic priority arbitration
  - Available NoC buffers used as transient data storage



- Router Compute Units (RCUs)
  - 32-bit accumulator-based processing element
  - Instruction re-ordering and buffering
- Modifications to input buffer queues, allocators, and crossbar
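
To make "accumulator-based" concrete, here is a minimal behavioral sketch of a PE with a single 32-bit accumulator. The opcode names, the instruction layout, and the functions `rcu_step` and `rcu_run` are illustrative assumptions; the actual SnackNoC instruction encoding is not given in these slides, and instruction re-ordering and buffering are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; the real SnackNoC encoding is not specified here. */
typedef enum { OP_LOAD, OP_ADD, OP_SUB, OP_MAC } rcu_op;

typedef struct {
    rcu_op  op;
    int32_t a;  /* first operand */
    int32_t b;  /* second operand, used only by OP_MAC */
} rcu_instr;

typedef struct {
    int32_t acc;  /* the single 32-bit accumulator */
} rcu_state;

/* Execute one instruction against the accumulator (overflow is not handled). */
void rcu_step(rcu_state *s, rcu_instr in)
{
    assert(s != NULL);
    switch (in.op) {
    case OP_LOAD: s->acc = in.a;          break;
    case OP_ADD:  s->acc += in.a;         break;
    case OP_SUB:  s->acc -= in.a;         break;
    case OP_MAC:  s->acc += in.a * in.b;  break;
    }
}

/* Run a short program from a zeroed accumulator and return the final value. */
int32_t rcu_run(const rcu_instr *prog, size_t n)
{
    rcu_state s = { 0 };
    for (size_t i = 0; i < n; i++)
        rcu_step(&s, prog[i]);
    return s.acc;
}
```

Because every instruction reads and writes the same accumulator, a chain of MAC operations computes a dot product without any register file, which is one reason such a PE can stay small.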




### **CPU Traffic Priority Arbitration**

- The primary function of the NoC is to transfer CPU core and memory traffic
  - "Fair" allocators typically select traffic in round-robin order
  - Allocators are modified to prioritize CPU traffic over SnackNoC instruction or data traffic
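
The difference between the two policies can be sketched as follows. This is a simplified single-output arbiter written for illustration, assuming five input ports and one pending request per port; `NUM_PORTS`, `req_kind`, and the function names are hypothetical and do not describe the actual allocator RTL.

```c
#include <assert.h>

#define NUM_PORTS 5  /* assumption: 4 mesh directions plus the local port */

typedef enum { REQ_NONE, REQ_CPU, REQ_SNACK } req_kind;

/* Round-robin search starting at `start` for a port whose request matches
 * `want`, or any pending request when want == REQ_NONE. Returns -1 if none. */
static int rr_search(const req_kind req[NUM_PORTS], int start, req_kind want)
{
    for (int i = 0; i < NUM_PORTS; i++) {
        int p = (start + i) % NUM_PORTS;
        if (req[p] != REQ_NONE && (want == REQ_NONE || req[p] == want))
            return p;
    }
    return -1;
}

/* Fair policy: grant the next requester in round-robin order. */
int arbitrate_fair(const req_kind req[NUM_PORTS], int *rr_ptr)
{
    assert(rr_ptr != 0);
    int grant = rr_search(req, *rr_ptr, REQ_NONE);
    if (grant >= 0)
        *rr_ptr = (grant + 1) % NUM_PORTS;
    return grant;
}

/* Priority policy: CPU requests win; SnackNoC is granted only when no
 * CPU request is pending. Round-robin order is kept within each class. */
int arbitrate_cpu_priority(const req_kind req[NUM_PORTS], int *rr_ptr)
{
    assert(rr_ptr != 0);
    int grant = rr_search(req, *rr_ptr, REQ_CPU);
    if (grant < 0)
        grant = rr_search(req, *rr_ptr, REQ_SNACK);
    if (grant >= 0)
        *rr_ptr = (grant + 1) % NUM_PORTS;
    return grant;
}
```

Under the priority policy, SnackNoC flits only consume cycles in which no CPU flit is competing for the same output.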



### **Transient Data Storage**

- Input buffers typically have low contention
  - Available buffers and bandwidth can be used as transient storage
    - Useful to keep intermediate results and read-only values on chip






#### C-code APIs for Matrix-multiply

`snA = sn_create_mat(cxt, "A", A, I, m);`

`snB = sn_create_mat(cxt, "B", B, m, n);`

`snC = sn_create_mat_mul(cxt, snA, snB);`
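
For reference, the result that `snC` presumably represents is the ordinary matrix product of the two operands. The routine below is a plain host-side reference and is not part of the SnackNoC API. The name `matmul_ref`, the row-major layout, and the `int32_t` element type are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C (rows_a x cols_b) = A (rows_a x inner) * B (inner x cols_b).
 * All matrices are stored row-major; overflow is not handled. */
void matmul_ref(const int32_t *A, const int32_t *B, int32_t *C,
                size_t rows_a, size_t inner, size_t cols_b)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < rows_a; i++) {
        for (size_t j = 0; j < cols_b; j++) {
            int32_t acc = 0;  /* one accumulator per output element */
            for (size_t k = 0; k < inner; k++)
                acc += A[i * inner + k] * B[k * cols_b + j];
            C[i * cols_b + j] = acc;
        }
    }
}
```

Each output element is a chain of multiply-accumulate steps into one accumulator, which is the kind of data-parallel work the RCUs are described as targeting.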


























### Experimental Results

# Methodology

#### Experiments:

1. Assess the performance of SnackNoC
   - How many additional cores' worth of performance can SnackNoC provide opportunistically?
2. Quantify the performance interference of operating SnackNoC on the CPU cores



Implemented four SnackNoC kernels (SGEMM, Reduction, MAC, SPMV)



Executed 16 multi-threaded benchmarks from PARSEC3, Splash2X, FastForward2 to assess performance interference

- SnackNoC is modeled in the gem5 simulation framework
- To quantify performance, four SnackNoC kernels are executed on:
  1. Simulated CMP with the SnackNoC platform
     - Compiled to SnackNoC instructions
  2. Native Dell server with Intel Xeon E5-2660
     - C++ multi-threaded with OpenMP

| Native CPU Parameters | Configuration         |
|-----------------------|-----------------------|
| Processor             | Intel Xeon E5-2660 v3 |
| Core Frequency        | 2.6 GHz               |
| L1 I&D Cache          | 32KB, 8-way           |
| L2 Cache              | 256KB, 8-way          |
| L3 Cache              | 20MB, 20-way          |

| SnackNoC Parameters       | Configuration |
|---------------------------|---------------|
| RCU Count                 | 16 RCUs       |
| RCU Frequency             | 1 GHz         |
| Flit Priority Arbitration | ON/OFF        |

| Simulated CMP Parameters | Configuration                     |
|--------------------------|-----------------------------------|
| Core Count               | 16 in-order cores                 |
| Core Frequency           | 2 GHz                             |
| L1 I&D Cache             | 32KB, 4-way                       |
| L2 Cache                 | 256KB, 4-way                      |
| NoC Topology             | 2D 4x4 Mesh, 4 Memory Controllers |
| NoC Flit Size            | 32B                               |
| # Virtual Channels       | 4                                 |
| # Buffers                | 4                                 |

- SnackNoC kernels are executed on an increasing number of native CPU cores to find the core count with performance comparable to SnackNoC
- CMP performance increases roughly linearly with core count, with the exception of SPMV
- Performance gain is equivalent to between 2 and 6 x86 out-of-order (OOO) cores
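
One way to turn such a core sweep into an "equivalent cores" figure is to interpolate between the measured runtimes. The sketch below assumes the native runtime does not increase as cores are added; the function name `equivalent_cores` and the interpolation scheme are assumptions for illustration, not the procedure used in the paper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * native_time[i] is the measured kernel runtime on (i + 1) native cores,
 * assumed non-increasing in i. Returns the (possibly fractional) core count
 * at which the native runtime would equal snack_time, using linear
 * interpolation between neighbouring measurements.
 * Returns 0.0 if SnackNoC is slower than a single core, and (double)n if it
 * is faster than every measured configuration.
 */
double equivalent_cores(const double *native_time, size_t n, double snack_time)
{
    assert(native_time != NULL && n > 0);
    if (snack_time > native_time[0])
        return 0.0;
    for (size_t i = 0; i + 1 < n; i++) {
        double t_hi = native_time[i];      /* runtime on i + 1 cores */
        double t_lo = native_time[i + 1];  /* runtime on i + 2 cores */
        if (snack_time <= t_hi && snack_time >= t_lo) {
            if (t_hi == t_lo)
                return (double)(i + 1);
            return (double)(i + 1) + (t_hi - snack_time) / (t_hi - t_lo);
        }
    }
    return (double)n;
}
```

Linear interpolation is only a reasonable approximation where scaling is close to linear, so a kernel such as SPMV that departs from linear scaling would need the measured points read directly.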



## SnackNoC Area and Power Overhead

- RTL of the SnackNoC components implemented and synthesized with Synopsys Design Compiler:
  - 45nm NCSU technology node
  - Operating frequency: 1 GHz
- Single RCU per NoC router
  - Under 10% additional power and area per router
- Single CPM per NoC
  - 12.85% additional power per NoC
  - 33.04% additional area per NoC
    - Largest contributor is the instruction buffer

| Router Compute Unit (RCU)            | Additional Power (%) | Additional Area (%) |
|--------------------------------------|----------------------|---------------------|
| 32-bit Parallel Adder                | 1.14%                | 1.15%               |
| 32-bit Parallel Subtractor           | 1.14%                | 1.15%               |
| 32-bit Multiply and Accumulate (MAC) | 2.05%                | 1.73%               |
| Ordered Instruction Buffer           | 2.05%                | 2.30%               |
| Dependency Buffer                    | 2.51%                | 1.15%               |
| Accumulator Buffer                   | 0.68%                | 0.12%               |
| Sub Block List                       | 0.23%                | 1.73%               |
| Total                                | 9.81%                | 9.33%               |

| Central Packet Manager (CPM) | Additional Power (%) | Additional Area (%) |
|------------------------------|----------------------|---------------------|
| Assembly Logic and Buffers   | 0.08%                | 2.43%               |
| Kernel State                 | 0.16%                | 0.10%               |
| Instruction Buffer           | 10.71%               | 25.75%              |
| Offload Data Memory Buffer   | 0.95%                | 2.28%               |
| Output Result FIFO           | 0.95%                | 2.28%               |
| Total                        | 12.85%               | 33.04%              |

- The full uncore of the 16-core CMP is modeled in 45nm with CACTI 7.0 and ORION 3.0
- The 16-RCU SnackNoC contributes only 1.6% and 1.1% to power and area, respectively
- Satisfies the goal of limited overhead



- To quantify performance interference, the performance of the CMP is compared with and without SnackNoC traffic
  - Simulated 16-core CMP with benchmarks from PARSEC3, Splash2X, and FastForward2
  - SnackNoC kernels are executed simultaneously


#### Minimal impact of "Snacking" on CMP performance





- Performance impact varies with NoC utilization
  - Peak 1.1% performance impact on CMP cores
  - Average impact of ~0.30% for SGEMM, MAC, and SPMV, and 0.11% for Reduction
- SnackNoC kernel completion time is impacted by at most 3.9% with fair arbitration

















- Adding priority flit arbitration for CMP traffic:
  - Average performance impact drops from 0.25% to 0.17%
  - Reduces flit interference by up to 92%
  - Peak performance impact with priority arbitration is 0.83%
- Satisfies the goal of limited performance impact

Figure: CMP performance impact by benchmark (x-axis includes Water-Spatial, Graph500, and Average), comparing SPMV and SPMV with priority arbitration.



## **Conclusion and Future Considerations**

- Opportunistically "snacking" on NoC resources can add performance to our CMPs
  - Added 2 to 6 cores' worth of performance with only a 1.3% increase in uncore area
- Further tradeoffs we're investigating:
  1. Growing application coverage
  2. Scaling compute density
  3. Supporting future topologies



### Questions?

Main Contributions:

- Quantified design slack in the communication fabric
- Opportunistically adds the performance of 2 to 6 cores to the CMP by repurposing NoC resources with low overhead

Karthik Sangaiah, Michael Lui, Ragh Kuttappa, Baris Taskin [Drexel University], and Mark Hempstead [Tufts University], "SnackNoC: Processing in the Communication Layer", Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 2020.

<u>http://vlsi.ece.drexel.edu/ & https://sites.tufts.edu/tcal/</u>