Fulcrum: A Simplified Control and Access Mechanism Toward Flexible and Practical In-Situ Accelerators

2020 
In-situ approaches process data very close to the memory cells, in the row buffer of each subarray. This minimizes data-movement costs and affords parallelism across subarrays. However, current in-situ approaches are limited to row-wide bitwise (or few-bit) operations applied uniformly across the row buffer. They impose the significant overhead of multiple row activations to emulate 32-bit addition and multiplication with bitwise operations, and they cannot support operations with data dependencies or operations based on predicates. Moreover, with current peripheral logic, communication among subarrays is inefficient, and with typical data layouts, the bits of a word are not physically adjacent. The key insight of this work is that in-situ, single-word ALUs outperform in-situ, parallel, row-wide, bitwise ALUs by reducing the number of row activations and enabling new operations and optimizations. Our proposed lightweight access and control mechanism, Fulcrum, sequentially feeds data into the single-word ALU and enables operations with data dependencies and operations based on a predicate. For algorithms that require communication among subarrays, we augment the peripheral logic with broadcasting capabilities and a previously proposed method for low-cost inter-subarray data movement. The sequential processor also enables overlapping broadcasting with computation and reuniting the bits of a word that are not physically adjacent. To realize true subarray-level parallelism, we introduce a lightweight column-selection mechanism that shifts one-hot encoded values, enabling independent column selection in each subarray. We integrate Fulcrum with Compute Express Link (CXL), a new interconnect standard. Fulcrum with one memory stack delivers (i) on average 23.4× (up to 76×) speedup over a server-class GPU, an NVIDIA P100 with three stacks of HBM2 memory, (ii) 70× (up to 228×) speedup per memory stack over the GPU, and (iii) 19× (up to 178.9×) speedup per memory stack over an ideal model of the GPU, which accounts only for the overhead of data movement.
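The abstract's core mechanism pairs a one-hot column selector, shifted one position per step to walk the row buffer, with a single-word ALU that supports data dependencies and predication. The C sketch below is only an illustration of that control flow, not the paper's hardware design; the names subarray_t, ROW_WIDTH_WORDS, and sequential_predicated_sum are hypothetical. It computes a predicated running sum, an operation with a loop-carried dependency that row-wide bitwise in-situ ALUs cannot express directly.

```c
/* Illustrative software model of one Fulcrum-style subarray.
 * Assumed names throughout; this is a sketch, not the paper's RTL. */
#include <stdint.h>
#include <stdio.h>

#define ROW_WIDTH_WORDS 8  /* words per row buffer (tiny, for illustration) */

typedef struct {
    uint32_t row_buffer[ROW_WIDTH_WORDS]; /* latched DRAM row */
    uint32_t col_select;                  /* one-hot: currently selected word */
} subarray_t;

/* Walk the row buffer sequentially: shift the one-hot selector and apply a
 * 32-bit add into an accumulator, honoring a per-word predicate bit. The
 * running accumulator carries a dependency from one step to the next. */
static uint32_t sequential_predicated_sum(subarray_t *s, uint8_t predicate)
{
    uint32_t acc = 0;
    s->col_select = 1u;                    /* start at word 0 */
    for (int i = 0; i < ROW_WIDTH_WORDS; i++) {
        if (predicate & s->col_select)     /* predicate bit for this word */
            acc += s->row_buffer[i];       /* single-word 32-bit ALU op */
        s->col_select <<= 1;               /* shift the one-hot selector */
    }
    return acc;
}

int main(void)
{
    subarray_t s = { .row_buffer = {1, 2, 3, 4, 5, 6, 7, 8} };
    /* Predicate 0x55 selects even-indexed words: 1 + 3 + 5 + 7 = 16. */
    printf("sum = %u\n", sequential_predicated_sum(&s, 0x55));
    return 0;
}
```

The one-hot shift register here stands in for what the abstract describes as a lightweight alternative to full per-subarray column decoding, which is what makes independent column selection in every subarray affordable.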