Tanay Bennur and Elijah Baraw
We will implement and optimize the asymptotically efficient ($O(n^{\lg 7})$) Strassen and Strassen-Winograd matrix multiplication algorithms on GPUs. We will target square and "square-ish" matrices and produce a report comparing the performance of our asymptotically efficient algorithms to existing implementations of the $O(n^3)$ algorithm, to determine whether the asymptotic advantage outweighs the added algorithmic complexity.
Update 28 April 2025
The final iteration of our algorithm beats cuBLAS for $N \geq 4096$, and our report is available here.
Update 15 April 2025
There is a large volume of literature on optimizing matrix multiplication, so we began by reading papers to get a high-level understanding of the field as a whole. Eventually, we found an article 1 that gave a good introduction to common matrix multiplication optimizations and included an evaluation framework that we incorporated into our project. We then shifted our attention to the Strassen algorithm itself - it has a half-dozen variants that we looked into in detail to determine which ones are worth implementing. We settled on a pebbling-based variant introduced in 2023 2, and created a preliminary Python implementation to produce pseudo-code and a reference for intermediate variables.
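To make the structure concrete, below is a minimal NumPy sketch in the spirit of that Python reference: one level-by-level classic Strassen recursion with a base-case cutoff. The function name, the cutoff value, and the assumption of square, power-of-two matrices are all illustrative; the pebbled variant we actually use reorders and reuses these intermediates to reduce temporaries, which is not shown here.

```python
import numpy as np

def strassen(A, B, cutoff=256):
    """Illustrative classic Strassen: recurse on 2x2 blocks, fall back to a
    standard multiply below the cutoff. Assumes square, power-of-two N."""
    n = A.shape[0]
    if n <= cutoff:                      # base case: ordinary O(n^3) multiply
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven recursive products instead of eight
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11,       cutoff)
    M3 = strassen(A11,       B12 - B22, cutoff)
    M4 = strassen(A22,       B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22,       cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    # Recombine the seven products into the four output blocks
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Sanity check against NumPy's standard multiply
A = np.random.rand(512, 512); B = np.random.rand(512, 512)
assert np.allclose(strassen(A, B), A @ B)
```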
We then implemented the pebbled version of the Strassen algorithm in CUDA. Our initial implementation was slow, so we iteratively improved it by increasing memory reuse and merging related computations. We also added the recursive component of the algorithm, further boosting performance. Eventually, we achieved better performance than the cuBLAS library on the GHC RTX 2080s for square matrices with $N \geq 8192$.
We are currently working on improving performance so that we beat cuBLAS for $N = 4096$, and hopefully even $N = 1024$. Beating cuBLAS at smaller $N$ means we can recurse more deeply for larger matrices using our faster code, so this will allow us to see a compounded advantage.
Matrix Multiplication is a fundamental operation in many algorithms related to Machine Learning, simulation, optimization and solving linear equations.
The majority of common Matrix Multiplication implementations follow the same general algorithm: compute each entry in the result matrix as the dot product of the corresponding row and column of the input matrices. These implementations can be optimized by exploiting parallelism, efficient memory access, or hardware-specific features, but are ultimately subject to the same $O(n^3)$ asymptotic bound.
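As a point of reference, here is a minimal (illustrative, untuned) Python sketch of that standard formulation:

```python
import numpy as np

def naive_matmul(A, B):
    """Standard O(n^3) multiply: each output entry is the dot product of a
    row of A and a column of B (reference sketch, not a tuned kernel)."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):          # every (i, j) entry is independent,
        for j in range(m):      # which is what GPU kernels parallelize over
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

A = np.random.rand(64, 32); B = np.random.rand(32, 48)
assert np.allclose(naive_matmul(A, B), A @ B)
```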
We wish to explore and implement alternative, asymptotically improved matrix multiplication algorithms in a parallel setting. More specifically, the Strassen and Strassen-Winograd algorithms require only $O(n^{2.807})$ multiplications and are potentially faster on matrices of dimension $> 1024$. We hope to explore the utility of these algorithms on square matrices using GPUs.
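The exponent follows from the usual divide-and-conquer recurrence: replacing the eight half-size block multiplications of the standard blocked algorithm with seven gives

$$
T(n) = 7\,T\!\left(\tfrac{n}{2}\right) + O(n^2) \;\Longrightarrow\; T(n) = O\!\left(n^{\log_2 7}\right) \approx O(n^{2.807}).
$$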
Time permitting, we will expand our implementation and exploration to squarish matrices and incorporate common matrix multiplication GPU optimizations (e.g. Shared Memory Tiling, Register Tiling, Cooperative Fetching).
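As a conceptual illustration of the tiling optimizations mentioned above, here is a minimal Python sketch of a blocked multiply. On a GPU the tiles would be staged in shared memory or registers rather than relying on the CPU cache, and the tile size and matrix shapes here are illustrative assumptions.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked/tiled multiply (illustrative): each tile of C accumulates
    products of matching tiles of A and B, demonstrating the access pattern
    that shared-memory tiling exploits. Assumes dims divisible by `tile`."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(256, 128); B = np.random.rand(128, 192)
assert np.allclose(tiled_matmul(A, B), A @ B)
```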
Matrix Multiplication exhibits inherent parallelism: in the naive algorithm, each element of the output depends only on a single row and column of the inputs, so all output elements can be computed in parallel. Every output element follows the same execution path with no divergence. There is also locality across rows and columns: elements within an output row all read the same input row, and elements within an output column all read the same input column.
The Strassen-Winograd algorithm computes terms shared between the dot products used to produce different output elements. This results in fewer total multiplications, i.e., less work, at the cost of increased memory overhead and additional data dependencies. For large matrices (e.g., those larger than 1024 x 1024), this has been observed to yield speedups on GPUs, since ample parallelism remains available.
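To make the shared terms concrete, here is a minimal NumPy sketch of one level of the Strassen-Winograd recombination: 7 block multiplications plus 15 block additions/subtractions, several of which are reused across output blocks. The variable names (S, T, P, U), the single-level structure, and the assumption of even dimensions are illustrative; a full implementation would recurse on the seven products instead of multiplying them directly.

```python
import numpy as np

def winograd_one_level(A, B):
    """One level of the Strassen-Winograd variant: 7 block products and 15
    block additions/subtractions (illustrative sketch; assumes even N)."""
    n = A.shape[0]; h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Shared sums of input blocks (reused across several products)
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

    # Seven block multiplications
    P1 = A11 @ B11; P2 = A12 @ B21; P3 = S4 @ B22; P4 = A22 @ T4
    P5 = S1 @ T1;   P6 = S2 @ T2;   P7 = S3 @ T3

    # Shared partial results reused across output blocks
    U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5

    C = np.empty_like(A)
    C[:h, :h] = P1 + P2        # C11
    C[:h, h:] = U4 + P3        # C12
    C[h:, :h] = U3 - P4        # C21
    C[h:, h:] = U3 + P5        # C22
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(winograd_one_level(A, B), A @ B)
```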
The challenge lies in efficient memory access, managing data dependencies (of which Strassen-Winograd has more than the basic multiplication algorithm), and minimizing memory overhead when implementing the Strassen-Winograd algorithm. The shared computation means storing more intermediate results that are reused across output elements, which necessitates more complex memory management and synchronization.
Starter Code:
There are a few papers that give pseudocode implementations of the algorithm along with experimental results, but they do not provide actual implementations of the Strassen or Strassen-Winograd algorithm. Additionally, the implementations that exist online have a memory overhead of 2x, and none of them give any information about how to reduce this. We plan to use these papers and pseudocode as a starting point, but will make substantial modifications to reduce the memory overhead.
Resources:
We will use the highest quality GPUs we can get access to for this assignment. If necessary, we will use GPUs on the GHC cluster, but access to resources on the PSC supercomputer or other clusters would be preferred. Ideally, we would like to use V100 GPUs to benchmark our results.
Papers:
Already Achieved:
Implemented the pebbled Strassen algorithm. Reduced memory overhead through intelligent reuse and recomputation. Experimented with the memory hierarchy and various recursion granularities. Profiled various matrix multiplication implementations (only on the GHC RTX 2080s).
Plan to Achieve:
To begin, we hope to implement baseline versions of Standard, Strassen, and Strassen-Winograd matrix multiplication on V100 GPUs. We will evaluate these implementations and ideally reduce memory overhead from 2x to 1.5x. This improvement is theoretically possible by reusing a matrix buffer within the recursive code, but is extremely complex due to data dependencies. After this, we will improve our memory accesses via prefetching, setting a minimum granularity for recursion, and making effective use of the memory hierarchy. We hope this gives a significant speedup in our code (but cannot estimate by how much at this time).
Experiment with streaming, prefetching and custom kernels. Outperform cuBLAS on matrices of dimension 4096. Test and profile our implementation on different GPU architectures.
Hope to Achieve:
With sufficient time and progress, we wish to experiment with diverse architectures and workloads.
We will primarily be targeting the V100 architecture, but hope to optimize our implementation's performance on A100, H100, and RTX 2080 architectures if the opportunity arises.
If we don't get this opportunity (or we manage to achieve it quickly), we will begin optimizing this algorithm on squarish and irregularly shaped matrices as well.
Outperform cuBLAS on matrices of dimension 1024 and 2048 (potentially dependent on GPU platform).
Matrix Multiplication decomposes into a very large number of simple, independent tasks, making GPUs an ideal platform. We wish to use V100s due to the availability and prevalence of these processors in modern ML environments.
Week 0: Research Spike on Strassen Implementation, Strassen-Winograd Implementation, Pebbled Strassen Implementation and Matrix Multiplication Optimizations
Week 1: Refactor timing and evaluation code. Implement baseline approaches for Standard and pebbled Strassen algorithms. Evaluate runtime and memory overhead for baselines on different sized matrices.
Week 2: Implement a version of the algorithms with improved memory overhead via intelligent recomputation and reuse.
Week 3: Continue implementing versions of the algorithms with improved memory overhead via intelligent recomputation and reuse (Elijah & Tanay). Explore streaming (Tanay) and further profiling (Elijah).
Week 3.5: Finalize algorithmic improvements by exploring prefetching (Elijah) and custom kernels (Tanay). Begin writing report (Elijah and Tanay).
Week 4: Continue building the report. Explore performance on various GPUs - 2080 (GHC), V100 (PSC; Tanay), smaller GPUs (gaming laptop GPUs? small-scale AI GPUs such as the GTX 1660; Elijah)
Week 4.5: Continue writing report and finetuning on diverse architectures. Start working on poster (Elijah and Tanay).