So last time I mentioned that I didn’t really want to get into anything too complicated. However, I decided to go the whole nine yards and build an implementation of Strassen’s Algorithm for matrix multiplication, to compare it against the more conventional methods I was experimenting with. Thanks to a little help from CLRS I got the job done in a few hours, and began running some benchmarks.
Disclaimer: I am not some kind of rocket scientist who is giving you a guarantee the algorithms I implemented are perfect or super-optimized. I’ll only give you the assurance that they are a reasonable implementation in C++.
Strassen’s algorithm is interesting because given two matrices A and B to be multiplied, it uses divide-and-conquer to break each input matrix into submatrices and then performs seven matrix multiplications on these. Normally we would perform eight matrix multiplications for a naive algorithm. As a result, the asymptotic complexity works out to O(nlog27) = O(n2.8073), which is better (again, in theory) than naive implementations, which run at O(n3).
My initial experiments with Strassen’s were fairly disappointing. It ran significantly more slowly than a transpose-naive algorithm, and consumed considerably more memory. In addition, Strassen’s requires the input matrices to be square, with each dimension a power of two. If A and B are not a power of two, they are padded with zeroes. This adds preliminary overhead to its execution, not to mention it artificially inflates the size of the matrices to be potentially significantly larger, which means more computations. Also, its extremely inefficient on small matrices. So here are my customized hacks to the vanilla algorithm:
- Input matrices or submatrices which have dimension N <= 256 are multiplied using my transpose-naive algorithm from earlier; larger ones continue with recursive Strassen operations.
- Sections of the algorithm which focus on multiplying submatrices from zero-padded sections of the input matrices skip computation and return a zero-matrix.
In addition, I also implemented a parallel version of the algorithm:
- After the initial compartmentalization of the two input matrices, seperate threads handle the subsequent seven recursive submatrix multiplication operations.
My simple benchmarks involved multiplying random matrices of incrementally growing sizes and timing how long it took for each N x N multiplication. Each was repeated numerous times and I took the average. On to the pictures:
As you can see, I had to stop collecting data for the naive algorithm, because it was bringing the experiment to a standstill. We all knew how that one would perform; I thought it would be interesting to see it in perspective. Remember that the Naive and Transpose algorithms are asymptotically equal (!). Here’s a version with the naive data removed:
- The transpose-naive algorithm performed very well with smaller matrices, however as N grew past 3000 elements, its performance degraded badly.
- I don’t know why the transpose-naive method resulted in such uneven growth.
- The Strassen algorithm performed poorly until about N = 3000, at which point it began to outperform the parallel transpose-naive algorithm.
- Strassen’s Algorithm suffers a sharp performance drop each time N grows past a power of 2, since it needs to expand the input matrices to the next larger power of 2.
- After N = 2500, the parallel Strassen’s Algorithm was the clear winner.
To see what I mean about the performance penalty suffered by Strassen’s, take a look at the graph zoomed-in between 0 < N < 2500:
At N = 1024, 2048, you can see the sharp jump in runtime in Strassen’s compared to the other conventional algorithms. Unfortunately, I had to end the test around N = 4000 because the Strassen calculation caused my computer to run out of memory. I would make the assumption that similar performance penalties are suffered at larger powers of 2.
So, what are the conclusions? Well, it appears that the long-term clear winner is a parallel implementation of Strassen’s Algorithm. In general, this algorithm (both in serial and parallel) performed well on large matrices, but don’t necessarily outperform more simple algorithms when the matrix size is not close to the upper power of 2. When the matrix is close to the upper power of two, Strassen’s can even outperform parallel versions of simpler methods, even with all its extra nasty overhead. Pretty cool – theory wins.
Update: I’ve posted most of the source code I used for this post here