Tag Archives: algebra

The Strassen Algorithm in C++

Quite some time ago I wrote about some experiments I did with matrix multiplication algorithms. I’ve finally got around to cleaning up and posting (most of) the source code I used to generate the data in that post.

The source is now hosted in my GitHub repository here.

The project contains a few toys for playing around with matrices, but the meat is in:

src/strassen/strassen_matrix_multiplier.hpp
src/strassen/parallel_strassen_matrix_multiplier.hpp

There are a few other algorithms in there for comparison’s sake, as well as a simple testing client that provides some timing information. I’m using CMake to generate the Makefiles; if you’ve never used it before, you should. It’s great.

Some things to note:

  • The Strassen algorithm has a very high constant factor; for this reason, the code falls back to an optimized naive multiplication algorithm for smaller matrices and sub-matrices.
  • The algorithm needs to pad matrices with zeroes if they are non-square or have dimensions which are not a power of two (see the sketch after this list). As a result, if your input matrices have a square dimension just past a power of two, say N = 2^n + 1, the algorithm works with a matrix of almost twice that dimension, so it will perform badly on matrices with this characteristic.
  • This implementation is mostly a demonstrative toy in the hope someone finds it useful; it’s not particularly optimized and the parallel Strassen class is kind of naive.
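To make the padding point concrete, here is a small sketch of the idea (illustrative only, not the actual code from the repository): each dimension gets rounded up to the next power of two before the recursion starts.

// Sketch only; not the exact code from the repository.
#include <cstddef>

// Smallest power of two >= n; the Strassen recursion works on matrices
// padded out (with zeroes) to this dimension.
static std::size_t next_pow2(std::size_t n)
{
    std::size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

// For example, a 1025 x 1025 input gets padded to 2048 x 2048, which is why
// sizes just past a power of two perform so much worse than sizes just below one.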

Happy hacking!


Strassen’s Algorithm (Theory vs Application part 2)

So last time I mentioned that I didn’t really want to get into anything too complicated. However, I decided to go the whole nine yards and build an implementation of Strassen’s Algorithm for matrix multiplication, to compare it against the more conventional methods I was experimenting with. Thanks to a little help from CLRS I got the job done in a few hours, and began running some benchmarks.

Disclaimer: I am not some kind of rocket scientist giving you a guarantee that the algorithms I implemented are perfect or super-optimized. I’ll only give you the assurance that they are a reasonable implementation in C++.

Strassen’s algorithm is interesting because, given two matrices A and B to be multiplied, it uses divide-and-conquer to break each input matrix into submatrices and then performs only seven matrix multiplications on these, where a straightforward block-recursive approach would perform eight. As a result, the asymptotic complexity works out to O(n^(log2 7)) ≈ O(n^2.8073), which is better (again, in theory) than naive implementations, which run at O(n^3).
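For reference, here is one standard way of writing the seven products (my notation may differ slightly from the presentation in CLRS, but it is equivalent). Split each input into four half-size blocks, A = [A11 A12; A21 A22] and B = [B11 B12; B21 B22], then compute:

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

That is seven half-size multiplications plus a constant number of additions and subtractions, instead of the eight multiplications the straightforward block-recursive approach needs; the additions only cost O(n^2), which is where the O(n^(log2 7)) bound comes from.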

My initial experiments with Strassen’s were fairly disappointing. It ran significantly more slowly than a transpose-naive algorithm, and consumed considerably more memory. In addition, Strassen’s requires the input matrices to be square, with each dimension a power of two. If A and B don’t meet that requirement, they are padded with zeroes. This adds preliminary overhead to its execution, and it artificially inflates the matrices, potentially to a significantly larger size, which means more computation. It’s also extremely inefficient on small matrices. So here are my customized hacks to the vanilla algorithm:

  • Input matrices or submatrices with dimension N <= 256 are multiplied using my transpose-naive algorithm from earlier; larger ones continue with recursive Strassen operations (see the sketch after this list).
  • Sections of the algorithm that would multiply submatrices taken entirely from the zero-padded regions of the input skip the computation and simply return a zero matrix.
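To show how the fallback fits into the recursion, here is a stripped-down sketch of the serial version. It is illustrative only, not the code I benchmarked: it assumes square power-of-two inputs, and naive_multiply is just a stand-in for the transpose-naive fallback described in the earlier post.

// Sketch of the recursive step with the naive fallback; illustrative only.
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// Stand-in for the transpose-naive fallback (k-before-j keeps accesses row-wise).
static matrix naive_multiply(const matrix &a, const matrix &b)
{
    std::size_t n = a.size();
    matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Elementwise a + sign * b (sign = -1 gives subtraction).
static matrix add(const matrix &a, const matrix &b, double sign = 1.0)
{
    std::size_t n = a.size();
    matrix c(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[i][j] = a[i][j] + sign * b[i][j];
    return c;
}

// Copies quadrant (qi, qj) of m into an (n/2 x n/2) matrix.
static matrix quadrant(const matrix &m, std::size_t qi, std::size_t qj)
{
    std::size_t h = m.size() / 2;
    matrix q(h, std::vector<double>(h));
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            q[i][j] = m[qi * h + i][qj * h + j];
    return q;
}

static matrix strassen(const matrix &a, const matrix &b)
{
    std::size_t n = a.size();
    if (n <= 256)                    // below the threshold, Strassen's overhead
        return naive_multiply(a, b); // isn't worth it: fall back

    matrix a11 = quadrant(a, 0, 0), a12 = quadrant(a, 0, 1),
           a21 = quadrant(a, 1, 0), a22 = quadrant(a, 1, 1);
    matrix b11 = quadrant(b, 0, 0), b12 = quadrant(b, 0, 1),
           b21 = quadrant(b, 1, 0), b22 = quadrant(b, 1, 1);

    // The seven recursive products from the listing above.
    matrix m1 = strassen(add(a11, a22), add(b11, b22));
    matrix m2 = strassen(add(a21, a22), b11);
    matrix m3 = strassen(a11, add(b12, b22, -1.0));
    matrix m4 = strassen(a22, add(b21, b11, -1.0));
    matrix m5 = strassen(add(a11, a12), b22);
    matrix m6 = strassen(add(a21, a11, -1.0), add(b11, b12));
    matrix m7 = strassen(add(a12, a22, -1.0), add(b21, b22));

    // Stitch the quadrants of C back together.
    std::size_t h = n / 2;
    matrix c(n, std::vector<double>(n));
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j) {
            c[i][j]         = m1[i][j] + m4[i][j] - m5[i][j] + m7[i][j];
            c[i][j + h]     = m3[i][j] + m5[i][j];
            c[i + h][j]     = m2[i][j] + m4[i][j];
            c[i + h][j + h] = m1[i][j] - m2[i][j] + m3[i][j] + m6[i][j];
        }
    return c;
}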

In addition, I also implemented a parallel version of the algorithm:

  • After the initial partitioning of the two input matrices, separate threads handle the seven subsequent recursive submatrix multiplications (sketched below).
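A minimal sketch of that idea, reusing the matrix type and helpers from the previous sketch. I’m using std::async here just to show the shape of it; the actual parallel implementation may manage its threads differently.

// Illustrative sketch: only the first level of recursion is farmed out to
// separate threads; the recursive calls below it stay serial.
#include <future>

static matrix parallel_strassen(const matrix &a, const matrix &b)
{
    matrix a11 = quadrant(a, 0, 0), a12 = quadrant(a, 0, 1),
           a21 = quadrant(a, 1, 0), a22 = quadrant(a, 1, 1);
    matrix b11 = quadrant(b, 0, 0), b12 = quadrant(b, 0, 1),
           b21 = quadrant(b, 1, 0), b22 = quadrant(b, 1, 1);

    // Launch the seven products concurrently.
    auto f1 = std::async(std::launch::async, [&] { return strassen(add(a11, a22), add(b11, b22)); });
    auto f2 = std::async(std::launch::async, [&] { return strassen(add(a21, a22), b11); });
    auto f3 = std::async(std::launch::async, [&] { return strassen(a11, add(b12, b22, -1.0)); });
    auto f4 = std::async(std::launch::async, [&] { return strassen(a22, add(b21, b11, -1.0)); });
    auto f5 = std::async(std::launch::async, [&] { return strassen(add(a11, a12), b22); });
    auto f6 = std::async(std::launch::async, [&] { return strassen(add(a21, a11, -1.0), add(b11, b12)); });
    auto f7 = std::async(std::launch::async, [&] { return strassen(add(a12, a22, -1.0), add(b21, b22)); });

    matrix m1 = f1.get(), m2 = f2.get(), m3 = f3.get(), m4 = f4.get(),
           m5 = f5.get(), m6 = f6.get(), m7 = f7.get();

    // Combine exactly as in the serial version.
    std::size_t h = a.size() / 2;
    matrix c(a.size(), std::vector<double>(a.size()));
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j) {
            c[i][j]         = m1[i][j] + m4[i][j] - m5[i][j] + m7[i][j];
            c[i][j + h]     = m3[i][j] + m5[i][j];
            c[i + h][j]     = m2[i][j] + m4[i][j];
            c[i + h][j + h] = m1[i][j] - m2[i][j] + m3[i][j] + m6[i][j];
        }
    return c;
}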

My simple benchmarks involved multiplying random matrices of incrementally growing sizes and timing how long it took for each N x N multiplication. Each was repeated numerous times and I took the average. On to the pictures:

As you can see, I had to stop collecting data for the naive algorithm, because it was bringing the experiment to a standstill. We all knew how that one would perform; I thought it would be interesting to see it in perspective. Remember that the Naive and Transpose algorithms are asymptotically equal (!). Here’s a version with the naive data removed:

Some notes:

  • The transpose-naive algorithm performed very well on smaller matrices; however, once N grew past about 3000, its performance degraded badly.
  • I don’t know why the transpose-naive method resulted in such uneven growth.
  • The Strassen algorithm performed poorly until about N = 3000, at which point it began to outperform the parallel transpose-naive algorithm.
  • Strassen’s Algorithm suffers a sharp performance drop each time N grows past a power of 2, since it needs to expand the input matrices to the next larger power of 2 (a rough estimate of the size of that penalty follows this list).
  • After N = 2500, the parallel Strassen’s Algorithm was the clear winner.
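To put a rough number on that penalty: the recursion works on the input padded up to the next power of two, so the cost is governed by the padded dimension rather than N itself. Ignoring the naive-fallback cutoff and the zero-block shortcut (both of which soften this in practice), the back-of-the-envelope estimate is:

N_padded = 2^ceil(log2 N)
work ≈ c * N_padded^(log2 7)

So stepping from N = 2048 (no padding needed) to N = 2049 (padded to 4096) doubles the padded dimension and multiplies the work by roughly 2^(log2 7) = 7, which is the kind of cliff visible at the powers of two in the graph.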

To see what I mean about the performance penalty suffered by Strassen’s, take a look at the graph zoomed in to the range 0 < N < 2500:

At N = 1024 and 2048, you can see the sharp jump in Strassen’s runtime compared to the other, more conventional algorithms. Unfortunately, I had to end the test around N = 4000 because the Strassen calculation caused my computer to run out of memory, but I assume similar penalties are suffered at larger powers of 2.

So, what are the conclusions? Well, it appears that the long-term clear winner is a parallel implementation of Strassen’s Algorithm. In general, the algorithm (both serial and parallel) performed well on large matrices, but it doesn’t necessarily outperform simpler algorithms when the matrix size isn’t close to the next power of 2. When it is close, Strassen’s can even outperform parallel versions of simpler methods, despite all its extra nasty overhead. Pretty cool – theory wins.

Update: I’ve posted most of the source code I used for this post here.


Theory vs Application

A few days ago I started hacking on a small and fairly simple linear algebra library in C++ (cause that’s how I roll). It does the typical stuff, like various matrix or vector operations. When it came time to implement matrix multiplication, I really didn’t feel like getting into LAPACK and all that stuff, so I threw the functionality together myself. After running some tests with relatively large random matrices (a few thousand rows and columns), I was disappointed with how badly the naive matrix multiplication performed. As I should be, right? After all, we learned back in school that this operation is an O(n^3) algorithm.

Well, hold on.

I started thinking about quick ways to improve this performance without getting into anything too complicated. I realized that a huge part of the problem here is how the elements of these matrices are accessed. Your typical matrix multiplication takes the rows of A and the columns of B, and dot-products each pair together into an element of C. However, unless you’re programming in FORTRAN, those column accesses are slow. In a row-major language like C++, you want your memory accesses to be as close together as possible, so the CPU’s caches and prefetching can do their job. Jumping down columns is terrible for memory locality.

So how do you avoid jumping down columns and access only rows in both operands? I took the transpose of B and multiplied the matrices that way. Guess what? Huge (relative) speed improvements. Now the MM algorithm runs along a row of A and a row of the transposed B to compute each element of C. Better still, this parallelizes very naturally. After throwing in a few pthreads, the performance on my 4-core machine improved even further (although not always by a factor of four).
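To make that concrete, here is a stripped-down version of the idea (not the exact code from my library): transpose B once up front, then every dot product walks two contiguous rows.

// Illustrative sketch: multiply by first transposing B, so the inner loop
// walks rows of both operands (contiguous memory) instead of columns of B.
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

matrix transpose_multiply(const matrix &a, const matrix &b)
{
    std::size_t n = a.size(), m = b[0].size(), inner = b.size();

    // One O(n^2) pass to build B^T; this is the extra n^2 term mentioned below.
    matrix bt(m, std::vector<double>(inner));
    for (std::size_t i = 0; i < inner; ++i)
        for (std::size_t j = 0; j < m; ++j)
            bt[j][i] = b[i][j];

    matrix c(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < inner; ++k)
                sum += a[i][k] * bt[j][k];   // row of A dotted with row of B^T
            c[i][j] = sum;
        }
    return c;
}

Parallelizing this is straightforward: each thread can take its own band of rows of C, since no two threads ever write to the same element.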

The point of this discussion is that even after paying for the transpose, the second method is significantly faster. If you want to get technical, its complexity works out to O(n^3 + n^2), which is asymptotically no better than the naive version; the gap you see below is entirely down to constant factors, which goes to show how enormous the constant factor on the naive implementation must be. Here are some matrix multiplication results from identically-sized random matrices to illustrate the significant difference between these algorithms:


MM array size: 1023 x 1023
MM time [standard]: 6.200862
MM time [transpose]: 1.138916
MM time [parallel]: 0.323271

MM array size: 2317 x 2317
MM time [standard]: 76.909222
MM time [transpose]: 12.978604
MM time [parallel]: 5.118813

MM array size: 3179 x 3179
MM time [standard]: 220.900443
MM time [transpose]: 33.185827
MM time [parallel]: 13.594514