
Theory vs Application

A few days ago I started hacking on a small and fairly simple linear algebra library in C++ (cause that’s how I roll). It does the typical stuff, like various matrix and vector operations. When it came time to implement matrix multiplication, I really didn’t feel like getting into LAPACK and all that stuff, so I threw the functionality together myself. After running some tests with relatively large random matrices (a few thousand rows and columns), I was disappointed with how badly the naive matrix multiplication performed. As I should be, right? After all, we learned back in school that this operation is an O(n³) algorithm.

Well, hold on.

I started thinking about quick ways to improve this performance without getting into anything too complicated. I realized that a huge part of the problem here is how the elements of these matrices are accessed. Your typical matrix multiplication takes the rows of A, the columns of B, and dot-products each pair together into an element of C. However, unless you’re programming in FORTRAN (which stores arrays column-major), these column accesses are slow. In a row-major language like C++, you want successive memory accesses to be as close together as possible to take advantage of the caching and prefetching your hardware does for you. Jumping down a column strides across a full row’s worth of memory at every step, which is terrible for locality.
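To make this concrete, here’s a minimal sketch of the naive triple loop (my illustration, not the library’s actual code; it assumes square matrices stored row-major in flat std::vectors). Note the B[k * n + j] access in the inner loop: for a fixed j, successive values of k jump n elements at a time through B, so nearly every read of B can miss the cache.

#include <cstddef>
#include <vector>

// Naive multiply: C = A * B, all three matrices n x n, stored row-major
// in flat vectors. The inner loop walks down a column of B (stride-n
// accesses), which is exactly the cache-hostile pattern described above.
void multiply_naive(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j]; // column walk of B
            }
            C[i * n + j] = sum;
        }
    }
}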

So how do you avoid jumping over columns and access only rows for both matrices? I took the transpose of B and multiplied the matrices that way. Guess what? Huge (relative) speed improvements. Now the MM algorithm runs along rows of A and rows of Bᵀ to compute the elements of C. What makes this even better is that it’s embarrassingly parallel: each row of C can be computed independently. After throwing in a few pthreads, the performance on my 4-core machine improved even further (although not always by a factor of four).
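Here’s a sketch of the transposed version under the same assumptions (flat row-major storage; all the names here are mine, not the original code’s). It pays one O(n²) pass to build Bᵀ, then dot-products rows of A against rows of Bᵀ, so every access in the hot loop is unit-stride. Since rows of C are independent, the outer loop splits trivially across threads; I’ve used std::thread below rather than raw pthreads, purely for brevity.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Computes the band of output rows [row_begin, row_end) of C = A * B,
// given Bt = B^T. Both A and Bt are read row-by-row, so every access
// in the hot loop is unit-stride.
static void multiply_rows(const std::vector<double>& A,
                          const std::vector<double>& Bt,
                          std::vector<double>& C,
                          std::size_t n,
                          std::size_t row_begin,
                          std::size_t row_end)
{
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * Bt[j * n + k]; // rows only
            }
            C[i * n + j] = sum;
        }
    }
}

void multiply_transpose_parallel(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::vector<double>& C,
                                 std::size_t n,
                                 unsigned num_threads = 4)
{
    // One O(n^2) transpose pass, dwarfed by the O(n^3) multiply.
    std::vector<double> Bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];

    // Hand each thread a contiguous band of output rows.
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, n);
        if (begin >= end) break;
        workers.emplace_back(multiply_rows, std::cref(A), std::cref(Bt),
                             std::ref(C), n, begin, end);
    }
    for (auto& w : workers) w.join();
}

Carving C into contiguous bands means the threads never write to the same memory, so no locking is needed; the speedup can still fall short of 4x on a 4-core machine because all the threads share memory bandwidth.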

The point of this discussion is that even after taking the matrix transpose into account, the second method is significantly faster. If you want to get technical, its complexity works out to O(n³ + n²), which is still just O(n³), the same asymptotic class as the naive version. That goes to show you that the difference is all in the constant factor, and the constant in front of the naive implementation must be enormous. Here are some matrix multiplication results from identically-sized random matrices to illustrate the significant difference between these algorithms:

MM array size: 1023 x 1023
MM time [standard]: 6.200862
MM time [transpose]: 1.138916
MM time [parallel]: 0.323271

MM array size: 2317 x 2317
MM time [standard]: 76.909222
MM time [transpose]: 12.978604
MM time [parallel]: 5.118813

MM array size: 3179 x 3179
MM time [standard]: 220.900443
MM time [transpose]: 33.185827
MM time [parallel]: 13.594514
