The “Real” Science Gap

This article was circulating around Twitter the other day – it’s called The Real Science Gap, and it attempts to explain (in vast detail) some of the real issues behind the perceived science and engineering “shortages” in the United States. The premise is that the labour market, universities, and the government take advantage of scientists passing through academia, ultimately creating a culture which drives away potential scientific talent. In other words, there isn’t a scientific shortage – there is, in fact, an oversupply by the design of the current system. The article is specific to the United States, although I’m sure much of it is applicable to Canada too.

The current approach — trying to improve the students or schools — will not produce the desired result, the experts predict, because the forces driving bright young Americans away from technical careers arise elsewhere, in the very structure of the U.S. research establishment.

Apparently, the growing average length of time it takes to complete a postdoctoral fellowship, combined with the falling incentive to actually remain that long in academia, causes potential researchers to pursue alternate career paths. What complicates things, however, is that this situation is still extremely attractive to foreign postdocs, who come to the States to study and research on temporary visas. Thus, the talent built in American universities isn’t necessarily available for long-term research.

The obstacles facing today’s young scientists therefore don’t constitute temporary aberrations but rather are structural features of a system that evolved over a period of 60 years and now meets the needs of major interest groups within the existing structure of law and regulation. Essentially, this system provides a continuing supply of exceptionally skilled labor at artificially low prices, permitting the federal government to finance research at low cost.

What I also found interesting is that this article turns some common “knowledge” about the quality of American primary- and secondary-school education upside-down – arguing that it isn’t, in fact, inferior to that of other Western nations; the comparisons were simply done incorrectly, and American students actually rank equal to or above their counterparts in Europe and elsewhere. In addition, the number of students graduating from post-secondary programs in science and engineering isn’t falling – these students simply choose not to pursue long-term careers in those fields.

Anyways, I mentioned above that it’s a very long and laborious read; however, it’s a completely different perspective on an old story I think a lot of us are used to hearing. In addition, the comment thread is definitely worth skimming through – there are a lot of interesting opinions in there.


Jack Dorsey Talking at Digg

Today, Ken sent me this video of a tech talk Jack Dorsey gave while visiting Digg. I usually don’t have the patience to sit through 40-minute video lectures, but I did watch this one in its entirety.

Dorsey talks informally about some of his experiences leading up to his founding of Twitter, the challenges the company faced early on (especially after being featured at SXSW in 2007, and the surge in popularity that followed), and some of the approaches they took to resolve them. I’d recommend watching it as well. Let me summarize some of the points Dorsey made which I found particularly interesting:

“We didn’t really turn the company and technology around for a year. And some of the reasons for that were that we had no instrumentation at all. We had no idea of what was going on in the system.”

Apparently Twitter was going down daily after the SXSW conference, and the engineers at Twitter were not able to isolate all the bottlenecks and points of failure in their systems because of inadequate system monitoring, logging, analysis, etc. Dorsey says that at his new company (Square), he learned from the above mistake, and one of their initial priorities was building an exceptional dashboard which became a critical component of development at the company.

That’s a pretty tough situation to be in, especially for a new company with limited resources. When you’re a small company on a limited budget, you need all that data accessible in order to make the right decisions. You need to be able to pinpoint every fault and every corner case, and know everything about them. That’s the difference between buying a new set of webservers and realizing that your caching policy is just poorly designed.

Next:

“We’ve had some bad engineering discipline, where we would try to isolate ourselves a bit too much in our work. So, we had some tendencies to really work on singular pieces of code [sic], and it came to the point where there was only one person in the company who knew how the entire system worked. That created a huge single point of failure. We were so afraid that any little change would bring down the entire site.”

Twitter attacked this problem by simultaneously creating two teams:

  1. The architecture team: redesign the entire Twitter system from the ground-up, with the knowledge they’d gained
  2. The iterations team: build instrumentation for every aspect of the system, and incrementally fix and improve it

Now, I would have thought that re-designing a broken system from the ground up would have been the best approach to the problem. What’s interesting is that the architecture team failed to produce a viable alternative, and in the meantime, the iterations team actually fixed all the problems using meticulous system analytics and group review of code. Part of the solution involved the iterations team adopting pair programming, with the pairs constantly rotating around the company so that everyone had at least some sense of how the entire system worked. As a result, they had a lot more people talking, more perspectives on problems, and in general, better solutions – and it avoided the problem of isolating people. Dorsey says it created a very creative environment because of the constant exchange of ideas.

I, like most people who write software, make that “…ew” face when I hear someone mention pair programming. However, I realized that there is no better way to get more brains solving common problems than the solution Twitter came up with. Big problems receive much greater exposure to different perspectives, different experiences, and general discussion than if the conversation never leaves the small team (or individual) responsible for fixing them. I had an epiphany, kind of like that time Gerry invited Ian and me to that design patterns meeting. Lots of good ideas that solve real problems.

Anyways, I hope you watch the video and find it as interesting as I did. It’s a good story.


Intern Hunting

Today, Byron and I spent all day at the University of Waterloo interviewing potential interns. We had quite a variety of people with different (and impressive) backgrounds in computer science, mathematics, and engineering on hand for a technical discussion and some information about life working at Thoora.

Now, I don’t particularly like interviewing people, but I do enjoy talking to smart people. These people were smart. I was very impressed with the students, and actually had a reasonably good time under the circumstances.

Oh, and of course, the day was followed up by an obligatory visit to the Mongolian Grill. Who knew that grilled Asian food wrapped in Mexican tortillas would turn out so well?


The Curiously Recurring Template Pattern

The CRTP is a C++ idiom which is confusing when you read about it, yet straightforward enough that you may have already used it without realizing it. It’s kind of funny that way; a strange, recursive templating technique.

I was working on a new system at work which involved some tedious class hierarchies in performance-critical components. While doing some research, I stumbled upon the CRTP wiki page while reading about the enduring battle between the performance of polymorphic code and the complexities of modern CPU architectures.

Hold on a minute. Some background first.

Dynamic polymorphism is a language abstraction in which different types of objects can be used through a generic interface, and the type of object (and thus, the nature of the work to be done) is determined at runtime. This is achieved in C++ through the use of virtual functions, which in turn have entries in each class’s virtual function table (vtable). Because parent classes and sibling classes can share virtual function names, the purpose of a vtable is for a class to say “my implementation of function f() is located here.” When an object of a particular type is instantiated, the compiler throws into it a pointer (the vptr) to the vtable for the proper class. Class functions which are non-virtual do not require entries in a vtable because there is only one implementation and location of that function.

Static polymorphism is a language abstraction where the different types of objects are determined at compile-time, and as a result no work is required at runtime to determine the type of an object or variable. Consequently, class hierarchies built in this fashion don’t require a vtable. This is where the CRTP comes in.

Say you have a class hierarchy like this:


class A
{
public:
  virtual void f () = 0;
  void g ();
};

class X : public A
{
public:
  void f ();
};

class Y : public A
{
public:
  void f ();
};

A is a parent class with a pure virtual function f(), which X and Y implement; A also has a non-virtual function g(). Now say you have a function foo(A &a):


void
foo (A &a)
{
  a.f ();
}

When foo() calls f() through the A reference, the following things happen:

  1. The object’s vptr (pointer to the class vtable) is used to retrieve the appropriate vtable
  2. The address of f() according to the vtable is loaded from an index into the vtable
  3. The program jumps to the loaded address
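
Conceptually, that dispatch boils down to something like the following hand-written sketch (not actual compiler output; it assumes the common Itanium C++ ABI layout, where the object’s first word is the vptr and the vptr points at the first virtual function entry):

typedef void (*virtual_fn) (A *);

void
foo (A &a)
{
  virtual_fn *vtable = *(virtual_fn **) &a; /* 1. fetch the vptr stored inside the object */
  virtual_fn f = vtable[0];                 /* 2. load f()'s slot from the vtable         */
  f (&a);                                   /* 3. jump to the loaded address              */
}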

You can use the “-fdump-class-hierarchy” flag in g++ to generate the class hierarchy and vtable structure for your code. For the above code, the vtables look like this:

Vtable for A
A::_ZTV1A: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1A)
16    __cxa_pure_virtual

Vtable for X
X::_ZTV1X: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1X)
16    X::f

Vtable for Y
Y::_ZTV1Y: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1Y)
16    Y::f

The first two entries in each vtable above are the top-of-object offset and a pointer to the class’s typeinfo structure (the _ZTI* symbols), followed by the entry for the virtual function f(); the left column shows each entry’s offset into the vtable. The entries are eight bytes apart because it’s a table of pointers (on a 64-bit machine). Note that A::g() does not appear in any vtable – this is because g() is a non-virtual function.

Obviously this requires a little more work than a non-virtual function call, and there’s been research to determine what exactly the cost of this extra work is. As I mentioned earlier, there’s a lot of argument over the topic, and although it’s really no more expensive than dereferencing a pointer for every virtual function call, there are more subtle consequences regarding branch prediction, pipelining, etc. that weigh in.

Regardless of all of that, if you have some very intensive code that uses lots of dynamic polymorphism, it’s going to cost you. Maybe not a lot, but some.

So what does the CRTP have to do with this? Well, the goal is to resolve as much type information as possible at compile-time, so that we can skip the vtable lookup at runtime. We can do this by using templates in the parent class definition. Now consider the following, revised class hierarchy:


template <typename Child>
class A
{
public:
  void f ();
  void g ();
};

class X : public A<X>
{
public:
  void f ();
};

class Y : public A<Y>
{
public:
  void f ();
};

Now we’ll implement A::f() like so:

template <typename Child>
void
A<Child>::f ()
{
  (static_cast<Child*> (this)) -> f ();
}

Ok, this probably needs some explaining.

Note how the parent class uses the type Child as its template parameter. In typical uses of generic programming, this is not particularly unusual. Now, each of X and Y is defined as a subclass of A<Child>, where it passes its own class name as the templated type. This is confusing, but totally syntactically legal. At compile-time, the call inside A<Child>::f() resolves directly to Child::f(), avoiding the need for a runtime vtable lookup. As a result, the virtual keyword is dropped from the parent class A, and the vtables for all the classes are now empty. Thus, no runtime overhead.

What’s the tradeoff? Well, our function foo(A &a) from earlier no longer works, because A now requires a template parameter. You’d have to create overloaded functions, make foo() a template itself, or come up with some other way to deal with the fact that A<X> and A<Y> are now completely distinct types. That’s the flexibility you lose with static polymorphism!
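
One such workaround is to make foo() a function template as well – a minimal sketch:

template <typename Child>
void
foo (A<Child> &a)
{
  a.f (); // statically resolves to Child::f() – no vtable lookup
}

The catch is that the compiler instantiates a separate foo() for each child type, so you trade a little code size and generality for the runtime savings.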

I said earlier that you may have used this pattern without even realizing it. That happened to me while working on the project I mentioned – at the time, I didn’t even consider it polymorphism (I thought I was getting away with a hack). Turns out it’s actually a recognized idiom. I guess C++ is like that.


Apache Thrift Tutorial – The Sequel

I’m going to cover building a simple C++ server using the Apache Thrift framework here, while my buddy Ian Chan will cover the front-end PHP interface in his own blog post.

The other day Ian and I were talking, and we thought it would be cool to do another Facebook/Apache Thrift tutorial – but this time he’d do the front-end client interface and I’d do the backend stuff. He really wanted an example of the sort of work a PHP frontend would find useful to ship off to a backend for processing. So, we came up with the following Thrift interface:

namespace cpp calculator

typedef list<double> Vector

enum BinaryOperation
{
  ADDITION = 1,
  SUBTRACTION = 2,
  MULTIPLICATION = 3,
  DIVISION = 4,
  MODULUS = 5,
}

struct ArithmeticOperation
{
  1:BinaryOperation op,
  2:double lh_term,
  3:double rh_term,
}

exception ArithmeticException
{
  1:string msg,
  2:optional double x,
}

struct Matrix
{
  1:i64 rows,
  2:i64 cols,
  3:list<Vector> data,
}

exception MatrixException
{
  1:string msg,
}

service Calculator
{
  /* Note you can't overload functions */

  double calc (1:ArithmeticOperation op) throws (1:ArithmeticException ae),
  Matrix mult (1:Matrix A, 2:Matrix B) throws (1:MatrixException me),
  Matrix transpose (1:Matrix A) throws (1:MatrixException me),
}

As you can see, we defined a simple calculator, plus a couple of functions for doing some basic matrix operations (yes, this seems to come up often); exactly the sort of thing that would suck to do in PHP. We generated the code with:

thrift --gen cpp calculator.thrift

And away we go with the autogenerated C++ code. After you run the thrift generation for C++, it’ll make a directory called gen-cpp/. Under this, you can find relevant files and classes to do work based on your Thrift definition.


$ ls gen-cpp/
calculator_constants.cpp  Calculator_server.skeleton.cpp
calculator_constants.h    calculator_types.cpp
Calculator.cpp            calculator_types.h
Calculator.h

I renamed the generated Calculator_server.skeleton.cpp file (you’ll want to make sure you do this so your work isn’t overwritten the next time you generate Thrift code), and filled in the function stubs, adding more functionality as necessary. This is the only file you need to edit for your server – you fill in the logic here. The other autogenerated files contain the necessary transport, struct, and function code for your server to work. On the other end of things, Ian generated the PHP code and filled in those stubs – you can find his blog post for this project here. We also threw all the code online under Ian’s GitHub account – you can find the source here.

Below I’ll list the code I filled in for the backend-side of this project.


#include "Calculator.h"
#include <stdint.h>
#include <cmath>
#include <protocol/TBinaryProtocol.h>
#include <server/TSimpleServer.h>
#include <server/TThreadedServer.h>
#include <transport/TServerSocket.h>
#include <transport/TBufferTransports.h>
#include <concurrency/ThreadManager.h>
#include <concurrency/PosixThreadFactory.h>
using namespace ::apache::thrift;
using namespace ::apache::thrift::protocol;
using namespace ::apache::thrift::transport;
using namespace ::apache::thrift::server;
using namespace ::apache::thrift::concurrency;
using boost::shared_ptr;
using namespace calculator;

class CalculatorHandler : virtual public CalculatorIf
{
private:
/* It might be cleaner to stick all these private class functions inside some other class which isn't related to the Thrift interface, but for the sake of brevity, we'll leave them here. */
  double
  __add (double lh_term, double rh_term)
  {
    return (lh_term + rh_term);
  }

  double
  __sub (double lh_term, double rh_term)
  {
    return (lh_term - rh_term);
  }

  double
  __mult (double lh_term, double rh_term)
  {
    return (lh_term * rh_term);
  }

  double
  __div (double lh_term, double rh_term)
  {
    if (rh_term == 0.0)
      {
        ArithmeticException ae;
        ae.msg = std::string ("Division by zero error!");
        throw ae;
      }

    return (lh_term / rh_term);
  }

  double
  __mod (double lh_term, double rh_term)
  {
    if (rh_term == 0.0)
      {
        ArithmeticException ae;
        ae.msg = std::string ("Modulus by zero error!");
        throw ae;
      }

    return std::fmod (lh_term, rh_term);
  }

public:

  CalculatorHandler ()
  {
  }
/* Given the ArithmeticOperation, ensure it's valid and return the resulting value. */
  double
  calc (const ArithmeticOperation& op)
  {
    switch (op.op)
      {
      case ADDITION:
        return __add (op.lh_term, op.rh_term);

      case SUBTRACTION:
        return __sub (op.lh_term, op.rh_term);

      case MULTIPLICATION:
        return __mult (op.lh_term, op.rh_term);

      case DIVISION:
        return __div (op.lh_term, op.rh_term);

      case MODULUS:
        return __mod (op.lh_term, op.rh_term);

      default:
        ArithmeticException ae;
        ae.msg = std::string ("Invalid binary operator provided!");
        throw ae;
      }
  }
/* Multiply A and B together, placing the result in the "return value" C, which is passed as a Matrix reference parameter. */
  void
  mult (Matrix& C, const Matrix& A, const Matrix& B)
  {
    if (A.cols == B.rows)
      {
        double tmp;

        C.rows = A.rows;
        C.cols = B.cols;
        C.data.resize (C.rows);

        for (uint64_t i = 0; i < C.rows; i++)
          {
            C.data[i].resize (C.cols);

            for (uint64_t j = 0; j < C.cols; j++)
              {
                tmp = 0;
                for (uint64_t k = 0; k < A.cols; k++)
                  {
                    tmp += A.data[i][k] * B.data[k][j];
                  }
                C.data[i][j] = tmp;
              }
          }
      }
    else
      {
        MatrixException me;
        me.msg = std::string ("Matrices have invalid dimensions for multiplication!");
        throw me;
      }
  }
/* Take the transpose of A and stuff it into the return Matrix T. */
  void
  transpose (Matrix& T, const Matrix& A)
  {
    T.rows = A.cols;
    T.cols = A.rows;
    T.data.resize (A.cols);

    for (uint64_t i = 0; i < A.rows; i++)
      {
        for (uint64_t j = 0; j < A.cols; j++)
          {
            T.data[j].push_back (A.data[i][j]);
          }
      }
  }
};

int
main (int argc, char **argv)
{
  int port = 9090;
  shared_ptr<CalculatorHandler> handler (new CalculatorHandler ());
  shared_ptr<TProcessor> processor (new CalculatorProcessor (handler));
  shared_ptr<TServerTransport> serverTransport (new TServerSocket (port));
  shared_ptr<TTransportFactory> transportFactory (new TBufferedTransportFactory ());
  shared_ptr<TProtocolFactory> protocolFactory (new TBinaryProtocolFactory ());
  shared_ptr<ThreadManager> threadManager = ThreadManager::newSimpleThreadManager (4);
  shared_ptr<ThreadFactory> threadFactory (new PosixThreadFactory ());
  threadManager -> threadFactory (threadFactory);
  threadManager -> start ();

  /* This time we'll try using a TThreadedServer, a better server than the TSimpleServer in the last tutorial */
  TThreadedServer server (processor, serverTransport, transportFactory, protocolFactory);
  server.serve ();
  return 0;
}

Finally, the code can be compiled either with the Makefile I posted to Ian’s GitHub repo, or with the following build command:

g++ -o calc_server -I./gen-cpp -I/usr/local/include/thrift/ CalculatorServer.cpp gen-cpp/calculator_constants.cpp gen-cpp/Calculator.cpp gen-cpp/calculator_types.cpp -lthrift

So this really isn’t the most complicated program in the world, but it gets the job done fairly simply and effectively (and yes, it actually works!). Note that, as opposed to last time, I used a TThreadedServer as the base Thrift server type here. It’s a little more complicated to set up, but obviously more useful than a single-threaded server. Interesting things to note:

  • Use of TThreadedServer for a multithreaded server
  • You fill in Thrift exceptions like any other struct, and throw them like any other exception
  • You can use typedefs and enums
  • You can’t overload function names

The last point is a real pain, as far as I’m concerned. I’m not sure why the Thrift people couldn’t just mangle the function names so that they resolve to unique entities, but whatever. Anyways, what’s really cool is that we managed to build two complete programs in two completely different languages from a single Thrift definition file: a backend in C++, and a frontend in PHP. Hope you find this useful – happy hacking!
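
As a final aside: Ian’s PHP frontend drives this backend, but if you just want to poke at the server from C++, a minimal client is only a few lines. Here’s a rough sketch (untested as written, and assuming the enum constants land at namespace scope, as they did in the Thrift version I used):

#include "Calculator.h"
#include <cstdio>
#include <protocol/TBinaryProtocol.h>
#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>

using namespace apache::thrift;

int
main (int argc, char **argv)
{
  boost::shared_ptr<transport::TSocket> socket (new transport::TSocket ("127.0.0.1", 9090));
  boost::shared_ptr<transport::TBufferedTransport> transport (new transport::TBufferedTransport (socket));
  boost::shared_ptr<protocol::TBinaryProtocol> protocol (new protocol::TBinaryProtocol (transport));

  /* CalculatorClient is generated from the service definition */
  calculator::CalculatorClient client (protocol);

  transport -> open ();

  calculator::ArithmeticOperation op;
  op.op = calculator::ADDITION; /* enum constants sit at namespace scope in this Thrift version */
  op.lh_term = 2.0;
  op.rh_term = 3.0;

  printf ("2 + 3 = %f\n", client.calc (op));

  transport -> close ();
  return 0;
}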


COBOL and the collapse of the Canadian government

Sheila Fraser made an announcement last week which brought an extremely important issue onto the national stage: the Government of Canada’s IT infrastructure is about to implode. In fact, it’s so bad that key government services may shut down because the systems behind them are in such bad shape.

Apparently the National Immigration Program runs on COBOL and a database system which has been dead for over twenty years. The Department of Public Works’ pay and pension system is also about to collapse. And the Auditor-General says it’ll take billions of dollars to fix.

The problem here is that the government never implemented any sort of continuous maintenance and update strategy for its technical infrastructure, and there apparently isn’t any reasonably competent department to run such a program. The fact is, if your organization (or government) is going to rely on an IT infrastructure for all of its day-to-day operations, you can’t let it rot. You need technically competent people to maintain the system. You need to keep up to date with changes in the industry. And you need to investigate how newer and better tools can keep the continual growth and expansion of your IT services manageable. Otherwise it’ll crumble. The situation in Ottawa is exemplified by Stockwell Day’s brilliant statement:

“As you know, with technology, there are always people who are saying you should have newer and better.”

Apparently the problem with software is that it needs to be updated every forty years or so. Other government tools such as cars, pencils and coffee makers don’t suffer from the same issue.

My biggest concern isn’t the cost of the fix – it’s potentially more serious than that. If the Government of Canada can’t keep its legacy systems up to date (we’re talking almost half-a-century-old technology here), then how is it supposed to protect sensitive information about itself and the citizens of this country from malicious hackers and other threats?


Bloom Filters

I’ve always thought that bloom filters were really cool data structures. They belong to a small class of data structures described as “probabilistic”, meaning there’s a trade-off between performance and accuracy. CS people know trade-offs all too well, whether they’re related to space/time or to approximations of expensive algorithms. Bloom filters kind of fall into both categories.

The purpose of a bloom filter is to indicate, with some chance of error, whether an element belongs to a set. The error is one-sided: the filter may claim that an element is in the set when it in fact is not (a false positive). The reverse, however, is not possible – if an element is in the set, the bloom filter cannot claim that it isn’t (no false negatives).

So how does it work?

A bloom filter requires a set of k hash functions (one of my favourite topics) to generate indices into a table. These can either be a series of fairly simple and quick functions, or some manipulation of one or a few hash functions. For example, you could run your data through SHA1 and extract integral types to use as indices, or use a faster hash function like murmur hash with several different seeds.
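
For instance, here’s a tiny sketch (my own illustration, not from any particular library) of the common “double hashing” trick, which derives all k indices from just two base hash values instead of running k full hash functions:

#include <stdint.h>

/* h1 and h2 are assumed to come from two independent hash functions
   (or one function run with two different seeds). The i-th derived
   index is simply (h1 + i*h2), reduced into the bitmap. */
uint64_t
bloom_index (uint64_t h1, uint64_t h2, uint64_t i, uint64_t nbits)
{
  return (h1 + i * h2) % nbits;
}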

Each of the k hashes is mapped to a bit in a bitmap array, which is initialized to all zeroes. Think of it like a hash table where each entry is hashed to multiple locations – but instead of storing the hash and the object, you set the bit at each location to 1. Because we only care about existence or non-existence in the set, we don’t need to store any additional information.

[Figure: bloom filter example, shamelessly taken from Wikipedia (and yes, I checked the license first =)]

So, for example, let’s say that we’ve inserted elements x, y, and z into the bloom filter, with k = 3 hash functions, as in the figure above. Each of these three elements has three bits set to 1; when a bit is already set to 1, it’s left that way. When we try to look up element w, the bloom filter tells us that w isn’t in the set, because at least one of its bitmap indices is 0.

This is the reason why bloom filters produce false positives but not false negatives. A new element may overlap with the combined bitmap entries of many different elements already in the set. However, there is no operation that sets a bit back to 0 – so if a bitmap entry is 0, no element that maps to that location was ever inserted. As a related sidenote, it is extremely difficult to come up with a scheme that removes elements from these sets; usually they are just rebuilt instead.

So why is this useful, as opposed to a hashtable-based set? The space requirements of a bloom filter are significantly lower than those of a hashtable implementation. This is because:

  • Bloom filters use bits instead of larger elements to determine set existence
  • These bits may overlap between set entries, so not every entry has k dedicated bits

It’s also interesting because the insertion and checking operations have a time complexity of O(k), which depends solely on the number of hash functions rather than the number of elements inserted into the set. There’s no probing or other collision-handling strategy, as with hash tables. There are “collisions” in the sense of false positives, but these don’t impact performance, and with a large enough bitmap and a high enough k value they can be greatly minimized. In addition, these data structures are typically employed in non-critical services (such as caching the existence of data), under the assumption that false positives may occur.
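
To put a number on “greatly minimized”: the standard approximation (a well-known result, not something specific to this implementation) says that after inserting n elements into a bitmap of m bits using k hash functions, the false-positive probability is roughly

p ≈ (1 - e^(-kn/m))^k

With around 10 bits of bitmap per element and the optimal k = (m/n) ln 2 ≈ 7, that works out to a false-positive rate of about 1%.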

Here’s some simple Python that illustrates the insertion and existence-checking operations on a basic bloom filter:


# Assumption: 8 bits in a byte

class BloomFilter:
    def __init__(self, N, hashfnct_list):
        # N bits of storage, rounded up to a whole number of bytes
        self.__bitmap_vector = bytearray((N + 7) // 8)
        self.__fnct_list = hashfnct_list
        self.__N = N

    def insert(self, data):
        # iterate over each hash function in the list and hash the data
        # (we could even do this in parallel)
        for f in self.__fnct_list:
            # mod the hash by __N because that is the size
            # of our bitmap vector, in bits
            h = f(data) % self.__N
            m = h // 8

            # m is the byte index; the hash's remainder mod 8
            # is the bit index within that byte
            self.__bitmap_vector[m] |= 1 << (h % 8)

    def contains(self, data):
        for f in self.__fnct_list:
            h = f(data) % self.__N
            m = h // 8

            if self.__bitmap_vector[m] & (1 << (h % 8)) == 0:
                return False

        return True

There are several variations on this basic implementation which increase the feature set of bloom filters. You can find more details here.


Strassen’s Algorithm (Theory vs Application part 2)

So last time, I mentioned that I didn’t really want to get into anything too complicated. However, I decided to go the whole nine yards and build an implementation of Strassen’s algorithm for matrix multiplication, to compare it against the more conventional methods I was experimenting with. Thanks to a little help from CLRS, I got the job done in a few hours and began running some benchmarks.

Disclaimer: I am not some kind of rocket scientist who is giving you a guarantee the algorithms I implemented are perfect or super-optimized. I’ll only give you the assurance that they are a reasonable implementation in C++.

Strassen’s algorithm is interesting because, given two matrices A and B to be multiplied, it uses divide-and-conquer to break each input matrix into four submatrices, and then performs only seven submatrix multiplications, where a naive divide-and-conquer algorithm would perform eight. As a result, the asymptotic complexity works out to O(n^(log2 7)) = O(n^2.8073), which is better (again, in theory) than naive implementations, which run in O(n^3).
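
For reference, here are the seven products over the quadrants of A and B, and how they recombine into the quadrants of C (this is the standard formulation from CLRS/Wikipedia, not a listing of my exact code):

A = | A11 A12 |    B = | B11 B12 |
    | A21 A22 |        | B21 B22 |

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

Seven recursive multiplications instead of eight, at the cost of a pile of extra additions – which is exactly why it only pays off on large matrices.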

My initial experiments with Strassen’s were fairly disappointing. It ran significantly more slowly than the transpose-naive algorithm, and consumed considerably more memory. In addition, Strassen’s requires the input matrices to be square, with each dimension a power of two; if A and B don’t meet that requirement, they are padded with zeroes. This adds preliminary overhead to the execution, not to mention it artificially inflates the matrices to be potentially significantly larger, which means more computation. It’s also extremely inefficient on small matrices. So here are my customized hacks to the vanilla algorithm:

  • Input matrices or submatrices which have dimension N <= 256 are multiplied using my transpose-naive algorithm from earlier; larger ones continue with recursive Strassen operations.
  • Sections of the algorithm which focus on multiplying submatrices from zero-padded sections of the input matrices skip computation and return a zero-matrix.

In addition, I also implemented a parallel version of the algorithm:

  • After the initial compartmentalization of the two input matrices, separate threads handle the seven subsequent recursive submatrix multiplication operations.

My simple benchmarks involved multiplying random matrices of incrementally growing sizes and timing how long it took for each N x N multiplication. Each was repeated numerous times and I took the average. On to the pictures:

As you can see, I had to stop collecting data for the naive algorithm because it was bringing the experiment to a standstill. We all knew how that one would perform; I just thought it would be interesting to see it in perspective. Remember that the naive and transpose algorithms are asymptotically equal (!). Here’s a version with the naive data removed:

Some notes:

  • The transpose-naive algorithm performed very well with smaller matrices, however as N grew past 3000 elements, its performance degraded badly.
  • I don’t know why the transpose-naive method resulted in such uneven growth.
  • The Strassen algorithm performed poorly until about N = 3000, at which point it began to outperform the parallel transpose-naive algorithm.
  • Strassen’s Algorithm suffers a sharp performance drop each time N grows past a power of 2, since it needs to expand the input matrices to the next larger power of 2.
  • After N = 2500, the parallel Strassen’s Algorithm was the clear winner.

To see what I mean about the performance penalty suffered by Strassen’s, take a look at the graph zoomed-in between 0 < N < 2500:

At N = 1024 and 2048, you can see the sharp jump in Strassen’s runtime compared to the other, conventional algorithms. Unfortunately, I had to end the test around N = 4000 because the Strassen calculation caused my computer to run out of memory. I would assume that similar performance penalties are suffered at larger powers of 2.

So, what are the conclusions? Well, it appears that the long-term winner is a parallel implementation of Strassen’s algorithm. In general, this algorithm (both serial and parallel) performed well on large matrices, but doesn’t necessarily outperform simpler algorithms when the matrix size isn’t close to the next power of 2. When it is close, Strassen’s can even outperform parallel versions of the simpler methods, despite all its extra nasty overhead. Pretty cool – theory wins.

Update: I’ve posted most of the source code I used for this post here


Theory vs Application

A few days ago I started hacking on a small and fairly simple linear algebra library in C++ (cause that’s how I roll). It does the typical stuff, like various matrix and vector operations. When it came time to implement matrix multiplication, I really didn’t feel like getting into LAPACK and all that stuff, so I threw the functionality together myself. After running some tests with relatively large random matrices (a few thousand rows and columns), I was disappointed with how badly the naive matrix multiplication performed. As I should be, right? After all, we learned back in school that this operation is an O(n^3) algorithm.

Well, hold on.

I started thinking about quick ways to improve this performance without getting into anything too complicated. I realized that a huge part of the problem is how the elements of these matrices are accessed. Your typical matrix multiplication takes the rows of A and the columns of B, and dot-products each pair together into an element of C. However, unless you’re programming in FORTRAN, those column accesses are slow. In a row-major language like C++, you want your memory accesses to be as close together as possible, to take advantage of the caching done by your hardware; jumping all over columns is bad for memory locality.

So how do you avoid jumping over columns and access only rows in both matrices? I took the transpose of B and multiplied the matrices that way. Guess what? Huge (relative) speed improvements. Now the MM algorithm runs along pairs of rows from A and the transposed B to compute the elements of C. What makes this better is that it becomes even more embarrassingly parallel. After throwing in a few pthreads, the performance on my 4-core machine improved even further (although it didn’t always quadruple).
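
Here’s a minimal sketch of the trick (my own illustration with matrices as vectors of rows – not the actual library code, and it assumes well-formed, non-empty inputs):

#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Matrix;

/* C = A * B, but we first build Bt = transpose(B) so that the innermost
   loop walks two rows sequentially instead of striding down a column. */
Matrix
multiply_transpose (const Matrix &A, const Matrix &B)
{
  size_t n = A.size (), m = B[0].size (), inner = B.size ();

  /* O(n^2): one pass to build the transpose */
  Matrix Bt (m, std::vector<double> (inner));
  for (size_t i = 0; i < inner; i++)
    for (size_t j = 0; j < m; j++)
      Bt[j][i] = B[i][j];

  /* O(n^3): the multiply, now with row-sequential access on both operands */
  Matrix C (n, std::vector<double> (m));
  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < m; j++)
      {
        double tmp = 0;
        for (size_t k = 0; k < inner; k++)
          tmp += A[i][k] * Bt[j][k];
        C[i][j] = tmp;
      }

  return C;
}

Each row of C depends only on one row of A and the whole of Bt, which is why handing disjoint ranges of i to separate pthreads parallelizes so cleanly.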

The point of this discussion is that even after taking the matrix transpose into account, the second method is significantly faster. If you want to get technical, its complexity works out to O(n^3 + n^2) – which goes to show you that the constant factor in front of the naive implementation must be enormous. Here are some matrix multiplication results from identically-sized random matrices to illustrate the significant difference between these algorithms:


MM array size: 1023 x 1023
MM time [standard]: 6.200862
MM time [transpose]: 1.138916
MM time [parallel]: 0.323271

MM array size: 2317 x 2317
MM time [standard]: 76.909222
MM time [transpose]: 12.978604
MM time [parallel]: 5.118813

MM array size: 3179 x 3179
MM time [standard]: 220.900443
MM time [transpose]: 33.185827
MM time [parallel]: 13.594514


Facebook Thrift Tutorial

If you’ve never heard of Facebook Thrift before (or Apache Thrift, now that Facebook has released the source), then you’re missing out on an incredibly useful technology. Thrift is an RPC platform for developing cross-language (also useful for inter-language) services, and it’s far more efficient and far less time-consuming to build with than some more traditional approaches. Supported languages include Java, Python, PHP, and a pile of others. If you have heard of Thrift before, then you’ll know that the online documentation sucks, and it’s fairly difficult to get started.

I’ve had a fair bit of experience with Thrift, and I thought I’d share some of what I’ve learned.

Thrift works by taking a language-independent Thrift definition file and generating source code from it. This definition file may contain data structure definitions, enumerations, constants, exceptions, and service interfaces. A Thrift ‘service’ is the most important part – here you define the API you’d like to use as a developer, and the Thrift code generation will output all the necessary serialization and transfer code. Let’s start with a simple example of a remote logging system in C++.

First of all, we’ll create a Thrift definition file called logger.thrift:

namespace cpp logger

struct LogMessage
{
  1:i64 timestamp,
  2:string message
}

struct LogMessageBatch
{
  1:list<LogMessage> msgs
}

exception LoggingException
{
  1:string msg
}

service Logger
{
  void log (1:LogMessage lm),
  void batch (1:LogMessageBatch lmb),

  LogMessage getLastMessage ()
}

Note that in the above, you can define structures which aggregate other structures. In this case, a LogMessageBatch contains a list of LogMessages. You can find out all the details about these files here. The next step is to generate the C++ code:

thrift --gen cpp logger.thrift

This will generate a new directory called “gen-cpp” which contains various files. There’s a lot of auto-generated code here, but we’ll first take a look at the file called Logger_server.skeleton.cpp:

// This autogenerated skeleton file illustrates how to build a server.
// You should copy it to another filename to avoid overwriting it.

#include "Logger.h"
#include <protocol/TBinaryProtocol.h>
#include <server/TSimpleServer.h>
#include <transport/TServerSocket.h>
#include <transport/TBufferTransports.h>

using namespace ::apache::thrift;
using namespace ::apache::thrift::protocol;
using namespace ::apache::thrift::transport;
using namespace ::apache::thrift::server;

using boost::shared_ptr;

using namespace logger;

class LoggerHandler : virtual public LoggerIf {
public:
  LoggerHandler() {
    // Your initialization goes here
  }

  void log(const LogMessage &lm) {
    // Your implementation goes here
    printf("log\n");
  }

  void batch(const LogMessageBatch &lmb) {
    // Your implementation goes here
    printf("batch\n");
  }

  void getLastMessage(LogMessage &_return) {
    // Your implementation goes here
    printf("getLastMessage\n");
  }
};

int main(int argc, char **argv) {
  int port = 9090;
  shared_ptr<LoggerHandler> handler(new LoggerHandler());
  shared_ptr<TProcessor> processor(new LoggerProcessor(handler));
  shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
  shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
  shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());

  TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
  server.serve();
  return 0;
}

Those paying attention will notice that the skeleton code matches the definitions given in the Thrift definition file. The logic you fill in inside LoggerHandler defines the server side of your program. So here, fill in what you want to do with each incoming log message, throwing exceptions as necessary. This part should be fairly obvious. Also note that incoming Thrift objects are always passed as const references (in C++, anyway), and functions which are supposed to return a Thrift object do so by storing the relevant information in a non-const reference parameter.

If you look inside the main() function at the bottom of the code, you’ll see some boilerplate code which sets up the structures necessary to get a Thrift server running. There are various other server classes available to you here as well; for example, instead of a TSimpleServer, you may want to run a server with multiple threads, such as the TThreadPoolServer. Depending on your install path, the available options should live somewhere like /usr/local/include/thrift/. Also note that the Thrift classes make heavy use of boost shared_ptrs.

Now, let’s briefly implement the batch() function from above to give an idea of how to interact with these Thrift objects.

void batch(const LogMessageBatch &lmb) {
  std::vector<LogMessage>::const_iterator lmb_iter = lmb.msgs.begin ();
  std::vector<LogMessage>::const_iterator lmb_end = lmb.msgs.end ();

  while (lmb_iter != lmb_end)
    {
      log (*lmb_iter++); // Use the other Thrift-defined interface function to write to disk, whatever.
    }
}

What to note here:

  • You access the fields of each Thrift object directly
  • Remember that all the incoming Thrift objects and their fields are const
  • LogMessageBatch contains a ‘list’ of LogMessages in the definition file, but after C++ generation it’s defined as a ‘vector’. This is just Thrift’s choice of container to represent a list on the generated-code end of things. I’m sure other languages have similar oddities.

The rest should be easy. Now, the big question is: how do we use this server? We need the generated client-side interface, defined in Logger.h and Logger.cpp. It’s called LoggerClient, and in order to use it, it needs to be instantiated with a Thrift protocol pointer. Here’s one way to do it:

#include "Logger.h"
#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>

using namespace apache::thrift;

namespace logger
{
class LoggerClientInterface
{
private:
  LoggerClient *__client;

public:
LoggerClientInterface ()
{
  boost::shared_ptr<transport::TSocket> socket = boost::shared_ptr<transport::TSocket>
      (new transport::TSocket ("127.0.0.1", 12345));

  boost::shared_ptr<transport::TBufferedTransport> transport
      = boost::shared_ptr<transport::TBufferedTransport>
        (new transport::TBufferedTransport (socket));

  boost::shared_ptr<protocol::TBinaryProtocol> protocol = boost::shared_ptr<protocol::TBinaryProtocol>
      (new protocol::TBinaryProtocol (transport));

  transport -> open ();
  __client = new LoggerClient (protocol);
}

void log (LogMessage &lm)
{
  __client -> log (lm);
}
};
}

After initializing the LoggerClient object, you can either use it directly or wrap its functionality in another class, like LoggerClientInterface::log() above. And you’re done. Remember to add /usr/local/include/thrift to your include path and link with -lthrift, and you’re good to go.
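
To give an idea of the end result, here’s a quick usage sketch (LoggerClientInterface is just the hypothetical wrapper defined above – note that it connects to port 12345, so your server needs to listen there rather than on the skeleton’s 9090):

#include <ctime>

int
main (int argc, char **argv)
{
  logger::LoggerClientInterface interface;

  logger::LogMessage lm;
  lm.timestamp = time (NULL);
  lm.message = "hello, thrift";

  /* Ships the message over the socket to the server's log() implementation */
  interface.log (lm);
  return 0;
}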

Although there’s a fair bit of manual setup to get a Thrift service running, it’s mostly boilerplate copy-and-paste which you can easily carry from one service to the next. After that, it’s pretty easy once you get the hang of it. Good luck!