Category Archives: software engineering

Virtual Memory (Theory vs Application part 3)

I think I was reading Reddit a little while ago when a pretty interesting article from the ACM Communications by Poul-Henning Kamp came up. It might be a little sensationalist, but it’s a pretty valid examination of what you can do for program performance when you consider how modern operating systems manage virtual memory. Kamp spent a lot of time building Varnish, and although I’d never heard of it before, it sounds like a very impressive utility. It’s an HTTP accelerator and cache whose performance apparently blows Squid out of the water.

The meat of the article discusses how conventional thinking about the implementation of algorithms and data structures on modern computers is broken because they fail to consider the consequences of cache misses and page faults when accessing data.

On a modern multi-issue CPU, running at some gigahertz clock frequency, the worst-case loss is almost 10 million instructions per VM page fault. If you are running with a rotating disk, the number is more like 100 million instructions. What good is an O(log2(n)) algorithm if those operations cause page faults and slow disk operations? For most relevant datasets an O(n) or even an O(n^2) algorithm, which avoids page faults, will run circles around it.

Kamp provides some images and diagrams to visualize some of what he’s talking about. As a traditionally more systems-focused guy, I might be a little biased, but I feel that modern CS curricula pump out students who have absolutely no idea how computers actually work. Knowing the asymptotically optimal algorithm for everything is great, but if your practical implementation sucks because you’re causing paging all over the place, what good is it? A reasonable mixture makes sense – and the curriculum could make room by removing that course on CVS (and maybe even calculus – barf!).

I wrote a little earlier about how more intelligent practical implementations can have enormous effects over more naive solutions, even though they remain asymptotically equal. Imagine wasting a hundred million instructions because your algorithm’s implementation unnecessarily left some node on disk!


Jack Dorsey Talking at Digg

Today, Ken sent me this video of a tech talk by Jack Dorsey while visiting Digg. I usually don’t have the patience to sit through 40-minute video lectures, but I did watch this one in its entirety.

Dorsey talked informally about some of his experiences leading up to his founding of Twitter, the challenges they faced early on (especially after being featured at SXSW in 2007 and the surge in popularity that followed), and some of the approaches they took to resolve them. I’d recommend watching it as well. Let me summarize some of the points Dorsey made which I found particularly interesting:

We didn’t really turn the company and technology around for a year. And some of the reasons for that were that we had no instrumentation at all. We had no idea what was going on in the system.

Apparently Twitter was going down daily after the SXSW conference, and the engineers at Twitter were not able to isolate all the bottlenecks and points of failure in their systems because of inadequate system monitoring, logging, analysis, etc. Dorsey says that at his new company (Square), he learned from the above mistake, and one of their initial priorities was building an exceptional dashboard which became a critical component of development at the company.

That’s a pretty tough situation to be in, especially for a new company with limited resources. When you’re a small company on a limited budget, you need to make all that data accessible in order to make the right decisions. You need to be able to pinpoint every fault, every corner case, and know everything about them. That’s the difference between having to buy a new set of webservers and realizing your caching policy is just designed poorly.

Next:

We’ve had some bad engineering discipline, where we would try to isolate ourselves a bit too much in our work. So, we had some tendencies to really work on singular pieces of code [sic] and it came to the point where there was only one person in the company who knew how the entire system worked. That created a huge single point of failure. We were so afraid that any little change would bring down the entire site.

Twitter approached this problem by simultaneously creating two teams:

  1. The architecture team: redesign the entire Twitter system from the ground-up, with the knowledge they’d gained
  2. The iterations team: build instrumentation for every aspect of the system, and incrementally fix and improve it

Now, I would have thought that re-designing a broken system from the ground up would have resulted in the best approach to the problem. What’s interesting is that the architecture team failed to produce a viable alternative, while in the meantime the iterations team actually fixed all the problems using meticulous system analytics and group review of code. Part of the solution involved the iterations team adopting pair programming, with the pairs constantly rotating around the company so that everyone had at least some sense of how the entire system worked. As a result, they had a lot more people talking, more perspectives on problems, and in general, better solutions – and it avoided the problem of isolating people. Dorsey says it created a very creative environment because of the constant exchange of ideas.

I, like most people who write software, make that “…ew” face when I hear someone mention PP. However, I realized that there is no better way to get more brains solving more common problems than the solution Twitter came up with. Big problems receive much greater exposure to different perspectives, different experiences, and general discussion than if the conversation never left the small team (or individual) responsible for fixing them. I had an epiphany, kind of like that time Gerry invited Ian and me to that design patterns meeting. Lots of good ideas that solve real problems.

Anyways, I hope you watch the video and find it as interesting as I did. It’s a good story.


The Curiously Recurring Template Pattern

The CRTP is a C++ idiom that is confusing when you first read about it, yet straightforward enough that you may have already used it without realizing it. It’s kind of funny that way; a strange recursive templating technique.

I was working on a new system at work which involved some tedious class hierarchies in performance-critical components. While researching the enduring battle between the performance of polymorphic code and the complexities of modern CPU architectures, I fell upon the CRTP wiki page.

Hold on a minute. Some background first.

Dynamic polymorphism is a language abstraction in which different types of objects can use generic interfaces, and the type of object (and thus, the nature of the work to be done) is determined at runtime. This is achieved in C++ through the use of virtual functions, which in turn have entries in a class’s virtual function table (vtable). Because parent classes and sibling classes can share virtual function names, the purpose of a vtable is for a class to say “my implementation of function f() is located here.” When an object of a particular type is instantiated, the compiler throws into it a pointer to the vtable for the proper class. Class functions which are non-virtual do not require entries in a vtable because there is only one implementation and location of that function.

Static polymorphism is a language abstraction where the different types of objects are determined at compile-time, and as a result no work is required at runtime to determine the type of an object or variable. As a result, class hierarchies built in this fashion don’t require a vtable. This is where the CRTP comes in.

Say you have a class hierarchy like this:


class A
{
public:
  virtual void f () = 0;
  void g ();
};

class X : public A
{
public:
  void f ();
};

class Y : public A
{
public:
  void f ();
};

A is a parent class with a pure virtual function f(), with X and Y implementing A. A has an additional non-virtual function g(). Now say you have a function foo(A &a):


void
foo (A &a)
{
  a.f ();
}

When foo() calls A::f(), the following things happen:

  1. The object’s vptr (pointer to the class vtable) is used to retrieve the appropriate vtable
  2. The address of f() according to the vtable is loaded from an index into the vtable
  3. The program jumps to the loaded address

You can use the “-fdump-class-hierarchy” flag in g++ to generate the class hierarchy and vtable structure for your code. For the above code, the vtables look like this:

Vtable for A
A::_ZTV1A: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1A)
16    __cxa_pure_virtual

Vtable for X
X::_ZTV1X: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1X)
16    X::f

Vtable for Y
Y::_ZTV1Y: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1Y)
16    Y::f

Above you can see each vtable’s entries: the first is the offset-to-top, the second points to the class’s typeinfo (the _ZTI… symbols), and the third is the entry for the virtual function f(), with their offsets into the vtable. They’re eight bytes apart because it’s a table of pointers. Note that A::g() does not appear in the vtable – this is because g() is a non-virtual function in A.

Obviously this requires a little more work than a non-virtual function call, and there’s been research to determine what exactly the cost of this extra work is. As I mentioned earlier, there’s a lot of argument over the topic, and although it’s really no more expensive than dereferencing a couple of pointers for every virtual function call, there are more subtle consequences regarding branch prediction, pipelining, etc., that weigh in.

Regardless of all of that, if you have some very intensive code that uses lots of dynamic polymorphism, it’s going to cost you. Maybe not a lot, but some.

So what does the CRTP have to do with this? Well, the goal is to resolve as much type information as possible at compile-time so that you can skip the vtable lookup step at runtime. We can do this by using templates inside the parent class definition. Now consider the following, revised class hierarchy:


template <typename Child>
class A
{
public:
  void f ();
  void g ();
};

class X : public A<X>
{
public:
  void f ();
};

class Y : public A<Y>
{
public:
  void f ();
};

Now we’ll implement A<Child>::f() like so:

template<typename Child>
void
A<Child>::f ()
{
  (static_cast<Child*> (this)) -> f ();
}

Ok, this probably needs some explaining.

Note how the parent class uses a type Child as its template typename. In typical uses of generic programming, this is not particularly unusual. Now, each of X and Y is defined as a subclass of A<Child>, where they pass their own class name as the template type. This is confusing, but totally legal syntactically. At compile-time, the call in A::f() resolves directly to Child::f(), avoiding the need for a runtime vtable lookup. As a result, the virtual keyword is dropped from the parent class A, and none of the classes need a vtable anymore. Thus, no runtime overhead.

What’s the tradeoff? Well, our function foo(A &a) from earlier no longer works, because A now requires a template parameter. You’d have to create overloaded functions or come up with some other way to deal with the fact that you’re using both A<X> and A<Y>. That’s the flexibility you lose with static polymorphism!

I said earlier that you may have used this pattern without even realizing it. This happened to me when I was working on the project I mentioned – and at the time I didn’t even consider it polymorphism (and thought I was getting away with a hack). Turns out it’s actually a recognized idiom. I guess C++ is like that.


Apache Thrift Tutorial – The Sequel

I’m going to cover building a simple C++ server using the Apache Thrift framework here, while my buddy Ian Chan will cover the front-end PHP interface in his own blog post.

The other day Ian and I were talking and thought it would be cool to do another Facebook/Apache Thrift tutorial, but this time he’d do the front-end client interface and I’d do the backend stuff. He really wanted to do an example of something that you’d find useful to send to a backend for processing from a PHP frontend client. So, we came up with the following thrift interface:

namespace cpp calculator

typedef list<double> Vector

enum BinaryOperation
{
  ADDITION = 1,
  SUBTRACTION = 2,
  MULTIPLICATION = 3,
  DIVISION = 4,
  MODULUS = 5,
}

struct ArithmeticOperation
{
  1:BinaryOperation op,
  2:double lh_term,
  3:double rh_term,
}

exception ArithmeticException
{
  1:string msg,
  2:optional double x,
}

struct Matrix
{
  1:i64 rows,
  2:i64 cols,
  3:list<Vector> data,
}

exception MatrixException
{
  1:string msg,
}

service Calculator
{
  /* Note you can't overload functions */

  double calc (1:ArithmeticOperation op) throws (1:ArithmeticException ae),
  Matrix mult (1:Matrix A, 2:Matrix B) throws (1:MatrixException me),
  Matrix transpose (1:Matrix A) throws (1:MatrixException me),
}

As you can see, we defined a simple calculator with a couple more functions for doing some basic matrix operations (yes, this seems to come up often); something that would suck in PHP. We generated the code with:

thrift --gen cpp calculator.thrift

And away we go with the autogenerated C++ code. After you run the thrift generation for C++, it’ll make a directory called gen-cpp/. Under this, you can find relevant files and classes to do work based on your Thrift definition.


$ ls gen-cpp/
calculator_constants.cpp  Calculator_server.skeleton.cpp
calculator_constants.h    calculator_types.cpp
Calculator.cpp            calculator_types.h
Calculator.h

I renamed the generated Calculator_server.skeleton.cpp file (you’ll want to do this so your work isn’t overwritten the next time you generate Thrift code), and filled in the function stubs, adding more functionality as necessary. This is the only file you need to edit for your server – you fill in the logic here. The other autogenerated files contain the transport, struct, and function code necessary for your server to work. On the other end of things, Ian generated the PHP code and filled in those stubs – you can find his blog post for this project here. We also threw all the code online under Ian’s Github account – you can find all the source here.

Below I’ll list the code I filled in for the backend-side of this project.


#include "Calculator.h"
#include <stdint.h>
#include <cmath>
#include <protocol/TBinaryProtocol.h>
#include <server/TSimpleServer.h>
#include <transport/TServerSocket.h>
#include <transport/TBufferTransports.h>
#include <thrift/concurrency/ThreadManager.h>
#include <thrift/concurrency/PosixThreadFactory.h>
#include <server/TThreadedServer.h>
using namespace ::apache::thrift;
using namespace ::apache::thrift::protocol;
using namespace ::apache::thrift::transport;
using namespace ::apache::thrift::server;
using namespace ::apache::thrift::concurrency;
using boost::shared_ptr;
using namespace calculator;

class CalculatorHandler : virtual public CalculatorIf
{
private:
/* It might be cleaner to stick all these private class functions inside some other class which isn't related to the Thrift interface, but for the sake of brevity, we'll leave them here. */
  double
  __add (double lh_term, double rh_term)
  {
    return (lh_term + rh_term);
  }

  double
  __sub (double lh_term, double rh_term)
  {
    return (lh_term - rh_term);
  }

  double
  __mult (double lh_term, double rh_term)
  {
    return (lh_term * rh_term);
  }

  double
  __div (double lh_term, double rh_term)
  {
    if (rh_term == 0.0)
      {
        ArithmeticException ae;
        ae.msg = std::string ("Division by zero error!");
        throw ae;
      }

    return (lh_term / rh_term);
  }

  double
  __mod (double lh_term, double rh_term)
  {
    if (rh_term == 0.0)
      {
        ArithmeticException ae;
        ae.msg = std::string ("Modulus by zero error!");
        throw ae;
      }

    return std::fmod (lh_term, rh_term);
  }

public:

  CalculatorHandler ()
  {
  }
/* Given the ArithmeticOperation, ensure it's valid and return the resulting value. */
  double
  calc (const ArithmeticOperation& op)
  {
    switch (op.op)
      {
      case ADDITION:
        return __add (op.lh_term, op.rh_term);

      case SUBTRACTION:
        return __sub (op.lh_term, op.rh_term);

      case MULTIPLICATION:
        return __mult (op.lh_term, op.rh_term);

      case DIVISION:
        return __div (op.lh_term, op.rh_term);

      case MODULUS:
        return __mod (op.lh_term, op.rh_term);

      default:
        ArithmeticException ae;
        ae.msg = std::string ("Invalid binary operator provided!");
        throw ae;
      }
  }
/* Multiply A and B together, placing the result in the "return value" C, which is passed as a Matrix reference parameter. */
  void
  mult (Matrix& C, const Matrix& A, const Matrix& B)
  {
    if (A.cols == B.rows)
      {
        double tmp;

        C.rows = A.rows;
        C.cols = B.cols;
        C.data.resize (C.rows);

        for (int64_t i = 0; i < C.rows; i++)
          {
            C.data[i].resize (C.cols);

            for (int64_t j = 0; j < C.cols; j++)
              {
                tmp = 0;
                for (int64_t k = 0; k < A.cols; k++)
                  {
                    tmp += A.data[i][k] * B.data[k][j];
                  }
                C.data[i][j] = tmp;
              }
          }
      }
    else
      {
        MatrixException me;
        me.msg = std::string ("Matrices have invalid dimensions for multiplication!");
        throw me;
      }
  }
/* Take the transpose of A and stuff it into the return Matrix T. */
  void
  transpose (Matrix& T, const Matrix& A)
  {
    T.rows = A.cols;
    T.cols = A.rows;
    T.data.resize (A.cols);

    for (int64_t i = 0; i < A.rows; i++)
      {
        for (int64_t j = 0; j < A.cols; j++)
          {
            T.data[j].push_back (A.data[i][j]);
          }
      }
  }
};

int
main (int argc, char **argv)
{
  int port = 9090;
  shared_ptr<CalculatorHandler> handler (new CalculatorHandler ());
  shared_ptr<TProcessor> processor (new CalculatorProcessor (handler));
  shared_ptr<TServerTransport> serverTransport (new TServerSocket (port));
  shared_ptr<TTransportFactory> transportFactory (new TBufferedTransportFactory ());
  shared_ptr<TProtocolFactory> protocolFactory (new TBinaryProtocolFactory ());
  shared_ptr<ThreadManager> threadManager = ThreadManager::newSimpleThreadManager (4);
  shared_ptr<PosixThreadFactory> threadFactory (new PosixThreadFactory ());
  threadManager->threadFactory (threadFactory);
  threadManager->start ();

 /* This time we'll try using a TThreadedServer, a better server than the TSimpleServer in the last tutorial */
 TThreadedServer server(processor, serverTransport, transportFactory, protocolFactory);
 server.serve();
 return 0;
}

Finally, the code was compiled either with the Makefile I posted onto Ian’s Github repo, or the following build script:

g++ -o calc_server -I./gen-cpp -I/usr/local/include/thrift/ CalculatorServer.cpp gen-cpp/calculator_constants.cpp gen-cpp/Calculator.cpp gen-cpp/calculator_types.cpp -lthrift

So this really isn’t the most complicated program in the world, but it gets the job done fairly simply and effectively (and yes, it actually works!). Note that as opposed to last time, I used a TThreadedServer as the base Thrift server type here. It’s a little more complicated to set up, but obviously more useful than a single-threaded server. Interesting things to note:

  • Use of TThreadedServer for a multithreaded server
  • You fill in Thrift exceptions like any other struct, and throw them like any other exception
  • You can use typedefs and enums
  • You can’t overload function names

The last point is a real pain, as far as I’m concerned. I’m not sure why the Thrift people couldn’t just mangle the function names so that they resolve to unique entities, but whatever. Anyways, what’s really cool is that we managed to build two complete programs in two completely different languages from a single Thrift definition file: a backend in C++ and a frontend in PHP. Hope you find this useful – happy hacking!
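The usual workaround for the overloading restriction, for what it’s worth, is to do the name mangling yourself in the .thrift definition. A sketch, with hypothetical function names:

```thrift
service Calculator
{
  /* Can't overload mult(), so encode the variant in the name */
  Matrix multMatrix (1:Matrix A, 2:Matrix B) throws (1:MatrixException me),
  Matrix multScalar (1:double s, 2:Matrix A) throws (1:MatrixException me),
}
```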