Tag Archives: c++

Customizing TR1 unordered_map Hashing and Equality Functions

A common problem people run into when using the C++ std::tr1::unordered_map or std::tr1::unordered_set is when they want to use a custom or aggregate type as the key in the hash table, and aren’t sure how to do it.

There isn’t really any official documentation about this; you can find some forum and StackOverflow discussions which may help, but I’ll run through a detailed example here too. Let’s start by taking a look at how the unordered_map class is defined:

template<class Key,
         class T,
         class Hash = hash<Key>,
         class Pred = std::equal_to<Key>,
         class Alloc = std::allocator<std::pair<const Key, T> > >
  class unordered_map;

The templated types in the class are:

  • class Key: key type (duh)
  • class T: value type
  • class Hash: hash function for type Key
  • class Pred: an equality predicate for type Key
  • class Alloc: the memory allocator for <Key, T>

The function represented by class Hash produces a hash value given a reference to type Key. This is the value that the map will use to locate the appropriate bucket in which to insert the corresponding value. The function represented by Pred is a predicate; it returns true if the values (as opposed to the hash values) of two Key objects are equal. If the Key type is a class with operator== () defined, the default std::equal_to<Key> will use it and you won’t need to supply a separate Pred functor.

Changing the Hash and Pred types is easy, but non-obvious. The most common reasons for doing so are using a custom type as the key for the hash table, as I mentioned above, or using a custom hash function to achieve better performance from the container.

Let’s say we have the following setup:

typedef struct
{
  uint32_t a;
  double x;
  double y;
} AggregateKey;

{ ... }

AggregateKey k;

{...}

std::tr1::unordered_map<AggregateKey, int> M;
M[k] = 1;

{...}

Attempting to compile this will generate a ton of error output. Depending on the situation, it’ll complain about Pred with something like

error: no match for ‘operator==’ in ‘__x == __y’

or about Hash with

undefined reference to `std::tr1::hash<AggregateKey>::operator()(AggregateKey) const'

Fix the issue by defining the following functors:

typedef struct
{
  std::size_t operator() (const AggregateKey &k) const { return my_hash_fnct (k); }
} AggregateKeyHash;

typedef struct
{
  bool operator() (const AggregateKey &x, const AggregateKey &y) const { return my_eq_test (x, y); }
} AggregateKeyEq;

AggregateKey k;

{...}

// Now, hash value generation and equality testing are
// defined for the AggregateKey type, so declare the
// map using the functors above in the object's template list
std::tr1::unordered_map<AggregateKey, int, AggregateKeyHash, AggregateKeyEq> M;
M[k] = 1;

{...}
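The my_hash_fnct and my_eq_test placeholders above are left undefined in this post; here's one possible sketch (a hypothetical implementation — any reasonable bit-mixing combiner works):

```cpp
#include <stdint.h>
#include <cstring>
#include <cstddef>

typedef struct
{
  uint32_t a;
  double x;
  double y;
} AggregateKey;

// One possible implementation of the my_hash_fnct placeholder:
// reinterpret the double bits as integers so they can be mixed;
// memcpy avoids strict-aliasing problems. The 31-multiplier mixing
// scheme is arbitrary.
inline std::size_t
my_hash_fnct (const AggregateKey &k)
{
  uint64_t xb, yb;
  std::memcpy (&xb, &k.x, sizeof xb);
  std::memcpy (&yb, &k.y, sizeof yb);

  std::size_t h = k.a;
  h = h * 31 + (std::size_t) (xb ^ (xb >> 32));
  h = h * 31 + (std::size_t) (yb ^ (yb >> 32));
  return h;
}

// my_eq_test: member-wise equality over the aggregate.
inline bool
my_eq_test (const AggregateKey &lhs, const AggregateKey &rhs)
{
  return lhs.a == rhs.a && lhs.x == rhs.x && lhs.y == rhs.y;
}
```

The only hard requirements are that equal keys hash to the same value and that the equality test is a true equivalence relation.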

The same logic applies to the std::tr1::unordered_set, with the exception that its template list doesn’t include a class T value type. Also note that you can grab any map’s hash function and equality predicate at any time from the object:

std::tr1::unordered_map<int, int> M;

const std::tr1::unordered_map<int, int>::hasher &hfn = M.hash_function ();
const std::tr1::unordered_map<int, int>::key_equal &eqfn = M.key_eq ();

std::size_t h = hfn (123);
bool b = eqfn (1, 2);

ptrace() Tutorial

The ptrace() system call, available on many *nix systems, allows one process to examine the data and control the execution of another running process. This includes the ability to read and write register values, arbitrary values from memory, and signals. The tracing process (from here on, the parent) may establish this relationship with the process being examined (the child) either by fork()ing it as a literal child process, or by assuming temporary parental responsibilities for a running process by attaching to it during execution. The most common applications of this system call are debuggers and process tracing tools. I’m going to focus on the GNU/Linux version of ptrace.

There isn’t a ton of online documentation about how to use ptrace, probably because it’s one of the more infamous and disliked system calls available on *nix systems. If you’ve never had to use it before, you’ll find it both an educational and frustrating experience. The manpage isn’t bad, but lacks a lot of context.

I’m going to try and avoid using the word process from here on in, and use task instead. This is because ptrace can be applied to individual threads within a process (at least, under Linux). Due to the complicated ambiguity between threads and processes on Linux, I’ll try to limit confusion and just refer to tasks instead.

Let’s take a look at the function definition from the manpage:


#include <sys/ptrace.h>

long ptrace (enum __ptrace_request request,
             pid_t pid,
             void *addr,
             void *data);

Taking a look at the parameters to the function:

  • __ptrace_request request: A code provided to ptrace telling it which operation to perform. More on this later.
  • pid_t pid: The task ID to perform the operation on
  • void *addr: The address in memory for which to read or write for certain ptrace operations; other operations ignore this parameter.
  • void *data: Address for various data structures to be read/written to/from the process. More on this later as well.

ptrace also returns a long, which for all operations except the peeks is 0 on success and -1 on error. The peek operations return the data read from the child, and -1 on error… sometimes. I’ll cover that too.

As you probably noticed from all the special cases above, this is not a straightforward or simple system call to use. There are a lot of special cases for input and output to consider depending on what you happen to be using it for at any given time. I’ll flesh out some of these further below. But firstly, let’s talk about the behavior of a child task under ptrace.

A child being traced has two basic states: stopped and running. ptrace operations cannot be performed on a running child process, thus, they must be done either when

  1. The child stops on its own
  2. The parent stops the child manually

Typically a process will stop (I’m talking about a ‘T’ status here) when it receives a SIGSTOP signal. However, when being traced, a child will stop upon receipt of any signal, with the exception of SIGKILL. This is true even for signals that the child is explicitly ignoring. After receiving notification that the child has stopped via wait(), the parent can take the time to perform various ptrace operations, or can tell the child to continue executing through ptrace, either delivering or ignoring the signal which caused the stop.

If the parent process would like the child to stop (for example, after user input in a debugger), it can simply send it a SIGSTOP through the usual methods. Again, technically almost any signal besides SIGKILL would do the job, but it’s best to avoid ambiguity. It is important to ensure that the child task is stopped before doing anything to it; otherwise ptrace will return an ESRCH error: “no such process”.

Let me itemize the states involved in stopping, ptrace()-ing, and running a child process in this scenario:

  1. Child process is running
  2. Child process stops after receiving signal (SIGSTOP/SIGTRAP/other)
  3. Parent receives child signal status from wait()
  4. Parent performs various ptrace operations
  5. Parent signals child to continue executing

Any ptrace operations performed outside of step 4 will fail. Make sure that you have appropriately been notified that the child is stopped before trying to use ptrace. I mentioned above using the wait() call to retrieve process status of a traced child process. This is correct – as with a conventionally fork()ed process, the tracing parent uses wait() to determine task state after receiving a signal. In fact, it might be easier to use waitpid() so that you can specify the exact task you’re waiting for, in case you’re tracing multiple tasks/threads simultaneously.
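The stop/operate/continue cycle above can be sketched as a parent-side loop. This is a hedged sketch, not production code: error handling is minimal, and the ptrace work of step 4 is just a placeholder comment.

```cpp
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <signal.h>
#include <cstddef>

/* Drive a traced child through the stop/operate/continue cycle
   until it exits or is killed. */
void
trace_loop (pid_t child)
{
  int status;

  for (;;)
    {
      /* Step 3: block until the child changes state */
      if (waitpid (child, &status, 0) == -1)
        break;

      if (WIFEXITED (status) || WIFSIGNALED (status))
        break;  /* child is gone; nothing left to trace */

      if (WIFSTOPPED (status))
        {
          int sig = WSTOPSIG (status);

          /* Step 4: the child is stopped -- safe to ptrace here */
          /* ... PTRACE_PEEKDATA / PTRACE_GETREGS / etc ... */

          /* Step 5: resume; swallow the SIGTRAP/SIGSTOP we caused,
             deliver anything else to the child */
          if (sig == SIGTRAP || sig == SIGSTOP)
            sig = 0;
          ptrace (PTRACE_CONT, child, NULL, (void *) (long) sig);
        }
    }
}
```

The waitpid() variant is used here (rather than wait()) so a specific task can be targeted, as mentioned above.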

Alright, let’s talk about some of the more interesting ptrace codes. I’ll provide a short code sample of a call for each respective request; any NULL argument is an argument unused by ptrace for that request. Firstly, the codes that deal with initiating and terminating the tracing of the child task.

PTRACE_TRACEME


long ret = ptrace (PTRACE_TRACEME, 0, NULL, NULL);

This is the only ptrace operation used by the child. Its purpose is to indicate that the child task is to be traced by a parent and to grant it the necessary ptrace permissions. The 0 in the pid field refers to the child task’s parent. As soon as the child makes a call to any of the exec() functions, it receives a SIGTRAP, at which point it is stopped until the tracing parent allows it to continue. It is important for the parent to wait for this event before performing any ptrace operations, including the configuration operations involved with PTRACE_SETOPTIONS.
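A typical launch sequence, where the parent fork()s the target and waits for the post-exec SIGTRAP, might look like this (a sketch with error checking omitted; launch_traced is a hypothetical helper name):

```cpp
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstddef>

/* Fork and exec 'path' under trace; returns the stopped child's pid. */
pid_t
launch_traced (const char *path)
{
  pid_t pid = fork ();

  if (pid == 0)
    {
      /* Child: request tracing, then exec -- the exec raises SIGTRAP */
      ptrace (PTRACE_TRACEME, 0, NULL, NULL);
      execl (path, path, (char *) NULL);
      _exit (127);  /* only reached if exec fails */
    }

  /* Parent: wait for the SIGTRAP that stops the child at the exec.
     Only after this is it safe to issue other ptrace requests. */
  int status;
  waitpid (pid, &status, 0);

  return pid;  /* child is now stopped and ready for ptrace operations */
}
```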

PTRACE_ATTACH


long ret = ptrace (PTRACE_ATTACH, target_pid, NULL, NULL);

This is used by a task when it wishes to trace the execution of another task. For the most part, this will make the process represented by target_pid the literal child of the tracing task. By and large, the situation created by using PTRACE_ATTACH is equivalent to what would’ve happened if the child had used PTRACE_TRACEME instead.

An important note is that this operation involves sending a SIGSTOP to the targeted process, and as usual, the parent needs to perform a wait() on target_pid after this call before continuing with any other work to ensure the child has properly stopped.
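Putting the attach/wait pairing together, a minimal sketch (attach_and_wait is a hypothetical helper; error handling simplified):

```cpp
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <cstddef>

/* Attach to an already-running task and block until it has stopped.
   Returns 0 once the task is stopped and safe to operate on. */
int
attach_and_wait (pid_t target_pid)
{
  if (ptrace (PTRACE_ATTACH, target_pid, NULL, NULL) == -1)
    return -1;

  /* The attach sends a SIGSTOP; wait for the stop to actually land
     before issuing any further ptrace requests. */
  int status;
  if (waitpid (target_pid, &status, 0) == -1)
    return -1;

  return WIFSTOPPED (status) ? 0 : -1;
}
```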

PTRACE_CONT


long ret = ptrace (PTRACE_CONT, target_pid, NULL, 0);

This will be the request you’ll use each time wait() indicates that the child has stopped after receiving a signal, in order to get it running again. If the data field is anything besides zero or SIGSTOP, ptrace will figure it’s a signal number you’d like delivered to the process. This can be used to actually deliver the signal which caused the child to stop and notify the parent, before the child acts on it. For common signals like SIGTRAP, you probably won’t want to do this. However, if you’d like to see if the child properly handles a SIGUSR1, this would be one way to go about it.

PTRACE_DETACH

long ret = ptrace (PTRACE_DETACH, child_pid, NULL, 0);

Ends the tracing relationship between the parent and child and, if the parent attached to the child, “re-parents” the child back to its original parent process. It then continues the child with a SIGCONT.

Now that we’ve covered the basics of how to get a tracing running, let’s get to some of the more interesting stuff.

PTRACE_PEEKTEXT | PTRACE_PEEKDATA

long word = ptrace (PTRACE_PEEKDATA, child_pid, addr, NULL);
if (word == -1)
  if (errno)
    fail ();

On GNU/Linux systems, text and data address spaces are shared, so although these two codes would be used interchangeably here, on other UNIX platforms this would not be the case. The purpose of this request is to read words from the child task’s data address space and inspect the values. I mentioned above that peek operations require a little extra effort when detecting errors, which is briefly outlined in the code snippet above. Although ptrace will return -1 for error on a peek operation, -1 may also be the value stored at the provided memory address. Thus, errno must be checked in these situations to ensure an error actually happened.

The utility of this request is obvious – reading values from memory addresses in another task’s address space. If you consider GDB, printing variables or setting breakpoints would all need to use this request.
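For instance, reading a block of the child’s memory has to happen one word at a time. A sketch (read_child_mem is a hypothetical helper; it assumes the child is already stopped):

```cpp
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>
#include <errno.h>
#include <cstring>
#include <cstddef>

/* Copy 'len' bytes from the stopped child's address space into 'buf'.
   Returns 0 on success, -1 if any peek fails. */
int
read_child_mem (pid_t pid, unsigned long addr, void *buf, size_t len)
{
  unsigned char *out = (unsigned char *) buf;
  size_t done = 0;

  while (done < len)
    {
      errno = 0;
      long word = ptrace (PTRACE_PEEKDATA, pid, (void *) (addr + done), NULL);
      if (word == -1 && errno != 0)
        return -1;  /* a real error, not just a -1 value in memory */

      /* Copy out a full word, or whatever remains of the request. */
      size_t chunk = sizeof (long);
      if (len - done < chunk)
        chunk = len - done;
      std::memcpy (out + done, &word, chunk);
      done += chunk;
    }

  return 0;
}
```

Note the errno check after each peek, as discussed above.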

PTRACE_POKETEXT | PTRACE_POKEDATA

long ret = ptrace (PTRACE_POKEDATA, child_pid, addr, new_val);

The poke requests are the converse of the peeks – they write arbitrary values into the memory space of the child task. This is useful if you’d like to examine the change in behavior of the child task given different parameters, or for debugging tasks such as inserting breakpoints. This is turning into a pretty long post, but I can cover how to insert breakpoints into a child task’s address space in a later blog post.

PTRACE_SINGLESTEP

long ret = ptrace (PTRACE_SINGLESTEP, child_pid, NULL, NULL);

The single-step request is actually several operations batched into one. A PTRACE_SINGLESTEP request will execute a single instruction in the child task, then stop the child and notify the parent with a SIGTRAP. The operations involved include setting and removing a breakpoint so that only a single instruction is executed. This can be used to slowly step through the execution of a program, and assist with the usage of the other ptrace operations above. Think “stepi” from GDB.
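As a sketch, single-stepping can be looped to count the instructions a stopped child executes before exiting (count_instructions is a hypothetical helper, not from the post):

```cpp
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <cstddef>

/* Single-step a stopped child until it exits, returning the number
   of instructions executed (or -1 on error). */
long
count_instructions (pid_t pid)
{
  long count = 0;
  int status;

  for (;;)
    {
      /* Execute exactly one instruction in the child */
      if (ptrace (PTRACE_SINGLESTEP, pid, NULL, NULL) == -1)
        return -1;
      if (waitpid (pid, &status, 0) == -1)
        return -1;
      if (WIFEXITED (status))
        return count;  /* child finished */
      count++;  /* child stopped (SIGTRAP) after one instruction */
    }
}
```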

PTRACE_GETREGS | PTRACE_SETREGS

#include <sys/user.h>
user_regs_struct regs;
long ret = ptrace (PTRACE_GETREGS, child_pid, NULL, &regs);
#ifdef __x86_64__
regs.rip = 0xdeadbeef;
#elif defined __i386__
regs.eip = 0xdeadbeef;
#endif
ret = ptrace (PTRACE_SETREGS, child_pid, NULL, &regs);

These ptrace requests involve reading and writing the general-purpose register values for the child process. The above example does three things:

  1. Reads the values of all general-purpose registers associated with child_pid
  2. Sets the instruction pointer of the user_regs_struct structure to a not-so-random address
  3. Writes the edited user_regs_struct back to the child, likely causing a crash upon re-execution due to the new instruction pointer setting

Similar functionality is available for the designated floating-point registers as well through the use of PTRACE_GETFPREGS and PTRACE_SETFPREGS.


The Curiously Recurring Template Pattern

The CRTP is a C++ idiom which is confusing when you read about it, yet straightforward enough that you may have already used it without realizing it. It’s kind of funny that way; a strange recursive templating technique.

I was working on a new system at work which involved some tedious class hierarchies in performance-critical components. While researching the enduring battle between the performance of polymorphic code and the complexities of modern CPU architectures, I fell upon the CRTP wiki page.

Hold on a minute. Some background first.

Dynamic polymorphism is a language abstraction in which different types of objects can use generic interfaces, and the type of object (and thus, the nature of the work to be done) is determined at runtime. This is achieved in C++ through the use of virtual functions, which in turn have entries in a class’s virtual function table (vtable). Because parent classes and sibling classes can share virtual function names, the purpose of a vtable is for a class to say “my implementation of function f() is located here.” When an object of a particular type is instantiated, the compiler throws into it a pointer to the vtable for the proper class. Class functions which are non-virtual do not require entries in a vtable because there is only one implementation and location of that function.

Static polymorphism is a language abstraction where the types of objects are determined at compile-time, so no work is required at runtime to determine the type of an object or variable. As a result, class hierarchies built in this fashion don’t require a vtable. This is where the CRTP comes in.

Say you have a class hierarchy like this:


class A
{
public:
  virtual void f () = 0;
  void g ();
};

class X : public A
{
public:
  void f ();
};

class Y : public A
{
public:
  void f ();
};

A is a parent class with a pure virtual function f(), with X and Y implementing A. A has an additional non-virtual function g(). Now say you have a function foo(A &a):


void
foo (A &a)
{
  a.f ();
}

When foo() calls A::f(), the following things happen:

  1. The object’s vptr (pointer to the class vtable) is used to retrieve the appropriate vtable
  2. The address of f() according to the vtable is loaded from an index into the vtable
  3. The program jumps to the loaded address

You can use the “-fdump-class-hierarchy” flag in g++ to generate the class hierarchy and vtable structure for your code. For the above code, the vtables look like this:

Vtable for A
A::_ZTV1A: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1A)
16    __cxa_pure_virtual

Vtable for X
X::_ZTV1X: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1X)
16    X::f

Vtable for Y
Y::_ZTV1Y: 3u entries
0     (int (*)(...))0
8     (int (*)(...))(& _ZTI1Y)
16    Y::f

The first two entries in each vtable are the offset-to-top and a pointer to the class’s typeinfo (RTTI) structure, followed by the entry for f() – a pure-virtual placeholder in A, and the concrete implementations in X and Y. The entries are eight bytes apart because it’s a table of pointers. Note that A::g() does not appear in the vtable – this is because g() is a non-virtual function in A.

Obviously this requires a little more work than a non-virtual function, and there’s been research to determine what exactly the cost of this extra work is. As I mentioned earlier, there’s a lot of argument over the topic, and although it’s really no more expensive than de-referencing a pointer for every virtual function call, there are more subtle consequences regarding branch prediction, pipelining, etc that weigh in.

Regardless of all of that, if you have some very intensive code that uses lots of dynamic polymorphism, it’s going to cost you. Maybe not a lot, but some.

So what does the CRTP have to do with this? Well the goal is to resolve as much type information as possible at compile-time so that you can skip the vtable lookup step at runtime. We can do this by using templates inside the parent class definition. Now consider the following, revised class hierarchy:


template <typename Child>
class A
{
public:
  void f ();
  void g ();
};

class X : public A<X>
{
public:
  void f ();
};

class Y : public A<Y>
{
public:
  void f ();
};

Now we’ll implement A::f() as so:

template<typename Child>
void
A<Child>::f ()
{
  (static_cast<Child*> (this))->f ();
}

Ok, this probably needs some explaining.

Note how the parent class uses a type Child as its templated typename. In typical uses of generic programming, this is not particularly unusual. Now, each of X and Y is defined as a subclass of A<Child>, where each passes its own class name as the templated type. This is confusing, but totally legal syntactically. At compile-time, the call to Child::f() is resolved directly inside A::f(), avoiding the need for a runtime vtable lookup. As a result, the virtual keyword is dropped from the parent class A, and the vtables for all of these classes are now empty. Thus, no runtime overhead.

What’s the tradeoff? Well, our function foo(A &a) from earlier no longer works, because A now requires a template parameter. You’d have to create overloaded functions or come up with some other way to deal with the fact that you’re using both A<X> and A<Y>. That’s the flexibility you lose with static polymorphism!
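One common workaround is to make foo() a template as well, so the concrete type is still resolved at compile-time. Here's a minimal self-contained sketch of the revised hierarchy (the last_called string is just a test aid, not part of the pattern):

```cpp
#include <string>

static std::string last_called;  // records which implementation ran (test aid)

// CRTP base: A<Child> statically dispatches f() to the derived class.
template <typename Child>
class A
{
public:
  void f () { static_cast<Child *> (this)->f (); }
};

class X : public A<X>
{
public:
  void f () { last_called = "X::f"; }
};

class Y : public A<Y>
{
public:
  void f () { last_called = "Y::f"; }
};

// foo() becomes a template too: one definition covers A<X>, A<Y>, ...
// and each instantiation is resolved (and inlinable) at compile-time.
template <typename Child>
void
foo (A<Child> &a)
{
  a.f ();  // statically dispatched -- no vtable lookup
}
```

The cost is that each instantiation of foo() is a distinct function, which is exactly the code-size-for-speed trade static polymorphism makes.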

I said earlier that you may have used this pattern without even realizing it. This happened to me when I was working on the project I mentioned – and at the time I didn’t even consider it polymorphism (and thought I was getting away with a hack). Turns out it’s actually a recognized idiom. I guess C++ is like that.


Apache Thrift Tutorial – The Sequel

I’m going to cover building a simple C++ server using the Apache Thrift framework here, while my buddy Ian Chan will cover the front-end PHP interface in his own blog post.

The other day Ian and I were talking and thought it would be cool to do another Facebook/Apache Thrift tutorial, but this time he’d do the front-end client interface and I’d do the backend stuff. He really wanted to do an example of something that you’d find useful to send to a backend for processing from a PHP frontend client. So, we came up with the following thrift interface:

namespace cpp calculator

typedef list<double> Vector

enum BinaryOperation
{
  ADDITION = 1,
  SUBTRACTION = 2,
  MULTIPLICATION = 3,
  DIVISION = 4,
  MODULUS = 5,
}

struct ArithmeticOperation
{
  1:BinaryOperation op,
  2:double lh_term,
  3:double rh_term,
}

exception ArithmeticException
{
  1:string msg,
  2:optional double x,
}

struct Matrix
{
  1:i64 rows,
  2:i64 cols,
  3:list<Vector> data,
}

exception MatrixException
{
  1:string msg,
}

service Calculator
{
  /* Note you can't overload functions */

  double calc (1:ArithmeticOperation op) throws (1:ArithmeticException ae),
  Matrix mult (1:Matrix A, 2:Matrix B) throws (1:MatrixException me),
  Matrix transpose (1:Matrix A) throws (1:MatrixException me),
}

As you can see, we defined a simple calculator with a couple more functions for doing some basic matrix operations (yes, this seems to come up often); something that would suck to do in PHP. We generated the code with

thrift --gen cpp calculator.thrift

And away we go with the autogenerated C++ code. After you run the thrift generation for C++, it’ll make a directory called gen-cpp/. Under this, you can find relevant files and classes to do work based on your Thrift definition.


$ ls gen-cpp/
calculator_constants.cpp  Calculator_server.skeleton.cpp
calculator_constants.h    calculator_types.cpp
Calculator.cpp            calculator_types.h
Calculator.h

I renamed the generated Calculator_server.skeleton.cpp file (you’ll want to make sure you do this so your work isn’t overwritten the next time you generate Thrift code), and filled in the function stubs adding more functionality as necessary. This file is the only file containing code which you need to edit for your server – you need to fill in the logic here. The other autogenerated files contain necessary transport class, struct, and function code for your server to work. On the other end of things, Ian generated the PHP code and filled in those stubs – you can find his blog post for this project here. We also threw all the code online under Ian’s Github account – you can find all the source here.

Below I’ll list the code I filled in for the backend-side of this project.


#include "Calculator.h"
#include <stdint.h>
#include <cmath>
#include <protocol/TBinaryProtocol.h>
#include <server/TSimpleServer.h>
#include <transport/TServerSocket.h>
#include <transport/TBufferTransports.h>
#include <concurrency/ThreadManager.h>
#include <concurrency/PosixThreadFactory.h>
#include <server/TThreadedServer.h>
using namespace ::apache::thrift;
using namespace ::apache::thrift::protocol;
using namespace ::apache::thrift::transport;
using namespace ::apache::thrift::server;
using namespace ::apache::thrift::concurrency;
using boost::shared_ptr;
using namespace calculator;

class CalculatorHandler : virtual public CalculatorIf
{
private:
/* It might be cleaner to stick all these private class functions inside some other class which isn't related to the Thrift interface, but for the sake of brevity, we'll leave them here. */
  double
  __add (double lh_term, double rh_term)
  {
    return (lh_term + rh_term);
  }

  double
  __sub (double lh_term, double rh_term)
  {
    return (lh_term - rh_term);
  }

  double
  __mult (double lh_term, double rh_term)
  {
    return (lh_term * rh_term);
  }

  double
  __div (double lh_term, double rh_term)
  {
    if (rh_term == 0.0)
      {
        ArithmeticException ae;
        ae.msg = std::string ("Division by zero error!");
        throw ae;
      }

    return (lh_term / rh_term);
  }

  double
  __mod (double lh_term, double rh_term)
  {
    if (rh_term == 0.0)
      {
        ArithmeticException ae;
        ae.msg = std::string ("Modulus by zero error!");
        throw ae;
      }

    return std::fmod (lh_term, rh_term);
  }

public:

  CalculatorHandler ()
  {
  }
/* Given the ArithmeticOperation, ensure it's valid and return the resulting value. */
  double
  calc (const ArithmeticOperation& op)
  {
    switch (op.op)
      {
      case ADDITION:
        return __add (op.lh_term, op.rh_term);

      case SUBTRACTION:
        return __sub (op.lh_term, op.rh_term);

      case MULTIPLICATION:
        return __mult (op.lh_term, op.rh_term);

      case DIVISION:
        return __div (op.lh_term, op.rh_term);

      case MODULUS:
        return __mod (op.lh_term, op.rh_term);

      default:
        ArithmeticException ae;
        ae.msg = std::string ("Invalid binary operator provided!");
        throw ae;
      }
  }
/* Multiply A and B together, placing the result in the "return value" C, which is passed as a Matrix reference parameter. */
  void
  mult (Matrix& C, const Matrix& A, const Matrix& B)
  {
    if (A.cols == B.rows)
      {
        double tmp;

        C.rows = A.rows;
        C.cols = B.cols;
        C.data.resize (C.rows);

        for (uint64_t i = 0; i < C.rows; i++)
          {
            C.data[i].resize (C.cols);

            for (uint64_t j = 0; j < C.cols; j++)
              {
                tmp = 0;
                for (uint64_t k = 0; k < A.cols; k++)
                  {
                    tmp += A.data[i][k] * B.data[k][j];
                  }
                C.data[i][j] = tmp;
              }
          }
      }
    else
      {
        MatrixException me;
        me.msg = std::string ("Matrices have invalid dimensions for multiplication!");
        throw me;
      }
  }
/* Take the transpose of A and stuff it into the return Matrix T. */
  void
  transpose (Matrix& T, const Matrix& A)
  {
    T.rows = A.cols;
    T.cols = A.rows;
    T.data.resize (A.cols);

    for (uint64_t i = 0; i < A.rows; i++)
      {
        for (uint64_t j = 0; j < A.cols; j++)
          {
            T.data[j].push_back (A.data[i][j]);
          }
      }
  }
};

int
main (int argc, char **argv)
{
  int port = 9090;
  shared_ptr<CalculatorHandler> handler (new CalculatorHandler ());
  shared_ptr<TProcessor> processor (new CalculatorProcessor (handler));
  shared_ptr<TServerTransport> serverTransport (new TServerSocket (port));
  shared_ptr<TTransportFactory> transportFactory (new TBufferedTransportFactory ());
  shared_ptr<TProtocolFactory> protocolFactory (new TBinaryProtocolFactory ());
  shared_ptr<ThreadManager> threadManager = ThreadManager::newSimpleThreadManager (4);
  shared_ptr<PosixThreadFactory> threadFactory (new PosixThreadFactory ());
  threadManager->threadFactory (threadFactory);
  threadManager->start ();

 /* This time we'll try using a TThreadedServer, a better server than the TSimpleServer in the last tutorial */
 TThreadedServer server(processor, serverTransport, transportFactory, protocolFactory);
 server.serve();
 return 0;
}

Finally, the code was compiled either with the Makefile I posted onto Ian’s Github repo, or the following build script:

g++ -o calc_server -I./gen-cpp -I/usr/local/include/thrift/ CalculatorServer.cpp gen-cpp/calculator_constants.cpp gen-cpp/Calculator.cpp gen-cpp/calculator_types.cpp -lthrift

So this really isn’t the most complicated program in the world, but it gets the job done fairly simply and effectively (and yes, it actually works!). Note that this time I used a TThreadedServer as the base Thrift server type, as opposed to the TSimpleServer in the last tutorial. It’s a little more complicated to set up, but obviously more useful than a single-threaded server. Interesting things to note:

  • Use of TThreadedServer for a multithreaded server
  • You fill in Thrift exceptions like any other struct, and throw them like any other exception
  • You can use typedefs and enums
  • You can’t overload function names

The last point is a real pain, as far as I am concerned. I’m not sure why the Thrift people couldn’t just mangle the function names so that they resolve to unique entities, but whatever. Anyways, what’s really cool is that we managed to build two complete programs in two completely different languages using a single Thrift definition file: a backend in C++, and a frontend in PHP. Hope you find this useful – happy hacking!


Theory vs Application

A few days ago I started hacking on a small and fairly simple linear algebra library in C++ (cause that’s how I roll). It does the typical stuff, like various matrix or vector operations. When it came time to implement matrix multiplication, I really didn’t feel like getting into LAPACK and all that stuff, so I threw the functionality together myself. After running some tests with relatively large random matrices (a few thousand rows and columns), I was disappointed with how badly the naive matrix multiplication performed. As I should be, right? After all, we learned back in school that this operation is an O(n³) algorithm.

Well, hold on.

I started thinking about quick ways to improve this performance without getting into anything too complicated. I realized that a huge part of the problem here is how the elements of these matrices are accessed. Your typical matrix multiplication takes the rows of A, the columns of B, and dot-products each of them together into an element of C. However, unless you’re programming in FORTRAN, these column accesses are slow. In row-based array languages, like C++, you want your memory accesses to be as close together as possible, thanks to caching optimizations from your OS and hardware. Jumping all over columns isn’t good for memory locality.

So how do you avoid jumping over columns and access only rows for both matrices? I took the transpose and multiplied the matrices that way. Guess what? Huge (relative) speed improvements. Now, the MM algorithm runs along pairs of rows from A and B to compute the elements of C. What makes this better is that it becomes even more embarrassingly parallel. After throwing in a few pthreads, the performance on my 4-core machine improved even further (although not always quadruple).

The point of this discussion is that even after taking the matrix transpose into account, the second method is significantly faster. If you want to get technical, its complexity works out to O(n³ + n²), which goes to show you that the constant factor in front of the naive implementation must be enormous. Here’s some matrix multiplication results from identically-sized random matrices to illustrate the significant difference between these algorithms:
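To illustrate, here’s a minimal sketch of both approaches over hypothetical flat row-major matrices (not the actual library code):

```cpp
#include <vector>
#include <cstddef>

typedef std::vector<double> Mat;  // row-major, n x n (illustrative only)

// Naive version: walks B by column, which strides across memory
// and defeats the cache on every inner-loop iteration.
Mat
mm_naive (const Mat &A, const Mat &B, std::size_t n)
{
  Mat C (n * n, 0.0);
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j < n; j++)
      for (std::size_t k = 0; k < n; k++)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
  return C;
}

// Transpose B first; then the inner loop walks two rows sequentially,
// which is exactly the access pattern caches are built for.
Mat
mm_transpose (const Mat &A, const Mat &B, std::size_t n)
{
  Mat Bt (n * n);
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j < n; j++)
      Bt[j * n + i] = B[i * n + j];

  Mat C (n * n, 0.0);
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j < n; j++)
      {
        double tmp = 0.0;
        for (std::size_t k = 0; k < n; k++)
          tmp += A[i * n + k] * Bt[j * n + k];  // two row walks
        C[i * n + j] = tmp;
      }
  return C;
}
```

Both compute the same product; only the memory access pattern differs, which is where the speedups below come from.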


MM array size: 1023 x 1023
MM time [standard]: 6.200862
MM time [transpose]: 1.138916
MM time [parallel]: 0.323271

MM array size: 2317 x 2317
MM time [standard]: 76.909222
MM time [transpose]: 12.978604
MM time [parallel]: 5.118813

MM array size: 3179 x 3179
MM time [standard]: 220.900443
MM time [transpose]: 33.185827
MM time [parallel]: 13.594514


Facebook Thrift Tutorial

If you’ve never heard of Facebook Thrift before (or Apache Thrift, now that Facebook has released the source), then you’re missing out on an incredibly useful technology. Thrift is an RPC platform for developing cross-language (also useful for inter-language) services which is far more efficient and far less time-consuming to build than some more traditional approaches. Supported languages include Java, Python, PHP, and a pile of others. If you have heard of Thrift before, then you’ll know that the online documentation sucks, and it’s fairly difficult to get started.

I’ve had a fair bit of experience with Thrift, and I thought I’d share some of what I’ve learned.

Thrift works by taking a language-independent Thrift definition file and generating source code from it. This definition file may contain data structure definitions, enumerations, constant definitions, exceptions, and service interfaces. A Thrift ‘service’ is the most important part – here, you’ll define the API you’d like to use as a developer, and the Thrift code generation will output all the necessary serialization and transfer code. Let’s start with a simple example of a remote logging system in C++.

First of all, we’ll create a Thrift definition file called logger.thrift:

namespace cpp logger

struct LogMessage
{
  1:i64 timestamp,
  2:string message
}

struct LogMessageBatch
{
  1:list<LogMessage> msgs
}

exception LoggingException
{
  1:string msg
}

service Logger
{
  void log (1:LogMessage lm),
  void batch (1:LogMessageBatch lmb),

  LogMessage getLastMessage ()
}

Note that in the above, you can define structures which aggregate other structures. In this case, a LogMessageBatch contains a list of LogMessages. You can find out all the details about these files here. The next step is to generate the C++ code:

thrift --gen cpp logger.thrift

This will generate a new directory called “gen-cpp” which contains various files. There’s a lot of auto-generated code here, but we’ll first take a look at the file called Logger_server.skeleton.cpp:

// This autogenerated skeleton file illustrates how to build a server.
// You should copy it to another filename to avoid overwriting it.

#include "Logger.h"
#include <protocol/TBinaryProtocol.h>
#include <server/TSimpleServer.h>
#include <transport/TServerSocket.h>
#include <transport/TBufferTransports.h>

using namespace ::apache::thrift;
using namespace ::apache::thrift::protocol;
using namespace ::apache::thrift::transport;
using namespace ::apache::thrift::server;

using boost::shared_ptr;

using namespace logger;

class LoggerHandler : virtual public LoggerIf {
public:
  LoggerHandler() {
    // Your initialization goes here
  }

  void log(const LogMessage &lm) {
    // Your implementation goes here
    printf("log\n");
  }

  void batch(const LogMessageBatch &lmb) {
    // Your implementation goes here
    printf("batch\n");
  }

  void getLastMessage(LogMessage &_return) {
    // Your implementation goes here
    printf("getLastMessage\n");
  }
};

int main(int argc, char **argv) {
  int port = 9090;
  shared_ptr<LoggerHandler> handler(new LoggerHandler());
  shared_ptr<TProcessor> processor(new LoggerProcessor(handler));
  shared_ptr<TServerTransport> serverTransport(new TServerSocket(port));
  shared_ptr<TTransportFactory> transportFactory(new TBufferedTransportFactory());
  shared_ptr<TProtocolFactory> protocolFactory(new TBinaryProtocolFactory());

  TSimpleServer server(processor, serverTransport, transportFactory, protocolFactory);
  server.serve();
  return 0;
}

Those paying attention will notice that the skeleton code matches the definitions given in the Thrift definition file. The logic you fill in inside the LoggerHandler defines how the server side of your program runs. So here, fill in what you want to do with each incoming log message, throwing exceptions as necessary. This part should be fairly obvious. Also note that incoming Thrift objects are always passed by const reference (in C++, anyway), and functions which are supposed to return a Thrift object do so by storing the relevant information in a non-const reference parameter.
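That return-by-reference convention is worth internalizing, since every generated C++ function with a struct return type looks this way. Here’s a minimal sketch of the pattern using a plain struct as a stand-in for generated Thrift code (names are illustrative, not from the generated files):

```cpp
#include <cstdint>
#include <string>

// Stand-in for the Thrift-generated LogMessage struct.
struct LogMessage {
  int64_t timestamp;
  std::string message;
};

// Thrift-style "return": the result is written into a non-const
// reference parameter instead of being returned by value.
void getLastMessage(LogMessage &_return) {
  _return.timestamp = 1234567890;
  _return.message = "last message";
}
```

Callers allocate the object themselves and pass it in: `LogMessage lm; getLastMessage(lm);`.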

If you look inside the main() function at the bottom of the code, you’ll see some boilerplate that sets up the structures needed to get a Thrift server running. There are various other server classes available to you as well; for example, instead of a TSimpleServer, you may want to run a multithreaded server such as the TThreadPoolServer. Depending on your install path, you can browse the available classes somewhere like /usr/local/include/thrift/. Also note that the Thrift classes make heavy use of boost shared_ptrs.

Now, let’s just briefly implement the batch() function from above to give an idea of how to interact with these Thrift objects.

void batch(const LogMessageBatch &lmb) {

  std::vector<LogMessage>::const_iterator lmb_iter = lmb.msgs.begin ();
  std::vector<LogMessage>::const_iterator lmb_end = lmb.msgs.end ();

  while (lmb_iter != lmb_end)
  {
    log (*lmb_iter++); // Use the other thrift-defined interface function to write to disk, whatever.
  }
}

What to note here:

  • You access the fields of each Thrift object directly
  • Remember that all the incoming Thrift objects and their fields are const
  • LogMessageBatch contains a ‘list’ of LogMessages in the definition file, but after C++ generation it’s defined as a std::vector. This is just Thrift’s choice of container to represent a list on the generated-code end of things. Other languages have similar mappings.

The rest should be easy. Now, the big question is: how do we use this server? We need to use the generated client-side interface, defined in Logger.h and Logger.cpp. It’s called LoggerClient, and in order to use it, it needs to be instantiated with a Thrift protocol pointer. Here’s one way to do it:

#include "Logger.h"
#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include <protocol/TBinaryProtocol.h>

using namespace apache::thrift;

namespace logger
{
class LoggerClientInterface
{
private:
  LoggerClient *__client;

public:
LoggerClientInterface ()
{
  // The port must match the one the server listens on (9090 in the skeleton above).
  boost::shared_ptr<transport::TSocket> socket
      (new transport::TSocket ("127.0.0.1", 9090));

  boost::shared_ptr<transport::TBufferedTransport> transport
      (new transport::TBufferedTransport (socket));

  boost::shared_ptr<protocol::TBinaryProtocol> protocol
      (new protocol::TBinaryProtocol (transport));

  transport -> open ();
  __client = new LoggerClient (protocol);
}

void log (LogMessage &lm)
{
  __client -> log (lm);
}
};
}

After initializing the LoggerClient object, you can either use it directly or wrap its functionality in another class, like LoggerClientInterface::log() above. And you’re done. Remember to add /usr/local/include/thrift to your include path and link with -lthrift, and you’re good to go.

Although there’s a fair bit of manual setup to get a Thrift service running, it’s mostly boilerplate that you can copy and paste from one service to the next. After that, it’s pretty easy once you get the hang of it. Good luck!