Author Archives: michael cvet

Photography in San Francisco Bay

Since moving to San Francisco a few months ago, I’ve taken the opportunity to go out on a lot of hikes and feed my new obsession with panoramic photography. I thought I’d share a few of my favourites that I’ve taken so far. I use a Nikon D90 with an 18-105mm, Hugin for photo stitching, and GIMP for touchups.

Here I’m looking Northeast from the Land’s End main trail in the morning – the Golden Gate Bridge in the distance is still mostly covered by fog:

This was taken at one of the beaches off of the Land’s End main trail:

The Pacific from Muir Lookout:

Looking North from Angel Island towards Tiburon:

Finally, looking down towards the Pacific from Mount Tamalpais:

I’ve posted more photos to my public minus profile.


The Strassen Algorithm in C++

Quite some time ago I wrote about some experiments I did with some matrix multiplication algorithms. I’ve finally got around to cleaning up and posting (most of) the source code I used to generate the data in that post.

The source is now hosted in my GitHub repository here.

The project contains a few toys for playing around with matrices, but the meat is in

src/strassen/strassen_matrix_multiplier.hpp
src/strassen/parallel_strassen_matrix_multiplier.hpp

There are a few other algorithms in there for comparison’s sake, as well as a simple testing client that provides some timing information. I’m using CMake to generate the Makefiles; if you’ve never used it before, you should. It’s great.

Some things to note:

  • The Strassen algorithm has a very high constant factor; for smaller matrices or sub-matrices, the code falls back to an optimized naive multiplication algorithm (a rough sketch of this recursion-plus-cutoff structure follows this list)
  • The algorithm needs to pad matrices with zeroes if they’re non-square or have dimensions which are not a power of two. So if your input matrices have a square dimension just above a power of two, say N = 2^n + 1, the algorithm ends up working with a matrix almost twice that size, and it will perform badly on matrices with this characteristic.
  • This implementation is mostly a demonstrative toy in the hope someone finds it useful; it’s not particularly optimized and the parallel Strassen class is kind of naive.
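For reference, here is a minimal sketch of that recursion-plus-cutoff structure. It is not the repository’s actual implementation, just an illustration: it assumes the inputs are already square with a power-of-two dimension (i.e. already padded), and the cutoff value is a placeholder you would tune on your own hardware.

#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Mat;

/* Plain O(n^3) multiplication, used below the recursion cutoff. */
static Mat naive_mul (const Mat &A, const Mat &B)
{
  std::size_t n = A.size ();
  Mat C (n, std::vector<double> (n, 0.0));
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t k = 0; k < n; k++)
      for (std::size_t j = 0; j < n; j++)
        C[i][j] += A[i][k] * B[k][j];
  return C;
}

static Mat add (const Mat &A, const Mat &B)
{
  std::size_t n = A.size ();
  Mat C (n, std::vector<double> (n));
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j < n; j++)
      C[i][j] = A[i][j] + B[i][j];
  return C;
}

static Mat sub (const Mat &A, const Mat &B)
{
  std::size_t n = A.size ();
  Mat C (n, std::vector<double> (n));
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j < n; j++)
      C[i][j] = A[i][j] - B[i][j];
  return C;
}

/* Copy the h-by-h block of M whose top-left corner is (r, c). */
static Mat block (const Mat &M, std::size_t r, std::size_t c, std::size_t h)
{
  Mat S (h, std::vector<double> (h));
  for (std::size_t i = 0; i < h; i++)
    for (std::size_t j = 0; j < h; j++)
      S[i][j] = M[r + i][c + j];
  return S;
}

/* Strassen multiplication of two n-by-n matrices where n is a power
 * of two (pad with zeroes beforehand if it isn't). */
Mat strassen (const Mat &A, const Mat &B)
{
  std::size_t n = A.size ();

  /* The high constant factor makes recursion a loss below some cutoff;
   * 64 here is a made-up placeholder you would tune experimentally. */
  const std::size_t cutoff = 64;
  if (n <= cutoff)
    return naive_mul (A, B);

  std::size_t h = n / 2;
  Mat A11 = block (A, 0, 0, h), A12 = block (A, 0, h, h);
  Mat A21 = block (A, h, 0, h), A22 = block (A, h, h, h);
  Mat B11 = block (B, 0, 0, h), B12 = block (B, 0, h, h);
  Mat B21 = block (B, h, 0, h), B22 = block (B, h, h, h);

  /* The seven recursive products that replace the usual eight. */
  Mat M1 = strassen (add (A11, A22), add (B11, B22));
  Mat M2 = strassen (add (A21, A22), B11);
  Mat M3 = strassen (A11, sub (B12, B22));
  Mat M4 = strassen (A22, sub (B21, B11));
  Mat M5 = strassen (add (A11, A12), B22);
  Mat M6 = strassen (sub (A21, A11), add (B11, B12));
  Mat M7 = strassen (sub (A12, A22), add (B21, B22));

  Mat C11 = add (sub (add (M1, M4), M5), M7);
  Mat C12 = add (M3, M5);
  Mat C21 = add (M2, M4);
  Mat C22 = add (sub (add (M1, M3), M2), M6);

  /* Stitch the four result quadrants back together. */
  Mat C (n, std::vector<double> (n));
  for (std::size_t i = 0; i < h; i++)
    for (std::size_t j = 0; j < h; j++)
    {
      C[i][j]         = C11[i][j];
      C[i][j + h]     = C12[i][j];
      C[i + h][j]     = C21[i][j];
      C[i + h][j + h] = C22[i][j];
    }
  return C;
}

A parallel variant can, for example, compute the seven recursive products concurrently, which is roughly the idea behind the parallel class in the repository.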

Happy hacking!


Customizing TR1 unordered_map Hashing and Equality Functions

A common problem people run into when using the C++ std::tr1::unordered_map or std::tr1::unordered_set is wanting to use a custom or aggregate type as the key in the hash table, and not being sure how to do it.

There isn’t really any official documentation about this; you can find some forum and StackOverflow discussions which may help, but I’ll run through a detailed example here too. Let’s start by taking a look at how the unordered_map class is defined:

template<class Key,
         class T,
         class Hash = hash<Key>,
         class Pred = std::equal_to<Key>,
         class Alloc = std::allocator<std::pair<const Key, T> > >
  class unordered_map;

The templated types in the class are:

  • class Key: key type (duh)
  • class T: value type
  • class Hash: hash function for type Key
  • class Pred: an equality predicate for type Key
  • class Alloc: the memory allocator for <Key, T>

The function represented by class Hash produces a hash value given a reference to type Key. This is the value that the map will use to locate the appropriate bucket in which to insert the corresponding value. The function represented by Pred is a predicate; it returns true if the values (as opposed to hash values) of two Key objects are equal. If the Key type is a class with operator== () defined, then the default std::equal_to<Key> will work and you won’t need to define a separate Pred.

Changing the Hash and Pred types is easy, but non-obvious. The most common reasons for doing so are either using a custom type as the key for the hash table, as I mentioned above, or using a custom hash function to achieve better performance from the container.

Let’s say we have the following setup:

typedef struct
{
  uint32_t a;
  double x;
  double y;
} AggregateKey;

{ ... }

AggregateKey k;

{...}

std::tr1::unordered_map<AggregateKey, int> M;
M[k] = 1;

{...}

Attempting to compile this will generate a ton of error output. Depending on the situation, it’ll complain about Pred with something like

error: no match for ‘operator==’ in ‘__x == __y’

or about Hash with

undefined reference to `std::tr1::hash<AggregateKey>::operator()(AggregateKey) const'

Fix the issue by defining the following functors:

typedef struct
{
  size_t operator() (const AggregateKey &k) const { return my_hash_fnct (k); }
} AggregateKeyHash;

typedef struct
{
  bool operator() (const AggregateKey &x, const AggregateKey &y) const { return my_eq_test (x, y); }
} AggregateKeyEq;

AggregateKey k;

{...}

// Now, hash value generation and equality testing are
// defined for the AggregateKey type, so declare the
// map using the functors above in the object's template list
std::tr1::unordered_map<AggregateKey, int, AggregateKeyHash, AggregateKeyEq> M;
M[k] = 1;

{...}
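The my_hash_fnct() and my_eq_test() calls above are just placeholders. As a minimal sketch of what they might look like (declared before the functors that use them), you could combine the standard TR1 hashes of each member and compare the key field-by-field; on GCC the TR1 hash templates live in <tr1/functional>:

#include <stdint.h>
#include <tr1/functional>

/* Hypothetical implementations of the placeholders used above:
 * mix the members' individual hashes together, and compare the
 * key field-by-field. */
inline size_t my_hash_fnct (const AggregateKey &k)
{
  size_t h = std::tr1::hash<uint32_t> () (k.a);
  h ^= std::tr1::hash<double> () (k.x) + 0x9e3779b9 + (h << 6) + (h >> 2);
  h ^= std::tr1::hash<double> () (k.y) + 0x9e3779b9 + (h << 6) + (h >> 2);
  return h;
}

inline bool my_eq_test (const AggregateKey &x, const AggregateKey &y)
{
  return x.a == y.a && x.x == y.x && x.y == y.y;
}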

The same logic applies to the std::tr1::unordered_set, with the exception that its template list doesn’t include a class T value type. Also note that you can grab any map’s default hash function and equality predicate at any time from the object:

std::tr1::unordered_map<int, int> M;

const std::tr1::unordered_map<int, int>::hasher &hfn = M.hash_function ();
const std::tr1::unordered_map<int, int>::key_equal &eqfn = M.key_eq ();

long h = hfn (123);
bool b = eqfn (1, 2);

LinkedIn Maps: Your Professional Network as a Pretty Graph

A few days ago, LinkedIn Labs released their InMaps application, allowing you to visualize your LinkedIn network – or possibly more importantly, your sub-networks and their relationships. It’s published as an image to outsiders, but you’re able to interact with your map, zooming in and out of collections of contacts, examining their details and interconnectivity. Mine is here:

I spent a considerable amount of time playing around with this tool, examining what relationships the various clusters of people in my network have with each other. I compared my graph to several of my friends’; it was interesting to note the differences, especially with regard to how closely grouped the clusters of contacts are and how inter-connected the various clusters appear. I think my sub-networks probably appear more inter-connected than many.

It’s also interesting to see the ‘connectors’ – contacts which link clusters together. These are likely friends or colleagues with strong ties to multiple groups from your career; the larger the dot, the more connections they share with your network. The “long arcs” are cool too – someone buried in one of the clusters who happens to know someone else in some other cluster.

InMaps did a great job colouring my contacts as well – I was impressed with its accuracy. I had a couple of moments where I said “Oh, I didn’t know X knew Y. Cool.” It goes to show you the power of data visualization.

Now all someone needs to do is build an app that overlays your LinkedIn contacts with your contacts from Facebook, Twitter, and other networks. Hopefully it’ll be 3D too. And not built with Flash.


Hash Functions

Hash functions are a particular interest of mine, and have been ever since back in university when I began writing password crackers for GNU/Linux systems. I like talking about them so I figured I’d ramble a little bit on here about the subject.

There are two principal uses for these functions:

  1. Cryptography
  2. Hash table mapping

These two uses should probably be considered mutually exclusive. A crypto-based hash function is (relatively) expensive to compute, and a quick hash-table hash function does not actively take steps to prevent reversibility. In cryptography, hash functions are primarily used for things like computing (relatively) unique digests of messages or files, or creating digital signatures and certificates. The most common algorithms belonging to this category are MD5 and SHA-1. However, both of these algorithms have had flaws discovered by security researchers, and are considered compromised for critical cryptographic use. Vulnerabilities have yet to be discovered in the SHA-2 family and WHIRLPOOL, which are currently considered safer alternatives.

Hash functions are typically described by an expression such as this:

h : {0,1}* -> {0,1}^n

Meaning, a hash function h(M) takes an arbitrarily long binary message M and computes a finite binary sequence of length n. Because we’re mapping from a space of infinite binary combinations to a finite string, we can’t expect that the hash of every message is unique. This is the Pigeonhole Principle. But there are some requirements h(M) must meet in order to be considered a secure hash function:

  1. Pre-image resistance: Given hash output H, it should be computationally infeasible to find an M such that h(M) = H. This should require about 2^n operations.
  2. Second pre-image resistance: Given M and its output from h(M) -> H, it should be computationally infeasible to find another M’ such that h(M) = h(M’). Again, this should require about 2^n hashing operations.
  3. Collision resistance: It should be computationally infeasible to find any two different messages M and M’ such that h(M) = h(M’), taking at least 2^(n/2) operations. Only about this many hash operations should be required because of the Birthday Paradox: with q random hashes there are roughly q^2/2 pairs to compare, so a collision becomes likely once q approaches 2^(n/2).

If a proposed attack on a cryptographic hash function violates one of the above properties, the system has been compromised. Note that the biggest difference between (2) and (3) above is that (3) applies to any pair of messages M, whereas (2) refers to finding a collision for a specific message.

In general, a secure hash function will satisfy the following (non-exhaustive) list of properties:

  • The algorithm should be efficient; it should be relatively quick to compute the message digest
  • The produced hash should be irreversible
  • The hash is deterministically created, and all hashes are the same size
  • The hash should indicate nothing about the input message (see avalanche effect)
  • Produced hashes should represent a good distribution over the available finite hash domain

Both MD5 and the current SHA families, as well as many other cryptographic hash functions, are based on an iterative principle called the Merkle-Damgaard construction; a toy sketch of the structure appears below. It works something like the following:

  1. Pad the message to be hashed until its length is a multiple of the desired message block size
  2. Divide the message into blocks
  3. Iteratively mix and compress data from the message and from previous rounds together, passing through nonlinear one-way compression functions, and possibly mixed with auxiliary disturbance vectors
  4. Condense the intermediate hash data into the final hash value of some standard size and output

Graphically, it looks like this:

  • B1..6 are the blocks of the message to be hashed
  • The initial mixing values are just some numbers selected to mix in with the initial message block
  • Each function f() is a one-way compression function (or collection of them)
  • The final function g() collects and finalizes all data from the iterative rounds, and produces the hash
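To make that iterative structure concrete, here is a toy (and decidedly non-cryptographic) sketch of a Merkle-Damgaard-style loop in C++: pad, split into blocks, and fold each block into a running state through a compression function f(). Everything here is a placeholder; the mixing constants are arbitrary, a real construction also appends the message length during padding, and a real f() does far more careful mixing.

#include <cstddef>
#include <stdint.h>
#include <string>
#include <vector>

/* Toy compression function f(): folds one message block into the
 * running state. Real designs use many rounds of far better mixing. */
static uint64_t f (uint64_t state, uint64_t block)
{
  uint64_t x = state ^ (block * 0x9e3779b97f4a7c15ULL);
  x ^= x >> 29;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 32;
  return x;
}

/* Toy Merkle-Damgaard-style hash: {0,1}* -> {0,1}^64. */
uint64_t toy_md_hash (const std::string &msg)
{
  std::vector<uint8_t> padded (msg.begin (), msg.end ());
  padded.push_back (0x80);                 /* mark the end of the message    */
  while (padded.size () % 8 != 0)
    padded.push_back (0x00);               /* pad out to a whole block       */

  uint64_t state = 0x6a09e667f3bcc908ULL;  /* arbitrary initial mixing value */

  for (std::size_t i = 0; i < padded.size (); i += 8)
  {
    uint64_t block = 0;
    for (std::size_t j = 0; j < 8; j++)    /* assemble one 64-bit block      */
      block |= (uint64_t) padded[i + j] << (8 * j);
    state = f (state, block);              /* compress it into the state     */
  }

  /* g() here is trivial: the final state is the hash. */
  return state;
}

The important part is the shape of the loop, which matches the diagram described above: an initial value, a chain of compression steps over the message blocks, and a finalization.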

Although this has been the way cryptographic hash functions have traditionally been designed, there are some newer algorithms which have a different model. For example, MD6 uses a hash tree to allow parallel computation of the hash. WHIRLPOOL uses a Miyaguchi-Preneel based compression function instead of the traditional Merkle-Damgaard based one. There’s been some interesting research into the parallel computation of hashes; it’s been reported that parallel implementations of MD6 can reach over 1 Gb/s in processing speed. Note, however, that non-cryptographic hashes designed for speed can operate at two to three times that in a single thread of execution. MurmurHash is probably my favourite algorithm of this category.

The National Institute of Standards and Technology is actually holding a cryptographic hash function competition to replace the SHA-2 family of hash functions with a modern and more secure variant, which they’ll call SHA-3.


Thoughts on the Go Programming Language

Part of the description of Go from its homepage states that it’s a “compiled language that feels like a dynamically typed, interpreted language”. In my relatively brief experience with the language, that is a statement I can attest to. I feel like an expert in Go could build natively-compiled applications at a pace closer to Python than to C++ or Java. Part of this has to do with its extensive library, and part with the fact that the language is just not particularly complicated. Not to mention that the language’s concurrency model is very intuitive.

My initial interest in Go stemmed from goroutines. These implement concurrency in an extremely simple and highly-abstracted fashion: concurrently-executing functions. There isn’t much more to it than that; goroutines are apparently multiplexed over a pool of host threads which are hidden from the programmer. You can let them spin off on their own and terminate, or synchronize and communicate with them using message-passing in the form of channels. In this sense, concurrency in Go feels a lot like Erlang. Some other cool features include closures and array slices. Technically Go isn’t object-oriented, but you’re able to mimic it by using interfaces and embedding methods into structures. Also, Go is garbage-collected.

Actually, the features of Go make it feel like a combination of a bunch of different languages. There’s obvious inspiration from C, and some from what feels like Java (interfaces, GC), Erlang (message-passing and concurrency), functional languages in general (closures) and Python. Let’s dive into an example which illustrates a few of the things I’ve mentioned so far.

Here’s the (highly synthetic) situation: We’d like to compute the square roots of a list of floating-point numbers in parallel using our own hand-rolled function implementing Newton’s Method. We’re going to compute each square root in a separate goroutine and collect each value through a separate channel to that goroutine. To make things more interesting, we’ll package a bunch of it up using a closure.

/* Compute and return the square root of x. Note
 * that return type comes after function parameters */
func newtons_method (x float64) float64 {

	/* No parentheses around conditional */
	if x == 0.0 {
		panic ("Divide by zero!") // throw exception
	}

	/* "Initialize" statement - variable type inferred by
 	 * compiler */
	last := float64(0.0)
	y := float64(x / 2.0)

	/* No "while" loops - iterate "for as long as" condition */
	for math.Fabs(last - y) > 0.0001 {
		last = y
		y = y - ((y * y - x) / (2 * y))
	}

	return y
}

/* Given a slice of floats, compute their square roots in
 * parallel and print the results. Note variable name
 * precedes the type. */
func compute_sqrts (L []float64) {

	/* Type defined at compile-time */
	n := len (L)

	/* A "slice" definition - used here as expandable array */
	var chans []chan float64

	/* Fill the slice with channels */
	for i := 0; i < n; i++ {
		chans = append(chans, make(chan float64))
	}

	/* For index, element in the list */
	for i, x := range L {

		/* Copy the range variables so the closures
		 * reference the right values */
		__x := x
		__i := i

		/* Define a quick closure and execute it in parallel
		 * as a goroutine via the go () statement */
		go func() {
			/* Calculate square root */
			var root float64 = newtons_method (float64(__x))
			chans[__i] <- root /* Write result to channel */
			close(chans[__i])  /* Close the channel */
		} ()
	}

	for i := range chans {
		x := <- chans[i] /* Read from channel */
		fmt.Println ("Square root of ", L[i], " is ", x)
	}
}

Things to note:

  1. Yes, there are like ten different ways to declare variables
  2. Yes, there are like ten different ways to declare for loop conditions (and no while loops)
  3. Yes, opening braces (‘{‘) must go on the same line as function definition or condition
  4. There is a new allocation directive called make(…) in addition to new(…)
  5. This program is primarily imperative
  6. Even though there is flexibility in how you define variables, they are still very much strongly typed

Obviously, the most interesting part of this little program is the go func() { … } () block near the end of compute_sqrts(). This is a function literal, and in Go, all function literals are also closures; meaning, variables referenced by the function literal stay in scope for as long as necessary for the literal to complete. In addition, this function literal is preceded by go and followed by (). This executes the literal inside a goroutine, which as discussed earlier, will run concurrently with the main thread and all other goroutines (within reason).

The literal itself calls the newtons_method() function, and writes the return value into the closure’s channel. It then closes the channel, which would normally indicate that the closure has completed its work and exited. The values written to each of the respective channels will wait to be read, even after the goroutine has exited.

I mentioned earlier that you can play around with some pseudo-OO in Go, so I’ll give a brief example:

/* Interface definition - any structure which
 * implements newtons_method ()
 * also implements this interface */
type SqrtInterface interface {
	newtons_method ()
}

/* Structure definition - this is also
 * a SqrtInterface type */
type SqrtPair struct {
	x float64
	sqrt_x float64
}

/* Overloaded newtons_method() which is bound
 * to pointers of type SqrtPair. Will reference
 * the members of the 'calling' struct. */
func (sp *SqrtPair) newtons_method () {

	/* ... */

	sp.sqrt_x = newtons_method (sp.x)
}

/* This does nothing except call member functions
 * of SqrtInterface interface objects */
func compute_for_pair (sp SqrtInterface) {
	sp.newtons_method ()
}

func main () {

	/* Note sp has type *SqrtPair */
	sp := new (SqrtPair)
	sp.x = 17
	compute_for_pair (sp)
	fmt.Println ("Square root of ", sp.x, " is ", sp.sqrt_x)
}

I hope that blows your mind, because it blew mine. Let me illustrate what’s going on:

  • Define an interface of type SqrtInterface which is basically empty except for a member function called newtons_method()
  • Define a structure of type SqrtPair with a couple of floats
  • Embed a function called newtons_method() into structures of type SqrtPair*
  • SqrtPair now implements the interface SqrtInterface
  • SqrtPair.newtons_method() can now be called on instantiated SqrtPair structures acting as SqrtInterface objects, and it modifies their internal members even though the method wasn’t present in the initial structure definition

So obviously there’s a lot of cool stuff happening inside Go. TIOBE ranks Go as the world’s 21st most popular programming language at the time of this post (interpret these rankings however you wish), beating out D, arguably one of its closest competitors, at 33rd. GitHub has a huge pile of Go projects. There’s no doubt that Go is a very powerful language, which is even supported by a GCC backend in addition to its standard compiler. It also comes with an enormous standard library with built-in items like HTTP servers. I like Go a lot, but there are some nagging (and probably petty) issues I have with it which probably means I won’t be using it as my language of choice in the near future:

  • Although minimalistic, I find the syntax kind of random. While it can be beneficial to mix the best aspects of Python, Erlang, and C syntax together, I think it makes the end result kind of messy. For example, sliceVar = append(sliceVar, item) vs. mapVar[key] = item
  • I am frustrated that I can define variables ten different ways, including not needing to specify variable type, but I can’t put the type before the variable definition.
  • Similarly, for a language so flexible, I don’t know why it’s required to put braces on opening lines, and wrap one-line loops and if-statements in braces.
  • No templates. How can there be no templates you ask? There just aren’t. I don’t know why.
  • Although technically not object-oriented, you can mix in elements of OO as you see fit. I feel that this can lead to the same mix of paradigms that drives C++ people insane.
  • “Exported” variables and functions are denoted by title-case, getting rid of ‘extern’ and ‘public’. I don’t know what sort of problem this is trying to solve, but I suspect it can just lead to more confusion.

But to end on a positive note, a list of Go features I love playing around with:

  • Goroutines and channels
  • Closures / function literals
  • Huge standard library, including built-in RPC
  • Slices
  • Multiple return values (so good!)

Learning Functional Languages

I took it upon myself to begin learning Haskell recently. This isn’t my first adventure into the realm of functional languages; I spent some time a year or two ago learning Erlang as well. As someone who spends most of their time developing in (and quite frankly, enjoying) C and/or C++, it might seem like a strange foray into bizarro-world. However, I don’t have any definite or particular interest in the functional paradigm, other than that it offers a new perspective on the way software is built, and perspective is good.

From the imperative perspective, functional programming is a bizarre and imposing approach to writing programs. These languages are all concerned with things like side effects, mutability, and state, and it’s often difficult to identify the advantages of functional languages over more conventional ones with respect to these concepts and others. In addition, they often have the stereotype of being less practical and more academic (translation: interesting but useless) technologies – although this is not necessarily the case.

I’ve been learning Haskell from this tutorial, which I recommend not only for its technical completeness, but also the author’s sense of humor. I’m not at the stage yet where I was with Erlang (which was, actually building programs), but I could be if I spent a little more time on the subject. So far, one thing I’ve found to be an incredibly useful aspect of Haskell is:

  • Its clean, rigid syntax

However, the thing that probably drives me the most insane about Haskell is:

  • Its clean, rigid syntax

Make no mistake: when you begin building some application in Haskell, your code will look pretty much identical to what is produced by your coworkers. If you have some sort of personal writing style, or you don’t like the default formatting and structure of the language, too bad. It sort of forces you into structuring your application a certain way, and to be honest, this is invaluable when you’re collaborating with others – you should know what to expect in their work.

Haskell people will also argue that the immutability of variables, the brevity of code, and stateless program design will reduce defects in the final product. I can’t find any numbers to back that up, but it seems like it might have a reasonable amount of truth to it. The tradeoff being, spending five years finding a developer who is an expert in Haskell.

I stopped learning Erlang at the time because I couldn’t stand the syntax; Erlang has a lot of very strange characters and formatting in the language. Certainly, you would have to write considerably less code than for an equivalent program in C++, but in that case I’d use Python over it any day.

Anyways, stay tuned for more feedback about Haskell. Should be an interesting adventure.


ptrace() Tutorial

The ptrace() system call provides functionality, under many *nix systems, that allows one process to examine the data and control the execution of another running process. This includes the ability to read and write register values, arbitrary values from memory, and signals. The tracing process (from here on, the parent) may establish this relationship with the process being examined (the child) either by fork()ing it as a literal child process, or by assuming temporary parental responsibilities for an already-running process by attaching to it during execution. The most useful applications of this system call are building debuggers and process tracing tools. I’m going to focus on the GNU/Linux version of ptrace.

There isn’t a ton of online documentation about how to use ptrace, probably because it’s undoubtedly one of the more infamous and disliked system calls available in POSIX. If you’ve never had to use it before, you’ll find it both an educational and frustrating experience. The manpage isn’t bad, but it lacks a lot of context.

I’m going to try and avoid using the word process from here on, and use task instead. This is because ptrace can be applied to individual threads within a process (at least, under Linux). Due to the complicated ambiguity between threads and processes on Linux, I’ll try to limit confusion and just refer to tasks instead.

Let’s take a look at the function definition from the manpage:


#include <sys/ptrace.h>

long ptrace (enum __ptrace_request request,
             pid_t pid,
             void *addr,
             void *data);

Taking a look at the parameters to the function:

  • __ptrace_request request: A code provided to ptrace telling it which operation to perform. More on this later.
  • pid_t pid: The task ID to perform the operation on
  • void *addr: The address in memory for which to read or write for certain ptrace operations; other operations ignore this parameter.
  • void *data: Address for various data structures to be read/written to/from the process. More on this later as well.

ptrace also returns a long integer, which for all ptrace operations except the peeks is 0 for success and -1 for error. For the peek operations it returns the data read from the child, and -1 on error… sometimes. I’ll cover that too.

As you probably noticed from all the special cases above, this is not a straightforward or simple system call to use. There are a lot of special cases for input and output to consider depending on what you happen to be using it for at any given time. I’ll flesh out some of these further below. But firstly, let’s talk about the behavior of a child task under ptrace.

A child being traced has two basic states: stopped and running. ptrace operations cannot be performed on a running child process, thus, they must be done either when

  1. The child stops on its own
  2. The parent stops the child manually

Typically a process will stop (I’m talking about a ‘T’ status here) when it receives a SIGSTOP signal. However, when being traced, a child will stop upon receiving any signal, with the exception of SIGKILL. This is true even for signals that the child is explicitly ignoring. After receiving notification that the child has stopped via wait(), the parent can take the time to perform various ptrace operations, or can tell the child to continue executing through ptrace, either delivering or ignoring the signal which caused the stoppage.

If the parent process would like the child to stop (for example, after user input in a debugger), it can simply send it a SIGSTOP through the usual methods. Again, technically any otherwise-unused signal besides SIGKILL would do the job, but it’s best to avoid ambiguity. It is important to ensure that the child task is stopped before doing anything to it; otherwise ptrace will return an ESRCH error: “no such process”.

Let me itemize the states involved in stopping, ptrace()-ing, and running a child process in this scenario:

  1. Child process is running
  2. Child process stops after receiving signal (SIGSTOP/SIGTRAP/other)
  3. Parent receives child signal status from wait()
  4. Parent performs various ptrace operations
  5. Parent signals child to continue executing

Any ptrace operations performed outside of step 4 will fail. Make sure that you have appropriately been notified that the child is stopped before trying to use ptrace. I mentioned above using the wait() call to retrieve process status of a traced child process. This is correct – as with a conventionally fork()ed process, the tracing parent uses wait() to determine task state after receiving a signal. In fact, it might be easier to use waitpid() so that you can specify the exact task you’re waiting for, in case you’re tracing multiple tasks/threads simultaneously.

Alright, let’s talk about some of the more interesting ptrace codes. I’ll provide a short code sample of a call for each respective request; any NULL argument is an argument unused by ptrace for that request. Firstly, the codes that deal with initiating and terminating the tracing of the child task.

PTRACE_TRACEME


long ret = ptrace (PTRACE_TRACEME, 0, NULL, NULL);

This is the only ptrace operation which is used by the child. Its purpose is to indicate that the child task is to be traced by a parent and to grant it the necessary ptrace permissions. The 0 in the pid field refers to the child task’s parent. As soon as the child makes a call to any of the exec() functions, it receives a SIGTRAP, at which point it is stopped until the tracing parent allows it to continue. It is important for the parent to wait for this event to happen before performing any ptrace operations, including the configuration operations involved with PTRACE_SETOPTIONS.

PTRACE_ATTACH


long ret = ptrace (PTRACE_ATTACH, target_pid, NULL, NULL);

This is used by a task when it wishes to trace the execution of another task. For the most part, this will make the process represented by target_pid the literal child of the tracing task. By and large, the situation created by using PTRACE_ATTACH is equivalent to what would’ve happened if the child had used PTRACE_TRACEME instead.

An important note is that this operation involves sending a SIGSTOP to the targeted process, and as usual, the parent needs to perform a wait() on target_pid after this call before continuing with any other work to ensure the child has properly stopped.

PTRACE_CONT


long ret = ptrace (PTRACE_CONT, target_pid, NULL, 0);

This will be the request you’ll use each time wait() indicates that the child has stopped after receiving a signal, to get it running again. If the data field is anything besides zero or SIGSTOP, ptrace will figure it’s a signal number you’d like delivered to the process. This can be used to actually deliver signals to the child which caused it to stop and notify the parent before acting on them. For common signals like SIGTRAP, you probably won’t want to do this. However, if you’d like to see if the child properly handles a SIGUSR1, this would be one way to go about it.

PTRACE_DETACH

long ret = ptrace (PTRACE_DETACH, child_pid, NULL, 0);

Ends the tracing relationship between the parent and child, and if the parent attached to the child, “re-parents” the child back to its original parent process. It then continues the child with a SIGCONT.

Now that we’ve covered the basics of how to get a trace running, let’s get to some of the more interesting stuff.

PTRACE_PEEKTEXT | PTRACE_PEEKDATA

long word = ptrace (PTRACE_PEEKDATA, child_pid, addr, NULL);
if (word == -1)
  if (errno)
    fail ();

On GNU/Linux systems, text and data address spaces are shared, so although these two codes would be used interchangeably here, on other UNIX platforms this would not be the case. The purpose of this request is to read words from the child task’s data address space and inspect the values. I mentioned above that peek operations require a little extra effort when detecting errors, which is briefly outlined in the code snippet above. Although ptrace will return -1 for error on a peek operation, -1 may also be the value stored at the provided memory address. Thus, errno must be checked in these situations to ensure an error actually happened.

The utility of this request is obvious – reading values from memory addresses in another task’s address space. If you consider GDB, printing variables or setting breakpoints would all need to use this request.

PTRACE_POKETEXT | PTRACE_POKEDATA

long ret = ptrace (PTRACE_POKEDATA, child_pid, addr, new_val);

The poke requests do the opposite of the peeks – they write arbitrary values into the memory space of the child task. This is useful if you’d like to examine the change in behavior of the child task given different parameters, or for debugging tasks such as inserting breakpoints. This is turning into a pretty long post, but I can cover how to insert breakpoints into a child task’s address space in a later blog post.

PTRACE_SINGLESTEP

long ret = ptrace (PTRACE_SINGLESTEP, child_pid, NULL, NULL);

The single-step request is actually several operations batched into one. A PTRACE_SINGLESTEP request will execute a single instruction in the child task, then stop the child and notify the parent with a SIGTRAP. The operations involved include setting and removing a breakpoint so that only a single instruction is executed. This can be used to slowly step through the execution of a program, and assist with the usage of the other ptrace operations above. Think “stepi” from GDB.

PTRACE_GETREGS | PTRACE_SETREGS

#include <sys/user.h>
struct user_regs_struct regs;
long ret = ptrace (PTRACE_GETREGS, child_pid, NULL, &regs);
#ifdef __x86_64__
regs.rip = 0xdeadbeef;
#elif defined __i386__
regs.eip = 0xdeadbeef;
#endif
ret = ptrace (PTRACE_SETREGS, child_pid, NULL, &regs);

These ptrace requests involve reading and writing the general-purpose register values for the child process. The above example does three things:

  1. Reads the values of all general-purpose registers associated with child_pid
  2. Sets the instruction pointer of the user_regs_struct structure to a not-so-random address
  3. Writes the edited user_regs_struct back to the child, likely causing a crash upon re-execution due to the new instruction pointer setting

Similar functionality is available for the designated floating-point registers as well through the use of PTRACE_GETFPREGS and PTRACE_SETFPREGS.
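To tie the requests above together, here is a minimal sketch of the attach/wait/peek/detach sequence. It assumes target_pid and addr are valid, and it skips most of the error handling you would want in real code:

#include <cstdio>
#include <cerrno>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Attach to a running task, read one word of its memory, then detach.
 * target_pid and addr are assumed to be valid; most error handling is
 * omitted for brevity. */
int inspect_word (pid_t target_pid, void *addr)
{
  if (ptrace (PTRACE_ATTACH, target_pid, NULL, NULL) == -1)
    return -1;

  /* Wait until the SIGSTOP generated by PTRACE_ATTACH has actually
   * stopped the child; ptrace requests made before this point fail. */
  int status;
  if (waitpid (target_pid, &status, 0) == -1 || !WIFSTOPPED (status))
    return -1;

  errno = 0;
  long word = ptrace (PTRACE_PEEKDATA, target_pid, addr, NULL);
  if (word == -1 && errno != 0)
    perror ("PTRACE_PEEKDATA");
  else
    printf ("word at %p: 0x%lx\n", addr, word);

  /* Re-parent the child back to its original parent and let it run. */
  ptrace (PTRACE_DETACH, target_pid, NULL, 0);
  return 0;
}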


Virtual Memory (Theory vs Application part 3)

I think I was reading Reddit a little while ago when a pretty interesting article from Communications of the ACM by Poul-Henning Kamp came up. It might be a little sensationalist, but it’s a pretty valid examination of what happens to program performance when you consider how modern operating systems manage virtual memory. Kamp spent a lot of time building Varnish, and although I’d never heard of it before, it sounds like a very impressive utility. It’s an HTTP accelerator and cache whose performance apparently blows Squid out of the water.

The meat of the article discusses how conventional thinking about the implementation of algorithms and data structures on modern computers is broken because they fail to consider the consequences of cache misses and page faults when accessing data.

On a modern multi-issue CPU, running at some gigahertz clock frequency, the worst-case loss is almost 10 million instructions per VM page fault. If you are running with a rotating disk, the number is more like 100 million instructions. What good is an O(log2(n)) algorithm if those operations cause page faults and slow disk operations? For most relevant datasets an O(n) or even an O(n^2) algorithm, which avoids page faults, will run circles around it.

Kamp provides some images and diagrams to visualize some of what he’s talking about. As a traditionally more systems-focused guy, I might be a little biased, but I feel that modern CS curriculums pump out students who have absolutely no idea about how computers actually work. Knowing the asymptotically optimal algorithm for everything is great; but if your practical implementation sucks because you’re causing paging all over the place, what good is it? A reasonable mixture makes sense – and the curriculum could make room by removing that course on CVS (and maybe even calculus – barf!)

I discussed a little earlier how more intelligent practical implementations can have enormous effects over more naive solutions even though they remain asymptotically equal. Imagine wasting a hundred million instructions because your algorithm’s implementation unnecessarily left some node on disk!


Parallel MapReduce in Python in Ten Minutes

Almost everyone has heard of Google’s MapReduce framework, but very few have ever hacked around with the idea of map and reduce. These two idioms are borrowed from functional programming, and form the basis of Google’s framework. Although Python is not a functional programming language, it has built-in support for both of these concepts.

A map operation involves taking a function f and applying it on a per-element basis to a given list L. For example, if f were the square-root function, then the map would produce the square root of each element in L. A reduce operation (also known as folding) is similar in that it also applies a function g to a given list L, but instead of operating on each element in isolation, g systematically accumulates or collapses the contents of L into a single result. The canonical example of this would be g performing a summation of the elements in L.

Now, there’s a lot of documentation on MapReduce, so I won’t talk about it much here. I’ll just briefly mention that it is an efficient way of distributing the workload of huge embarrassingly parallel problems over multiple processors, or over a grid of machines. The map operation takes input which has been fragmented into pieces by a master node, and applies its function to the input on each of the worker nodes in the system. The data is then organized and the useful information extracted by the reduce step on various worker nodes in the system. Google provides a diagram of the operation in their documentation:

Pretty straightforward... right?

Anyways, this post isn’t specifically about MapReduce, it’s about Python. I mentioned above that Python provides built-in map and reduce functionality. Let’s say we have a problem where we have several lists, and we’d like to know the total number of elements in all the lists combined. In Python, a solution (using lambdas for brevity) might look like this:

a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]
L = map(lambda x:len(x), [a, b, c])

# L == [3, 4, 5]
N = reduce(lambda x, y: x+y, L)
# N == 12
# Or, if we want to be fancy and do it in one line
N = reduce(lambda x, y: x+y, map(lambda x:len(x), [a, b, c]))

Now, this is an interesting way to think about getting the combined lengths of these lists, but it isn’t any more efficient than getting the lengths with any other conventional method. But what if we had a large problem which we could solve in parallel?

Let’s pretend we’re interested in finding the frequencies of title-cased words (proper nouns, starts of sentences) in an enormous corpus, and we’d like to use map and reduce to help compute this problem on our multi-core computer. Most people’s first instinct would be “let’s break out the Python threading module!” which, of course, would be a mistake – thanks to something called the global interpreter lock used in the Python interpreter. The GIL basically handicaps parallelism in the Python interpreter, by serializing execution of user-threads for various complicated reasons. A solution is to use the Python multiprocessing module, which will spawn entirely separate processes to execute common code on. And, as you’d expect from Python, it provides you with a super-easy-to-use Pool class to manage your own worker process pool.


Probably the best thing about the Pool class is that it provides a map function of its own – which will automatically partition and distribute input to a user-specified function in a pool of worker processes. This is like map in the functional sense, but it doesn’t necessarily mean it’s performing the map step in MapReduce – you’ll see what I mean. So let’s see some code for the problem I mentioned above; we’ll start with the map, partition, and reduce functions:

"""
Given a list of tokens, return a list of tuples of
titlecased (or proper noun) tokens and a count of '1'.
Also remove any leading or trailing punctuation from
each token.
"""
def Map(L):

  results = []
  for w in L:
    # True if w contains non-alphanumeric characters
    if not w.isalnum():
      w = sanitize (w)

    # True if w is a title-cased token
    if w.istitle():
      results.append ((w, 1))

  return results

"""
Group the sublists of (token, 1) pairs into a term-frequency-list
map, so that the Reduce operation later can work on sorted
term counts. The returned result is a dictionary with the structure
{token : [(token, 1), ...] .. }
"""
def Partition(L):
  tf = {}
  for sublist in L:
    for p in sublist:
      # Append the tuple to the list in the map
      try:
        tf[p[0]].append (p)
      except KeyError:
        tf[p[0]] = [p]
  return tf

"""
Given a (token, [(token, 1) ...]) tuple, collapse all the
count tuples from the Map operation into a single term frequency
number for this token, and return a final tuple (token, frequency).
"""
def Reduce(Mapping):
  return (Mapping[0], sum(pair[1] for pair in Mapping[1]))

So, what’s going on here? Well, there are three steps to this process. First of all, the Map function receives independent lists of tokens from the source text. In parallel, it will attempt to strip off any leading or trailing punctuation from each token (via the sanitize() helper below), and if the token is a properly capitalized word, it is added to the results list with a count of 1. All tokens are processed in this way, and duplicates are also appended to this list.

After the Map operation is complete, the program needs to organize all the processed data. This is where the Partition function comes in. This function receives a huge list of lists – each constituent list was processed by an instance of Map and thus contains a sequence of (token, 1) pairs. This function organizes all the data into a dictionary, so that now each token key has a list of (token, 1) tuples as a value.

The final step is in the Reduce function. It receives arbitrary key:value pairs from the dictionary built in Partition, which it uses to generate the final term counts for title-cased words. In parallel, the list in each dictionary entry is collapsed, summed, and the final count for each token key is returned by each instance of Reduce. The final result is a list of (token, count) tuples, where count refers to the total count of the token over the entire document.

Let’s get to the Python multiprocessing code, and the other stuff we need to make the above functions work. I’ll create a worker pool of eight processes, and distribute the above workload to those processes via the Pool.map function:

import sys
from multiprocessing import Pool
"""
If a token has been identified to contain
non-alphanumeric characters, such as punctuation,
assume it is leading or trailing punctuation
and trim them off. Other internal punctuation
is left intact.
"""
def sanitize(w):

  # Strip punctuation from the front
  while len(w) > 0 and not w[0].isalnum():
    w = w[1:]

  # Strip punctuation from the back
  while len(w) > 0 and not w[-1].isalnum():
    w = w[:-1]

  return w
"""
Load the contents of the file at the given path and
return them as a list of whitespace-separated tokens.
"""
def load(path):

  word_list = []
  f = open(path, "r")
  for line in f:
    word_list.append (line)

  # Efficiently concatenate Python string objects
  return (''.join(word_list)).split ()

"""
A generator function for chopping up a given list into chunks of
length n.
"""
def chunks(l, n):
  for i in xrange(0, len(l), n):
    yield l[i:i+n]

"""
Sort tuples by term frequency, and then alphabetically.
"""
def tuple_sort (a, b):
  if a[1] < b[1]:
    return 1
  elif a[1] > b[1]:
    return -1
  else:
    return cmp(a[0], b[0])

if __name__ == '__main__':

  if (len(sys.argv) != 2):
    print "Program requires path to file for reading!"
    sys.exit(1)

  # Load file, stuff it into a string
  text = load (sys.argv[1])

  # Build a pool of 8 processes
  pool = Pool(processes=8,)

  # Fragment the string data into 8 chunks
  partitioned_text = list(chunks(text, len(text) / 8))

  # Generate count tuples for title-cased tokens
  single_count_tuples = pool.map(Map, partitioned_text)

  # Organize the count tuples; lists of tuples by token key
  token_to_tuples = Partition(single_count_tuples)

  # Collapse the lists of tuples into total term frequencies
  term_frequencies = pool.map(Reduce, token_to_tuples.items())

  # Sort the term frequencies in nonincreasing order
  term_frequencies.sort (tuple_sort)

  for pair in term_frequencies[:20]:
    print pair[0], ":", pair[1]

As you can see, with three calls to the Python multiprocessing.Pool class – the Pool() constructor and the two pool.map() calls – I can create a worker pool of 8 processes, and perform a distributed map and a distributed reduce operation. Pool.map takes care of the partitioning and distribution of the data over the worker pool by itself. Don’t get confused by the two calls to map though – although each is technically a map operation, the first call is mapping the Map function over the data, and the second is mapping the Reduce function over it.

Finally, it prints out the top twenty token:count pairs. As you can see, the Python multiprocessing code is really easy to use. You don’t need to organize your problem as a MapReduce problem in order to properly use the process pool class either – I just used this opportunity for illustration. This definitely isn’t the most efficient example of how to use map and reduce to solve big problems, but I hope it’s a reasonably clear introduction. Happy Hacking!

