Efficiency isn’t just speed

Scripting languages such as Python tend not to be the fastest of the pack. However, an aspect that is sometimes even more frustrating, and which has lately been a true annoyance, is the way they use (or abuse) memory. This issue really hit home a few weeks ago when I wanted to convert a medium-sized graph (~10k vertices) from one format to another. For small graphs, I usually just pop open IPython, read in the graph using NetworkX and spit it out in the new format. I know that NetworkX’s I/O isn’t the fastest in town, but I can always go and get a coffee . . . right? Yes, except when my machine runs out of RAM and starts paging to disk! In this case, even roasting, grinding and brewing some high-quality coffee beans won’t provide an inadequate implementation with the time it needs to complete what should be a fairly straightforward and simple operation. I asked myself, “is it reasonable for a library (or language) to require 12GB of memory to read in a 10,000-vertex (sparse) graph?” My first answer, of course, was “no.”

I was chalking this horrible inefficiency up, at least in part, to NetworkX’s “turtles (dicts) all the way down” representation of graphs. Python dictionaries are actually relatively fast, but they’re not exactly designed for memory efficiency. Last week, however, one of my lab mates was trying to speed up some code that computes the density of a set of (1,000) relatively small graphs. My initial inclination was that the code was primarily I/O bound, as she was parsing the input graphs from a set of ASCII edge list files. I suggested that she might substantially mitigate the I/O cost by reading all of the graphs in once and then processing them in memory. What really struck me when profiling the code, however, was the memory footprint that results from this fairly straightforward approach.

To give a notion of the relatively small scale of the processing, all of the separate graph files, concatenated together, constitute only 403M of text (44.72 million edges). I wrote a simple Python script to read this file in and to store the edges as a list of tuples — thus avoiding the potential memory “black hole” imposed by NetworkX. So what’s the result of this fairly direct approach? Well, it’s the rapid consumption of ~6.5G of memory. Yup, to read in and store a set of edges that, in plain text, constitute a mere 403M, Python requires 6.5G of memory! Given that most machines today ship with 8–16G of memory, this won’t necessarily result in the horrible paging-to-disk experience I recounted earlier, but it’s still pretty dismal.

How does C++ (translated in the most direct way I can think of) compare on this simple task? Just shy of 700M. Like I said, the translation from Python to C++ was as straightforward as I could make it. Python’s list of tuples of “int”s became a C++ vector of tuples of “size_t”s. The C++ code was also about an order of magnitude faster, but that’s somewhat beside the point I’m trying to make here. This truly simple task requires no complicated data structures, no complex bookkeeping and, nominally, nothing but trivial work for Python’s GC. So what really astounded me is that, for such a simple task, Python consumed an order of magnitude more memory than C++. Obviously, the argument for “high-level” languages like Python is that one is willing to incur some overhead for the benefit obtained by working at the “higher level” of abstraction provided by the language. However, every once in a while, I think we should take a step back, look at the type and magnitude of the overhead we’re incurring, and ask ourselves how reasonable the trade-offs we’re making really are. I’d argue that, for a task like this, an order of magnitude in speed and memory doesn’t really seem like an acceptable overhead. For the curious observer, I’ve put the code to perform this test up on GitHub here. I’ve left the data file out (because it’s 403M), but you can generate an equivalent file with the GenEdges script in the repo. I might update this repo with some other languages in the future if I continue to be astounded by certain languages’ inefficiencies.
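
To give a concrete picture of what “as straightforward as I could make it” means, here’s a minimal sketch of the C++ version (not the exact code from the repo; it assumes one whitespace-separated “src dst” pair per line, and the names are just for illustration):

[cc lang='cpp']
#include <cstddef>
#include <fstream>
#include <iostream>
#include <tuple>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <edge-list-file>\n";
        return 1;
    }

    std::ifstream in(argv[1]);
    std::vector<std::tuple<std::size_t, std::size_t>> edges;

    // Read one "src dst" pair per line and store it directly;
    // no extra bookkeeping, just a growing vector of tuples.
    std::size_t src, dst;
    while (in >> src >> dst) {
        edges.emplace_back(src, dst);
    }

    std::cout << "read " << edges.size() << " edges\n";
    return 0;
}
[/cc]

At two 8-byte “size_t”s per edge, 44.72 million edges come out to roughly 700M of raw storage, so the C++ version carries essentially no overhead beyond the data itself.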


And time marches forward . . . just like a progress bar.

It’s only February, and already, I’ve missed a post.  I guess I’ll add that to the queue and try to post twice in one of the coming weeks.

This week I have a relatively short post.  However, if it’s still too long to read, at least make sure you check out the nifty progress bar that resulted from me playing with the C++11 chrono facilities.

Anyone who knows me will know that I’m fond of progress bars for any computation that requires more than a few minutes.  If there’s a process that’s going to take more than, say, 5 minutes, I find it incredibly helpful to know if I should expect to be waiting 10 minutes (and go get a cup of coffee), 100 minutes (and go to the gym), or longer (and go write my own progress bar for whatever poor application I happen to be using).  For most of my previous C++ projects, I used Boost’s progress bar.  Unfortunately, this reliable old tool has been deprecated, and including it can lead to conflicting declarations with some of Boost’s newer and very useful timing utilities.

After moving away from Boost’s solution, I searched around for a while and settled on ezProgressBar: a lightweight set of headers that provide pre-stylized progress bars.  This library is nice, and doesn’t conflict with any of the other Boost timing utilities, but one day, after just arriving at work, I decided I needed a Monday morning project.  What I mean by a Monday morning project is a small but satisfying project, the first revision of which can usually be completed in an hour or so, to help get you into a groove when you’re not really motivated to jump into a larger or more foreboding piece of code.  So I decided that I’d retrofit ezProgressBar to use C++11’s new timing facilities (i.e. std::chrono).  The motivation for this (besides the normal Monday morning project motivation) was to get a little bit of practice with C++11’s new date/time classes, and to get rid of some of the platform-specific cruft present in the aforementioned library, which becomes unnecessary if one can assume C++11 compatibility of the hosting compiler.

All things considered, I must say that I’m fairly impressed with the design of std::chrono.  Its designers have largely eliminated nagging questions like “what unit of time does this function take?”  Also, duration_cast gets rid of the tedious work of converting between different units of time, thus helping to eliminate those annoying bugs that result from miscounting the zeros in a conversion factor.  The standard library also makes it easy to define your own time durations, and to have these user-defined durations fit in almost seamlessly with the built-in types.  Let’s take a look at a straightforward translation of ezProgressBar’s durationString function into C++11.  First, the original:

[cc lang='cpp']
std::string secondsToString(long long t) {
    int days = t / 86400;
    long long sec = t - days * 86400;
    int hours = sec / 3600;
    sec -= hours * 3600;
    int mins = sec / 60;
    sec -= mins * 60;
    char tmp[8];
    std::string out;

    if (days) {
        sprintf(tmp, "%dd ", days);
        out += tmp;
    }

    if (hours >= 1) {
        sprintf(tmp, "%dh ", hours);
        out += tmp;
    }

    if (mins >= 1) {
        sprintf(tmp, "%dm ", mins);
        out += tmp;
    }

    if (sec >= 1) {
        sprintf(tmp, "%ds", (int)sec);
        out += tmp;
    }

    if (out.empty())
        out = "0s";

    return out;
}
[/cc]

Now, the C++11 code:

[cc lang='cpp']
template <typename Rep, typename Period>
std::string durationString(std::chrono::duration<Rep, Period> t) {
    using namespace std::chrono;
    // C++11 has no predefined "days" duration, so we define one:
    // a day is 86400 seconds.
    typedef std::chrono::duration<long, std::ratio<86400>> days;

    std::stringstream out(std::stringstream::out);

    if (t >= days(1)) {
        auto numDays = duration_cast<days>(t);
        out << numDays.count() << "d ";
        t -= numDays;
    }

    if (t >= hours(1)) {
        auto numHours = duration_cast<hours>(t);
        out << numHours.count() << "h ";
        t -= numHours;
    }

    if (t >= minutes(1)) {
        auto numMins = duration_cast<minutes>(t);
        out << numMins.count() << "m ";
        t -= numMins;
    }

    if (t >= seconds(1)) {
        auto numSecs = duration_cast<seconds>(t);
        out << numSecs.count() << "s";
    }

    std::string tstring = out.str();
    if (tstring.empty()) {
        tstring = "0s";
    }
    return tstring;
}
[/cc]

The point of showing this snippet is not the code length . . . both are relatively short pieces of code and, undoubtedly, both could be made shorter.  The point I want to draw attention to is how much less type and unit “fudging” is going on in the second sample than in the first.  The first version takes a long long . . . why?  What units does this number represent: seconds, milliseconds, something else?  The second version takes a duration (of time).  While the duration itself doesn’t specify the units in the type declaration, it “knows” about the units.  This is how duration_cast allows us to cast our duration to an array of other units.  The second point of contrast is all of the “magic” numbers that appear in the first snippet.  Now, these numbers aren’t really “magic,” and most people looking at the code will get where, for example, the 3600 comes from.  However, if I have a duration, t, and I want to know what it means in hours, minutes, or seconds, I can just do the following:

[cc lang='cpp']
auto numHours = duration_cast<hours>(t);
auto numMinutes = duration_cast<minutes>(t);
auto numSeconds = duration_cast<seconds>(t);
[/cc]

I don’t have to know what unit t is originally expressed in, and this makes extracting the information I want much clearer and less error prone.  Also, defining your own duration (and having it act like a built-in type) is really straightforward:

[cc lang='cpp']
typedef std::chrono::duration<long, std::ratio<604800>> weeks;
[/cc]

This defines a “weeks” type.  One week is 604,800 seconds, so the ratio of one week to one second is 604800:1; to express this, we use the std::ratio type.  If we had wanted to create a type to represent tenths of a second, we could just do

[cc lang='cpp']
typedef std::chrono::duration<long, std::ratio<1, 10>> deciseconds;
[/cc]

One does have to know that the base “unit” of time is the second, but everything else falls out nicely, and the representation is pretty clear (at least to me).
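
To see how these user-defined types mix with the built-in ones, here’s a small self-contained sketch using the two typedefs above (the particular values are just for illustration):

[cc lang='cpp']
#include <chrono>
#include <iostream>

typedef std::chrono::duration<long, std::ratio<604800>> weeks;
typedef std::chrono::duration<long, std::ratio<1, 10>> deciseconds;

int main() {
    using std::chrono::duration_cast;

    // A user-defined duration casts to a built-in one like any other.
    weeks w(2);
    auto numHours = duration_cast<std::chrono::hours>(w);
    std::cout << numHours.count() << " hours in 2 weeks\n"; // 336

    // And back the other way: 2,500ms is 25 tenths of a second.
    auto ds = duration_cast<deciseconds>(std::chrono::milliseconds(2500));
    std::cout << ds.count() << " deciseconds\n"; // 25

    return 0;
}
[/cc]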

Akin to the duration type, there’s also a time_point type, and arithmetic operations on these work just like one might expect.  Take, for example, the following:

[cc lang='cpp']
auto dt = endTime - startTime;
[/cc]

We subtract the time_point startTime from the time_point endTime, and we get a duration that we can cast into any units we like. What’s the number of seconds represented by dt?

[cc lang='cpp']
auto numSecs = duration_cast<seconds>(dt);
[/cc]

It’s as simple as that.  We don’t need to know whether startTime and endTime are millisecond counts or seconds since the epoch.
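
Putting the pieces together, here’s a minimal, self-contained sketch of the pattern (the sleep just stands in for whatever work is being timed):

[cc lang='cpp']
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    // steady_clock is the right tool for measuring intervals,
    // since it is guaranteed never to run backwards.
    auto startTime = std::chrono::steady_clock::now();

    // Stand-in for the actual work being timed.
    std::this_thread::sleep_for(std::chrono::milliseconds(1500));

    auto endTime = std::chrono::steady_clock::now();
    auto dt = endTime - startTime; // a duration; the units come along for free

    auto numSecs = std::chrono::duration_cast<std::chrono::seconds>(dt);
    std::cout << "elapsed: " << numSecs.count() << "s\n"; // prints "elapsed: 1s"

    return 0;
}
[/cc]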

All in all, I’m fairly happy with the timing facilities in C++11, and I encourage you to check them out.  You can take a look at the source code of the progress bar if you want to see a little bit more, or just look through the docs and play around with std::chrono yourself.  Either way, I got a nice little progress bar library out of my efforts, and I even added a few new features beyond the retrofitting of the timing facilities (compile the progress bar with HAVE_ANSI_TERM for a little bit of fun).  I’ll probably continue to add more features and configuration options as I find them useful, so feel free to use the progress bar in your C++ project or to fork me on GitHub!


How should we deal with “unacceptable” laws?

Note: I plan to return to the “beauty in programming languages” topic soon, but the story this week was just too absurd and pressing not to comment on.

So, the change in the law concerning the legality of unlocking one’s cell phone has been all over the news this week.  Basically, the law makes the unauthorized unlocking of one’s phone a crime punishable by a fine of up to $500,000 and 5 years in prison.  Let’s take a step back and consider how fundamentally broken a system that allows a law like this actually is.  Conceivably, one could be fined an outrageous sum of money and actually spend time in prison for unlocking his own cellphone.  While I expect (at least I really hope) that we won’t see this law broadly applied, especially to the extent that anyone receives this outrageous maximum penalty, the fact that such a law even exists, and that such a penalty is even possible, indicates how regressive we’ve become concerning issues of ownership and property as they relate to software, hardware and hacking.

Essentially, we’ve arrived at a place, legally, where, from a property rights standpoint, you can no longer rightly be said to own most of the hardware and software you buy.  Put aside for the moment all of the tricky questions related to owning software (i.e. what it means to own a series of perfectly and nearly freely reproducible bits).  We’re not talking about anti-piracy laws here, though most of the government’s current suggestions on that front look draconian at best; we’re talking about the freedom to unlock your cellphone, install a custom firmware and remove encryption protocols that prevent you from uploading, say, a program that you wrote to your own device.  What we’re talking about here is placing a potentially extremely punitive penalty on tinkering.  This law is technically under the regulatory authority of the Librarian of Congress, granted by the counterproductive, ill-informed, naive and, frankly, dangerous Digital Millennium Copyright Act (DMCA) signed into law under President Clinton in 1998.  Since its inception, the DMCA has been a largely destructive force, hampering progress and leading companies to put ever more restrictive conditions on how you can use the products you purchase from them.  It’s also the law that gives digital rights management (DRM — a.k.a. digital restrictions management) enforcement its teeth, since it makes circumventing encryption illegal, even for perfectly benign purposes that don’t involve accessing or distributing sensitive information.

So the question that all of this really raises in my mind is “How should we deal with these types of unacceptable laws?”  I don’t have an answer to this question, and the question itself is rather loaded and difficult to characterize.  For example, what constitutes unacceptable, and who gets to apply that label to a law?  However, assuming that a law, like the one this post is about, passes a reasonable test to be considered unacceptable, how should we, the informed but legislatively powerless citizenry, respond?  How should one inform his representatives, who themselves may truly be ignorant of the underlying issues, that such a law is simply unacceptable?  How does one organize a response significant enough (like the anti-SOPA protests) that the message to the legislature is crystal clear — “this will not stand”?  I don’t know what the most effective strategy is to ensure that our copyright-related laws regain some modicum of rationality, but I do know that we need to figure out how to send a clear, unambiguous and decisive message to those who draft and execute these laws.  At least the message itself is simple; it’s “NO!”


Python: The Deceptively Beautiful Serpent