Scripting languages such as Python tend not to be the fastest of the pack. However, an aspect that is sometimes even more frustrating, and which has been a true annoyance lately, is the way they use (or abuse) memory. This issue really hit home a few weeks ago when I wanted to convert a medium-sized graph (~10k vertices) from one format to another. For small graphs, I usually just pop open IPython, read in the graph using NetworkX, and spit it out in the new format. I know that NetworkX’s I/O isn’t the fastest in town, but I can always go and get a coffee . . . right? Yes, except when my machine runs out of RAM and starts paging to disk! In that case, even roasting, grinding, and brewing some high-quality coffee beans won’t buy an inadequate implementation the time it needs to complete what should be a fairly straightforward and simple operation. I asked myself, “is it reasonable for a library (or language) to require 12GB of memory to read in a 10-thousand-vertex (sparse) graph?” My first answer, of course, was “no.”
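To be concrete, the conversion itself is nothing exotic; it is essentially the two-liner sketched below. The specific formats aren’t the point, so the file names and the read_edgelist / write_graphml pair here are just stand-ins for whichever reader and writer you happen to need.

    import networkx as nx

    # Minimal sketch of the conversion; the file names and the particular
    # reader/writer pair are placeholders, not the formats I was actually using.
    G = nx.read_edgelist("graph.edges")      # ~10k vertices, sparse
    nx.write_graphml(G, "graph.graphml")     # emit the graph in the target format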
I was chalking this horrible inefficiency up, at least in part, to NetworkX’s “turtles (dicts) all the way down” representation of graphs. Python dictionaries are actually relatively fast, but they’re not exactly designed for memory efficiency. Last week, however, one of my lab mates was trying to speed up some code that computes the density of a set of (1,000) relatively small graphs. My initial inclination was that the code was primarily I/O bound, since she was parsing the input graphs from a set of ASCII edge-list files. I suggested that she might substantially mitigate the I/O cost by reading all of the graphs in once and then processing them in memory. What really struck me when profiling the code, however, was the memory footprint that results from this fairly straightforward approach.
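I don’t have her code in front of me, but the pattern I was suggesting is roughly the sketch below; the directory layout and edge-list format are assumptions on my part.

    import glob
    import networkx as nx

    # Parse every edge-list file up front...
    graphs = [nx.read_edgelist(path) for path in glob.glob("graphs/*.edges")]

    # ...then do all of the per-graph work (here, just the density) in memory.
    densities = [nx.density(G) for G in graphs]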
To give a sense of the relatively small scale of the processing, all of the separate graph files, concatenated together, constitute only 403M of text (44.72 million edges). I wrote a simple Python script to read this file in and store the edges as a list of tuples, thus avoiding the potential memory “black hole” imposed by NetworkX. So what’s the result of this fairly direct approach? Well, it’s the rapid consumption of ~6.5G of memory. Yup, to read in and store a set of edges that, in plain text, constitute a mere 403M, Python requires 6.5G of memory! Given that most machines today ship with 8-16G of memory, this won’t necessarily result in the horrible paging-to-disk experience I recounted earlier, but it’s still pretty dismal.
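The actual script lives in the repo linked below, but it boils down to something like the following; I’m assuming here that each line holds a whitespace-separated pair of integer vertex IDs.

    # Read a plain-text edge list and hold every edge as a tuple of ints.
    edges = []
    with open("edges.txt") as f:
        for line in f:
            u, v = line.split()
            edges.append((int(u), int(v)))

    print(len(edges))   # ~44.72 million edges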
How does C++ (translated in the most direct way I can think of) compare on this simple task? Just shy of 700M. Like I said, the translation from Python to C++ was as straightforward as I could make it: Python’s list of tuples of “int”s became a C++ vector of tuples of “size_t”s. The C++ code was also about an order of magnitude faster, but that’s somewhat beside the point I’m trying to make here. This truly simple task requires no complicated data structures, no complex bookkeeping, and, nominally, nothing but trivial work for Python’s GC. So what really astounded me is that, for such a simple task, Python consumed an order of magnitude more memory than C++.

Obviously, the argument for “high-level” languages like Python is that one is willing to incur some overhead for the benefit obtained by working at the “higher level” of abstraction provided by the language. Every once in a while, though, I think we should take a step back, look at the type and magnitude of the overhead we’re incurring, and ask ourselves how reasonable the trade-offs we’re making really are. I’d argue that, for a task like this, an order of magnitude in both speed and memory doesn’t seem like an acceptable overhead. For the curious observer, I’ve put the code to perform this test up on GitHub here. I’ve left the data file out (because it’s 403M), but you can generate an equivalent file with the GenEdges script in the repo. I might update this repo with some other languages in the future if I continue to be astounded by certain languages’ inefficiencies.
I’ve been noticing memory blowups like this from Python lately too, namely from dicts and sets. Do you have any idea how/why Python is using so much more memory for the list of int tuples?
I don’t have much insight into why, but I did dig a little deeper using the heapy tool to examine the memory usage. Here are the results (the heapy invocation itself is sketched just after the table):
Partition of a set of 125715438 objects. Total size = 5535599424 bytes.
 Index     Count   %       Size   % Cumulative   %  Kind (class / dict of class)
     0  44726048  36 3220334512  58 3220334512  58  tuple
     1  80969471  64 1943267304  35 5163601816  93  int
     2       176   0  369123712   7 5532725528 100  list
     3     11809   0     983136   0 5533708664 100  str
     4       323   0     267464   0 5533976128 100  dict (no owner)
     5        67   0     215368   0 5534191496 100  dict of module
     6       201   0     214488   0 5534405984 100  dict of type
     7      1633   0     209024   0 5534615008 100  types.CodeType
     8      1596   0     191520   0 5534806528 100  function
     9       201   0     178784   0 5534985312 100  type
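For reference, heapy ships as part of the guppy package, and the dump above comes from something along these lines, run right after the edge list has been read in:

    from guppy import hpy   # heapy lives in the guppy package

    h = hpy()
    # ... read the edge list into the big list of tuples here ...
    print(h.heap())          # prints the partition-by-type table shown above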
So, it looks like it’s actually just using a ton of memory to hold the ints and tuples. Also, if you divide the sizes by the counts there, you can see that the per-tuple and per-int overheads are really high (about 72 and 24 bytes, respectively). Since the counts are close to what we expect (the 44,726,048 tuples match the 44.72 million edges, and the int count is roughly two per edge), I don’t really know what to blame the crazy memory usage on other than the per-object bloat of the tuple and int classes.
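You can see essentially the same per-object numbers straight from sys.getsizeof; on a 64-bit CPython 2.7 these come out to 72 and 24 bytes, matching the ratios above, though the exact values shift a bit across Python versions.

    import sys

    # Per-object sizes on a 64-bit CPython 2.7 (other versions differ slightly).
    print(sys.getsizeof((1, 2)))   # 72: object header + two 8-byte element pointers + GC bookkeeping
    print(sys.getsizeof(1))        # 24: even a bare int is a full heap object (refcount, type pointer, value)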