Published July 15, 2016

Big Data has been a hot topic in technology circles for several years now. Thanks to hardware and software advances and improved purchasing and deployment processes, we are able to employ more systems to analyze more data than ever before.

For nearly all consumers of technology and “big data,” scaling to process data is a question of economics, not a question of technological feasibility. Some constraints that organizations must recognize include infrastructure maintenance and personnel skills and availability, in addition to storage size, processing speed, and money. Other limiting factors come into play as well:

  • Single nodes typically do not have more than a few gigabytes of memory. Larger data sizes must be processed by loading only a portion of data into memory at a time.
  • As data sizes grow, storage input/output bandwidth and latency have a greater impact on performance.
  • It is time-consuming and often costly to transfer data over the network. Care must be taken to not do this more often than is necessary.
  • Problems with algorithmic efficiency can become painfully obvious. A naive algorithm may work great with a thousand records and tolerably with one hundred thousand records, but could fall over completely with ten million records.
  • Even with efficient algorithms, many data processing tasks require significant CPU resources to complete in a timely fashion.

If producing a weekly report in ten minutes instead of two hours requires the deployment of a system that costs one day per month in maintenance, is this a net loss or a net gain? It is a net gain only if having the weekly reports one hour and fifty minutes sooner is worth at least one day of maintenance costs per month. Otherwise, it’s a net loss.

Big Data vs. “big” data

When processing and transforming data, technologists are often confronted with data that looks neither “big” nor “small.” Small data—topping out somewhere between many megabytes and a gigabyte—can generally be processed in seconds or minutes on a personal computer. Big Data—which starts somewhere in the 100 terabyte range—is best handled with the technologies and infrastructure designed for the purpose of working with data of that size. At the small end, the infrastructure maintenance and personnel costs of novel data processing are trivial, and at the large end these costs are at least predictable, if not always easily manageable.

But what about the broad range between a gigabyte and 100 terabytes? This is where technologists must closely consider the constraints of the software and hardware systems they’re using to process the information.

What is Big Data? If it fits on a smartphone, it’s not Big Data. If it doesn’t fit on a smartphone but is smaller than 100TB, it’s probably still not Big Data.

How can we effectively process data that isn’t really big enough for Big Data tools (like Apache Spark, Hadoop, and similar distributed processing frameworks) but that is still big enough that the simplest systems might not always work?

An Example with “Big” Data

Simple text processing tasks can be done relatively easily in a wide variety of general-purpose programming and scripting languages. However, extra care must be taken in these languages: the simplest and most natural ways to express an operation may prevent the code from handling larger input sizes. Our goal is to avoid that accidental complexity and to spend as little code as possible on inessentials like compiling regular expressions and reading input.

As part of some exploratory programming at Unizin, I had to extract all of the HTTP URLs from unformatted text content.

For purposes of this article, we’re going to use the entire content of all Wikipedia articles, which the Academic Torrents site has listed for download. Uncompressed, this is around 46 gigabytes of XML text—a perfect example of our “big enough” data, not Big Data.

This example will assume a specific implementation of the AWK language, GNU Awk. It includes several enhancements that make data processing tasks easier, along with better profiling and debugging support than other AWK implementations.
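
Those tools are easy to reach from the command line. The invocations below are only a sketch (sample.txt stands in for any small test input, and urlfind.awk is the script we end up with later in this article): --lint flags dubious or non-portable constructs, --profile writes an execution profile to awkprof.out by default, and -D starts gawk’s interactive debugger.

$ gawk --lint -f ~/urlfind.awk sample.txt > /dev/null
$ gawk --profile -f ~/urlfind.awk sample.txt > /dev/null
$ less awkprof.out
$ gawk -D -f ~/urlfind.awk sample.txt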

The basic structure of an AWK program is:

PATTERN_1 {
    ACTION_1
}

PATTERN_2 {
    ACTION_2
}

The input to an AWK program is a sequence of records read from an input source. For each record, PATTERN_1, PATTERN_2, and so on are matched against the record in turn, and for each matching pattern, the corresponding ACTION is executed.
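
As a quick illustration of that execution model (a made-up example, separate from the URL extractor we build below), the following program prints every record containing “ERROR”, counts all records, and reports the total at the end. BEGIN and END are special patterns that run before the first record is read and after the last one is processed:

BEGIN   { count = 0 }                          # runs once, before any input is read
/ERROR/ { print }                              # runs only for records matching the regex
        { count = count + 1 }                  # no pattern: runs for every record
END     { print count, "records processed" }   # runs once, after the last record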

First, let’s specify what we’d like to see. Given some input like:

I searched http://www.google.com and http://bing.com, and found
the page https://en.wikipedia.org/wiki/AWK_(disambiguation). <a
href="https://unizin.org">Unizin</a> [link](http://example.com)

We’d like our program to output:

http://www.google.com
http://bing.com
https://en.wikipedia.org/wiki/AWK_(disambiguation)
https://unizin.org
http://example.com

So it needs to tolerate things like Wikipedia’s penchant for including parentheses in URLs, URLs embedded in plaintext prose (with adjacent punctuation), and common markup. The link extractor does not need to be 100% accurate, but obviously we’d like to cover as many edge cases as possible.

The first thing to note is that we’d really like to apply our patterns/actions to words (strings separated by whitespace), rather than lines, which are the default record type. In GNU Awk, this is as simple as adding the line…

BEGIN { RS = "[[:space:]]+" }

…to our script. This changes the record separator (RS) from the default—a newline—to any string matching the regular expression [[:space:]]+, which is one or more whitespace characters. So the simple script…

BEGIN {
    RS = "[[:space:]]+"
}

{
    print
}

…outputs…

I
searched
http://www.google.com
and
http://bing.com,
and
found
the
page
https://en.wikipedia.org/wiki/AWK_(disambiguation).
<a
href="https://unizin.org">Unizin</a>
[link](http://example.com)

Without the BEGIN pattern/action, it would have simply copied the input to the output. (An action without a pattern is run on every record.)

This record separator customization is not even an option in, say, Python. To iterate through a file a “word” at a time, you would have to read the file a line at a time, split each line into words, and then iterate over those words. Moreover, that approach assumes that your input file even has newline characters in it. If it doesn’t, Python will read the entire file into memory. You can avoid this behavior with custom code or third-party libraries, but the implementation of the record/field/pattern paradigm in AWK makes this customization much more natural.

The next step is to match the input record against a pattern that is likely to detect URLs. Much has been written online about the topic of using regular expression patterns to match URLs, but having a more fully-featured language like AWK means that we can use regular expressions to find URL candidates, and then apply additional logic to ensure that we don’t include any extra characters. Let’s take a look at a regular expression we can use to match URLs:

/(https?:\/\/[a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]+)[^a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]?/

This isn’t intended to be a post about regular expressions, so a summary in prose will have to do. The above pattern matches records that contain http:// or https://, and the first group of the match will contain all remaining characters allowed in URLs up to the first character that is not allowed in URLs.

Let’s put that regular expression into a pattern in the AWK program. The following code…

BEGIN {
    RS = "[[:space:]]+"
}

/(https?:\/\/[a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]+)[^a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]?/ {
    print
}

…when run on our sample input now returns only the records that contain URLs:

http://www.google.com
http://bing.com,
https://en.wikipedia.org/wiki/AWK_(disambiguation).
href="https://unizin.org">Unizin</a>
[link](http://example.com)

We’ve thrown out all of the records that don’t contain a URL. In order to throw out the characters that don’t belong in a URL, let’s start by selecting only the portion of our record that matches the (https?:\/\/[a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]+) group. Standard AWK does not capture groups in regular expressions, but GNU Awk does if we use its match() function:

BEGIN {
    RS = "[[:space:]]+"
}

match($0, /(https?:\/\/[a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]+)[^a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]?/, result) {
    print result[1]
}

The above modification to the program stores the groups in a variable called result, and only prints the URL group. On our input, it will return:

http://www.google.com
http://bing.com,
https://en.wikipedia.org/wiki/AWK_(disambiguation).
https://unizin.org
http://example.com)

Almost there! What’s left to fix are the URL-legal characters that are highly unlikely to be part of the URL, such as commas, periods, and close parentheses that don’t have an open parenthesis earlier in the URL. We’ll add all these special cases in a switch statement:

BEGIN {
    RS = "[[:space:]]+"
}

match($0, /(https?:\/\/[a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]+)[^a-zA-Z0-9\-\._~:\/?#\[\]@!$&'()*+,;=]?/, result) {
    url = result[1]
    # remove punctuation that usually appears in plaintext
    ending_char = substr(url, RLENGTH)
    switch (ending_char) {
    case ".":
    case ",":
        url = substr(url, 1, RLENGTH - 1)
        break
    case ")":
        if (!index(url, "(")) {
            url = substr(url, 1, RLENGTH - 1)
        }
        break
    default:
        break
    }
    print url
}

The cleanup code relies on the AWK substr() function to pull out a portion of the value in url. Note that substr() counts characters starting at 1, so substr(url, 1, RLENGTH - 1) keeps everything except the final character.
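
As a standalone illustration (a throwaway snippet, not part of the extractor; it uses length(url) in place of RLENGTH because there is no match() call here), this is how those substr() calls behave on one of our sample records:

BEGIN {
    url = "http://bing.com,"                  # a record whose matched URL ends in a comma
    print substr(url, length(url))            # prints ","  (the final character)
    print substr(url, 1, length(url) - 1)     # prints "http://bing.com"
}

When we use this version of our code, we get the following output: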

http://www.google.com
http://bing.com
https://en.wikipedia.org/wiki/AWK_(disambiguation)
https://unizin.org
http://example.com

Success! Now, let’s use this on the “big” Wikipedia article data. Since we don’t know how long this will take, let’s first sanity-check the first ten matches by piping the output through the Unix head utility, which effectively stops the program after ten lines:

bash-3.2$ gawk -f ~/urlfind.awk enwiki-20140707-pages-articles.xml | head -n 10
http://www.mediawiki.org/xml/export-0.8/
http://www.w3.org/2001/XMLSchema-instance
http://www.mediawiki.org/xml/export-0.8/
http://www.mediawiki.org/xml/export-0.8.xsd
http://en.wikipedia.org/wiki/Main_Page
http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html
http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin__Anarchism__its_philosophy_and_ideal.html
http://www.theanarchistlibrary.org/HTML/The_Anarchist_FAQ_Editorial_Collective__An_Anarchist_FAQ__03_17_.html#toc2
http://www.marxists.org/archive/malatesta/1930s/xx/toanarchy.htm
http://www.theglobeandmail.com/servlet/story/RTGAM.20070514.wxlanarchist14/BNStory/lifeWork/home/

Looks realistic! Let’s run it on the whole file. (Warning: you must have over two gigabytes free to store the results. Yes, the URLs in Wikipedia alone are more than 2GB of data.)

Using All Available Resources

If you run this GNU Awk script against the Wikipedia data, it will likely take a very long time to complete. On my MacBook Pro, I used the time command to find out exactly how long:

$ time gawk -f ~/urlfind.awk enwiki-20140707-pages-articles.xml > ~/wikipediaurls.txt
real    159m30.769s
user    147m25.364s
sys     1m7.527s

Two hours, forty minutes. That’s a frustratingly long time to wait. There is some hope of performance improvement, though: GNU Awk is not multithreaded, so all but one of my computer’s CPU cores sit idle while this code runs. Since I don’t care about the order in which URLs come out, and detecting a URL in one input record doesn’t depend on the contents of previous records, this problem is embarrassingly parallel.

Luckily, I don’t need to rewrite anything in order to use all of the CPU available to me. In fact, all I need to do is install GNU Parallel and read enough of its documentation to use it successfully with my script. GNU Parallel automatically queries your system’s hardware capabilities and sequences jobs with varying workloads more efficiently (see the author’s USENIX paper for an overview):

$ time parallel --pipepart --round-robin gawk -f ~/urlfind.awk :::: enwiki-20140707-pages-articles.xml > ~/wikipediaurls.txt
real    46m13.778s
user    326m15.683s
sys     5m56.697s

The “--pipepart” flag allows parallel to read data from disk as fast as possible, and “--round-robin” allows it to use a more efficient job allocation strategy because the order of the output doesn’t matter to me. From over two and a half hours to 46 minutes! My notebook computer’s fan was loud and the machine heated up, and all of this extra energy expenditure paid off pretty well in terms of time to completion.

This dramatic difference illustrates that it needn’t take a lot of infrastructure maintenance and conceptual overhead to significantly improve processing throughput. If I use the distributed computing features of GNU Parallel, I can even enlist the other computers around my home to extract the data still more quickly—all without changing the original simple script. We took a little text processing language from the 1970s and a simple command-line tool and scaled the same code up to “big” data with everyday items found around the home!
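
Here is a sketch of what that might look like (not something I timed for this article): GNU Parallel’s --sshlogin/-S option takes a list of machines to spread jobs across, where “:” means the local machine and user@desktop.local is a placeholder for a real host that already has gawk installed and a copy of urlfind.awk at the same path:

$ parallel --pipepart --round-robin -S :,user@desktop.local gawk -f ~/urlfind.awk :::: enwiki-20140707-pages-articles.xml > ~/wikipediaurls.txt

Whether this actually beats the single-machine run depends on how quickly chunks of the 46-gigabyte file can be shipped across the home network, which is exactly the data-transfer cost called out at the beginning of this article.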

Too Long, Didn’t Read

Hopefully the examples here have helped you think about your own “big” data problems—are they really that “big?” Do you need a big cluster of machines or expensive consultants to process them? Or can applying some simple concepts and tools from system administration help save you some time and money?