Cardinality Estimation in Linear Time using Sub-Linear Space

sizematters

The cardinality of a collection A (which might be an ordered or unordered list, a set, or what not) is basically the number of unique values in A. For example, the collections [1,2,3,4] and [1,2,1,3,1,4,3] have the same cardinality of 4 (and also correspond to the same set).

Determing the Cardinality of a Collection: The Naive Approach

Consider a collection A=[1,2,1,3,1,4,3]. How can we systematically determine the cardinality of A? Well, here are two of many ways to do this:

  1. First sort A in ascending order. Then, we can perform a linear scan on A to remove duplicates. It’s pretty easy to see how this can be done. Finally, return the size of the possibly trickled-down collection (now a set) obtained. If the initial size of A is n. Then, the cardinality of A, using this method, can be determined in O(n\, log\, n) (if we use merge-sort) time and O(1) extra space.
  2. Use a hash table: Perform a linear scan of A, hashing the values of A. It’s easy to see that cardinality of A is the number of keys in the hash table obtained. This uses O(n) time but O(n) extra space also.

Notice that we can’t do any better (lower upper-bound) than O(n) because we have to look at the entire input (which is of size n). But can we determine the cardinality of A in O(n) time using sub-linear space (using strictly smaller space than n)?

That’s where probability comes in.

Linear Probabilistic Counting

This is a probabilistic algorithm for counting the number of unique values in a collection. It produces an estimation with an arbitrary accuracy that can be pre-specified by the user using only a small amount of space that can also be pre-specified. The accuracy of linear counting depends on the load factor (think hash tables) which is the number of unique values in the collection divided by the size of the collection. The larger the load factor, the less accurate the result of the linear probabilistic counter. Correspondingly, the smaller the load factor, the more accurate the result. Nevertheless, load factors much higher than 1 (e.g. 16) can be used while achieving high accuracy in cardinality estimation (e.g. <1% error).

Note: in simple hashing, the load factor cannot exceed 1 but in linear counting, the load factor can exceed 1.

Linear counting is a two-step process. In step 1, the algorithm allocates a bit map of a specific size in main memory. Let this size be some integer m. All entries in the bit map are initialized to “0”‘s. The algorithm then scans the collection and applies a hash function to each data value in the collection. The hash function generates a bit map address and the algorithm sets this addressed bit to “1”. In step 2, the algorithm first counts the number of empty bit map entries (equivalently, the number of entries set to “0”). It then estimates the cardinality of the collection by dividing this count by the bit map size m (thus obtaining the fraction of empty bit map entries. Call this V_n).

Plug in V_n and m into the equation r = m\, log_e\, V_n to obtain r which is the estimated cardinality of the collection. The derivation of this equation is detailed in this paper[1].

Here’s my simple implementation of the Linear Probabilistic Counting Algorithm.

Errors in Counting

Although the linear probabilistic counting method is faster than deterministic approaches, it might sometimes fail to be accurate as explained above. So this method should be used only when 100% accuracy is not needed. For example, in determining the number of unique visitors to a website.

Probabilistic Alternatives to Linear Probabilistic Counting

The HyperLogLog algorithm is such an alternative. It also runs in linear time (linear in the size of the initial collection). But the HyperLogLog counter usually gives a more accurate estimation of the cardinality count and also uses less space. See this paper for more details on the HyperLogLog counter.

References

[1] A Linear-time Probablistic Counting Algorithm for Database Applications 

HTML Parsers: The Journey to a More Semantic yet Forgiving Web

HTML5 is the next major revision of the html standard. If all works well, it should become the dominant markup in the nearest future ousting both html4 and xhtml1 from their cozy locations. A lot of people say HTML5 is the next big thing. In some sense, yes. But in another no. HTML5 isn’t another different markup language. It’s a specification that adds on to and removes some features from the already existing specifications for html4. It’s the next big thing in that it’s going to change the way we markup our html pages; it’ll add more meaning to elements making html pages more semantic. Apart from making the web more semantic html5 will also standardize a lot of features across major browsers. Finally, there’s going to be some elements that all browsers will implement and it would hopefully function the same way across these browsers. No browser will be left out including IE. Now, the ie6 death count down might even run faster. Check out the ie6 count down website at: http://www.ie6countdown.com/. Ok, that’s html5. What of xhtml1 and html4? Do they still exist and will they still exist? They still hang around and will for a while until all the browsers are standardized and old browser start to weather off.

All the html (and xhtml1) standards have parsers implemented in most non-trivial languages used frequently on the web to power web applications. There are xhtml1 and html4 parsers implemented in php, ruby, c++, and others. Most parsers use the libxml library in c to build and traverse the dom. It’s made for xml so the parser is very strict. The documentation and code for Libxml lives at: http://xmlsoft.org/. So libxml is appropriate for parsing xml but not for parsing the transitional versions of any html or xhtml. It’s not even appropriate for html5. HTML5 allows for some laxity on the side of the developer. That’s why there are parsers made specifically to parse HTML5 and no xhtml or html4 parser can appropriately parse HTML5. It’s different and COOL!!

HTML5 includes some new tags in its spec: <article>, <aside>, <header>, <section>, to mention a few. All these tags make html pages more descriptive, correspondingly making the web more semantic. These new tags will make development and deployment of web bots easier because now web bots can identify the different parts of a page and know what the data contained within the different page elements represent. They’ll now know if a writing is an article (stand-alone), if it’s just tangentially related to it’s surrounding content, if it’s a header section, and even how to outline the headers (using hgroup), and so on. For more information on the semantics of the new html5 tags and their use, please see Dive into HTML5. I think it’s a really practical and non-trivial guide to the new and emerging HTML5 specification. These new tags alone could throw the already existing parsers for html4 and xhtml off the edge. But there’s more complication to the work html5 parsers must handle. HTML5 is so FORGIVING! The <head>, <html>, and <body> tags which were required in the previous html specifications are now IMPLIED! That means that your web html5 page need not use these pivotal tags at all. You can have a page that looks like this:

<li>My boy is coming

The above markup is represented the same way in dom as this:
<html>
<head>
</head>
<body><li>My boy is coming</li>
</body>
</html>

THOSE PIVOTAL ALL-IMPORTANT CAN’T-DO-WITHOUT TAGS ARE NOW IMPLIED.
It seems like a mess but it’s not. Since every page will have to have these tags why not just help the author of the page define those tags as the page is read into the DOM? HTML5 parsers have to handle this situation. Soon, you’ll see how the HTML5 parsers handle these weirdo syntax.

HTML5 parsing has been detailed by the WHATWG group to help authors of HTML5 parsers build effective parsers. The group explains how to parse html5 documents. This document is very important to html parser authors and should be used progressively since the html5 specification is still changing (will not fixed for some time I assure you). This document is so detailed. It veers into the input stream and even the character encodings that the html5 parser should be able to handle. That’s why it didn’t take long for html5 parsers to spring up. One prominent suite of html5 parsers implemented in multiple languages is the HTML5lib. There are currently python, php, and ruby implementations available for download. Only the python version, though, is still being actively developed. The ruby version is dead while the php version is still in it’s alpha release state, I think. On the other hand, the html5lib has been ported to javascript/node.js. But this seems to be an event-driven parser. So it might be a SAX (Simple API for XHML) and not a DOM parser (which most people use). The SAX Model is more compatible with node.js since node.js is also event-driven. But event-driven parsers usually stop when malformed syntax is encountered. Errors discourage html authors who most of the time don’t know what the standard is (might change tomorrow). And there’s another port for java programmers (expected, right?); it’s called Validator.nu HTML parser and it contains SAX, DOM, XOM API’s (jack of all trades). There’s no port to functional languages like clojure yet. Awwww… Maybe I’ll port it to clojure.

You might ask: what about HTML5 validation? HTML5 validation isn’t really necessary anymore due to the most forgiving syntax of HTML5. What’s there to validate when your page does not even need a root tag?

Some time ago, I was playing with some HTML Parsers and comparing how these parsers handle malformed html syntax. My tests were entirely written in php. I fed some malformed syntax to DOMDocument, HTML5lib, and the php simple dom parser. The PHP simple dom parser is basically the DOMDocument PHP parser on steroids. The Simple dom parser allows for easy traversal of the dom. For example, suppose, that you want to find all the image elements in an HTML page. Using DOMDocument library in php, you would write something like document.getElementsByTagName("img"); which returns a NodeList of the image Nodes in the DOM representation of your just created html document. Using the simple html dom, you can do this: $html->find('img') where $html is the root of your document as in document.documentElement == $html

I cannot show all but some part of the whole rundown and results of tests I ran on the html parsers. I wouldn’t show you in exact terms either. I had some strings containing some well-formed html4-transitional, xhtml-transitional, html5 as well as malformed versions of the aforementioned markups.

$first = "<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>";
$second = "<li>Tell me all about ya!";
$third = "<body>
<p>I was with you</p>
</body>";

Then I ran these html markup through the html5 parser and the DOMDocument library in PHP like this:

$dom1 = new DOMDocument("1.0", "utf-8");
$dom1->formatOutput = true;
$dom1->loadXML($first); // did the same thing for $second and $third
echo $dom1->saveXML();
echo "\n";

The above code creates a DOMDocument in XML so the $first and $third html strings pass the test since they are both
valid xml but the $second is malformed. Complains that the <li> tag  isn’t matched by its closing tag.


$dom3 = DOMImplementation::createDocument(null, 'html',
DOMImplementation::createDocumentType("html",
"-//W3C//DTD XHTML 1.0 Transitional//EN",
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"));
$dom3->formatOutput = true;


$html = $dom3->documentElement;
// set up $html strings
$html->loadHTML($first); // did the same for $second and $third too.
echo $html->saveHTML();

The above snippet creates a document in xhtml1 transitional.
Results:
$first is valid xhtml1
$second — helps close the <li> tag
$third — valid xhtml1 transitional. Puts a <html>  tag but no <head>  tag present.


$dom2 = DOMImplementation::createDocument(null, 'html',
DOMImplementation::createDocumentType("html",
"-//W3C//DTD HTML 4.01 Transitional//EN",
"http://www.w3.org/TR/html4/loose.dtd"));
$dom2->formatOutput = true;


$html = $dom2->documentElement;
// set up $html strings
$html->loadHTML($first); // did the same for $second and $third too.
echo $html4_document->saveHTML();

The above snippet creates a document in html4 transitional.
Results:
$first is valid.
$second – since it’s transitional it just helps close the <li> tag
$third – puts the <html> (root) element in the dom but does not put the <head> element. Valid html4 transitional


$dom1 = HTML5_Parser::parse($first); // do the same for $
echo $dom1->saveHTML();
echo "\n";

The above snippet uses the HTML5 parser to parse the html strings: $first, $second, $third.
Results:
$first , $second, $third passes.
It closes the <li> tag as needed and puts in the <html>, <head>,<body> elements in the DOM when absent.
Nice!

The tests I ran were a more than this and more complex. I stripped some details to make the tests I ran easier to comprehend. Plus, I ran it on the command line. That’s why I use new lines to demarcate individual tests instead of <br> tags.

So we love html5. It’s forgiving. It’s modern. It might eventually replace flash. It’s already on our iphones and smart phones and is implemented in all recent versions of major browsers (including ie). We don’t need to validate our pages again because we know the built-in browser parsers won’t spew out errors (good or bad thing? You be the judge…). We can start using it right away even on older browsers (we can just use modernizr and HTML5shivs to detect if some html5 features are present in a browser). There are tools out there to help us handle old browser! Ain’t that great? We’ve already started our tortuous journey to a more semantic yet forgiving web!

Most Important Regular Expression for parsing HTML

The usefulness of Regexes (Regular Expressions) is ineffable. Especially in parsing documents, it’s a well-suited and indispensable tool. All good HTML and XML Parsers basically use Regexes to extract cardinal information in HTML documents like names of tags, whether the tag being examined is well-formed, empty, or even malformed by checking the tags against a bunch of rules. Of all the regular expressions needed by a HTML parser, the most important/complex is the one that matches the start tag of an element.

There are a lot of Regular Expression solutions on how to parse HTML tags and attributes. The most popular one (on the Internet that I’ve seen so far) is something like this: "<(\/?)(\w+)[^>]*(\/?)>." This is a non-greedy regular expression that matches both a start and an end tag. For example, it will match a "<pre>" and a "</pre>." It will also match a "<br/>" which is an Empty element. This Regex is so inefficient because it fails to consider a bunch of cases: How would the parser know if the tag is an empty, block, or inline element? How would it know if the tag has attributes and how would it handle these attributes? These questions cannot be answered by using this overly simplistic Regex.

I found a very efficient Regular Expression to parse HTML tags and attributes: /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/. This Regex makes it easier to not only identify empty tags but also parse the attributes in these tags.This intelligent piece of Regex was in John Resig’s “HTMLparser.js”–a simple HTML parser made by John Resig. At first sight, it seems complicated. But if you look more closely into this Regex, you will notice it simply extracts necessary information about the tag: its tag name, its attributes, and its type (empty or not).

BREAK DOWN OF THE REGEX

There are three main groups:

  • The first group – (\w+) – captures the name of the tag being examined.
  • The second group is the largest of the three. It contains sub-groups, most of which are non-matching (groups with ?: are non-matching which means that information about the group wouldn’t be available after the group is matched). This second group matches the attributes in a tag (if there exists any). The group checks for quotes (double or single) around an attribute’s value. It also handles the case where there are  no quotes around the attribute’s value: [^>\s]+ . This also makes sure that the regular expression matching is non-greedy ([^>]* prevents capturing more than one right-angled bracket).
  • The third group is (\/?) and it simply checks if the tag being examined is an empty element (like <br/>’s and <hr/>’s).