The Most Important Regular Expression for Parsing HTML

It is hard to overstate the usefulness of regular expressions (regexes). They are especially well suited to parsing documents. Many simple HTML and XML parsers rely on regexes to extract key information from a document, such as the name of a tag and whether the tag being examined is well-formed, empty, or malformed, by checking each tag against a set of rules. Of all the regular expressions an HTML parser needs, the most important and most complex is the one that matches the start tag of an element.

There are many regular expression solutions for parsing HTML tags and attributes. The most popular one (of those I’ve seen on the Internet so far) is something like this: "<(\/?)(\w+)[^>]*(\/?)>". Because [^>]* cannot cross a right angle bracket, a match never runs past the first ">" after the tag name, and the pattern matches both start and end tags. For example, it will match a "<pre>" and a "</pre>". It will also match a "<br/>", which is an empty element. This regex is inadequate because it fails to consider several cases: How would the parser know whether the tag is an empty, block, or inline element? How would it know whether the tag has attributes, and how would it handle them? An overly simplistic regex like this one cannot answer these questions.
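To make those shortcomings concrete, here is a minimal sketch that runs the simple pattern quoted above against a few tags. The names simpleTag and describeSimple are made up for this illustration; only the pattern itself comes from the post.

```javascript
// The simple pattern quoted above, written as a JavaScript regex literal.
const simpleTag = /<(\/?)(\w+)[^>]*(\/?)>/;

// Return the three captured groups for the first tag found in `html`,
// or null when nothing matches. (Hypothetical helper for illustration.)
function describeSimple(html) {
  const m = simpleTag.exec(html);
  if (!m) return null;
  return { closingSlash: m[1], name: m[2], trailingSlash: m[3] };
}

console.log(describeSimple("<pre>"));
console.log(describeSimple("</pre>"));
console.log(describeSimple("<br/>"));

// The attributes are swallowed by [^>]* and never captured.
console.log(describeSimple('<a href="x.html">'));

// A ">" inside a quoted value ends the match too early.
console.log(simpleTag.exec('<a title="a>b">')[0]);
```

Two things stand out when this runs. For "<br/>" the third group comes back empty, because [^>]* has already consumed the slash, so the pattern cannot actually tell an empty element from an ordinary start tag. And for the last tag the match stops at the ">" inside the quoted attribute value, so the reported tag is cut short.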

I found a far more capable regular expression for parsing HTML tags and attributes: /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/. This regex makes it easy not only to identify empty tags but also to parse the attributes in those tags. This intelligent piece of regex comes from “HTMLparser.js”, a simple HTML parser written by John Resig. At first sight it seems complicated, but if you look more closely you will notice that it simply extracts the necessary information about the tag: its name, its attributes, and its type (empty or not).

BREAKDOWN OF THE REGEX

There are three main groups:

  • The first group – (\w+) – captures the name of the tag being examined.
  • The second group is the largest of the three. It contains sub-groups, all of which are non-capturing (a group that starts with ?: is matched, but its text is not available separately after the match). This second group matches the attributes in a tag, if there are any. It accepts double or single quotes around an attribute’s value, and it also handles the case where there are no quotes around the value: [^>\s]+ . Because that character class excludes ">", an unquoted value can never run past the tag’s closing angle bracket.
  • The third group is (\/?) and it simply checks whether the tag being examined is an empty element (like <br/> and <hr/>).
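The three groups described above can be seen at work in a short sketch. The start-tag pattern is the one quoted in this post. The helper parseStartTag and the second pattern, attribute, are assumptions added for illustration, not code from Resig's parser.

```javascript
// Start-tag pattern quoted in this post.
const startTag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/;

// Assumed helper pattern: splits the raw attribute text (group 2)
// into name/value pairs. Not taken from the original parser.
const attribute = /(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^>\s]+)))?/g;

// Parse a start tag at the beginning of `html`, or return null.
function parseStartTag(html) {
  const m = startTag.exec(html);
  if (!m) return null;
  const attributes = {};
  for (const a of m[2].matchAll(attribute)) {
    // Double-quoted, single-quoted, or unquoted value; "" when absent.
    attributes[a[1]] = a[2] ?? a[3] ?? a[4] ?? "";
  }
  return { name: m[1], attributes, selfClosing: m[3] === "/" };
}

console.log(parseStartTag("<br/>"));
console.log(parseStartTag(`<img src="a.png" alt='x' width=10 />`));
console.log(parseStartTag('<a title="a>b">'));
console.log(parseStartTag("</pre>")); // end tags are not start tags
```

One caveat worth knowing: an unquoted value written right before the slash, as in <img width=10/>, absorbs the slash into the value, so the tag is not reported as empty. Quoting the value or leaving a space before the slash avoids this.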

Why Vidmuster? How to Use It to Get More Information about Videos

Why did I create Vidmuster.com?

I use YouTube a lot. Many times I have wished there were a way to get more information about a particular video. I frequently went to Wikipedia or Google to find out more. Before I knew of Lady Gaga, for instance, I would watch “Alejandro” and then google why she sang the song and in what context. As I did this more and more often, I decided to create a website that gathers all the information about a video and presents it to the user in one place. That’s how vidmuster.com was born (mustering videos alongside information).

I created vidmuster.com mostly for educational reasons and absolutely not for pecuniary ones. Building the site was a way to familiarize myself with various technologies, including but not limited to HTML5, CSS2 and CSS3, advanced JavaScript, and advanced jQuery (with extensive use of plugins), as well as the creation of some custom jQuery plugins to power certain features (like the accordion video menu the user can use to manage playlists). It also gives me (and others who work on the site now or will in the future) a way to practice software development and planning as we update, improve, and even revamp the site; in the process we should, almost incidentally, become pros in our various fields, able to face the technological problems on the horizon. Ok, enough!

A succinct list of the site’s features:

  • YouTube instant search: This is not the main feature; it has already been implemented by feross.
  • News and tweets instant search alongside the YouTube instant search: The app gets information about a video (the current search term) from Freebase, Google News, and Twitter via their REST APIs.
  • Playlist management: I felt video playlist management goes hand in hand with video viewing because users will occasionally want to save their searches and go back to them or share them.
  • Sharing videos: The user has to be logged in to share videos and/or playlists. Individual videos on a playlist can be shared by clicking the share button next to the video thumb in the playlist accordion area. Videos can be shared with friends via Facebook, Twitter, or Google Buzz.
  • Sharing playlists: The user can share whole playlists by sharing their unique SEO-friendly URLs via Facebook, Twitter, or Google Buzz, or just manually (copying and pasting them into emails, forums, chats, etc.).
  • Facebook OAuth 2.0 (authentication): The user is able to log in via Facebook. Easy. Fast. Reliable enough. We use your email only to identify you uniquely, as Facebook does. Nothing else. We don’t care about what friends you have. And we don’t plan to send you a confirmation email on registration.
  • And other features that I can’t readily remember or that are either not that important or somewhat inconspicuous.

How to use some Features

How to use the Instant Search Feature

To search for videos on the site, just use the search bar. Use the links in the Freebase, Google News, and Twitter sections to get back to the original full articles and tweets about the current search term.

How to use the Playlists

A major part of the app lies in the playlist functionality. It lets the user save videos for later reference or to share with friends. Just drag a thumb from the Search Trunk (each thumb represents a video) and drop it on a playlist you’ve made to the right of the Trunk. That’s it! The video will be added to that playlist. Should you want to delete a video from the playlist, just press the delete button. To add or delete whole playlists, use the playlist lightbox: click the manage button on the header bar, and you will see a lightbox in which you can add, delete, and share playlists.

At the moment, there isn’t too much functionality. But we’re growing and will continue to maintain the site, adding more functionality over time and, where necessary and useful, cleaning up redundant features, beautifying the site, and securing it further.

Please support this site by visiting it. Here are some interesting things I have found out about pop culture since I started using the site:

  • Lady Gaga wore a meat dress (a lot of people already saw it, but I hadn’t). Check it out now and see what others are tweeting about Lady Gaga or her meat dress: Lady Gaga at Vidmuster.
  • Eminem has a total of 13 Grammy awards, and he is the first rapper to win Best Rap Album for three consecutive LPs.
  • Rihanna is Barbadian. Maybe a lot of people already knew this, but I didn’t.
  • And much more introductory, fun, intriguing, and sometimes pesky news.

Thanks.

“Everyone’s got an API. I want mine!”

First of all, for the record, I didn’t say “Everyone’s got an API. I want mine!”; it’s just the title of this post. Secondly, that is the wrong frame of mind for anyone to have about anything, because it’s not good to start building something if your primary motivation is to be “second best.” But I shouldn’t stray from the main topic. I’m writing to shed some light on the API business. I’m mainly concerned with REST web services because REST is by far the most popular and useful kind of web service out there. It is especially popular among JavaScript programmers because JSON is one of the formats in which REST APIs transfer state. SOAP messages, by contrast, are always XML; SOAP does not support JSON!

Some people might have heard of SOAP and REST web services but be puzzled about what the two really provide and which to choose for a specific problem. SOAP stands for Simple Object Access Protocol, while REST stands for Representational State Transfer. So that I don’t bore you with too many technical details, I will just say that the main practical difference between the two is that a REST service can ship its data in several formats, such as JSON, XML, and (sometimes) YAML, while SOAP ships in, and depends on, XML as its message format. Applications that implement a REST architecture are usually said to be RESTful.
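As a rough illustration of that difference, the sketch below puts the same record into a JSON body, as a REST service might return it, and into a SOAP 1.2 envelope. The record and its field names are invented for this example; only the envelope namespace is the standard SOAP 1.2 one.

```javascript
// Hypothetical record, invented for this example.
const record = { id: 42, title: "Alejandro" };

// What a REST service might send back as JSON.
const restBody = JSON.stringify(record);

// The same record wrapped in a SOAP 1.2 envelope (always XML).
const soapBody =
  '<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">' +
  "<soap:Body>" +
  "<video><id>42</id><title>Alejandro</title></video>" +
  "</soap:Body>" +
  "</soap:Envelope>";

// JSON turns into a JavaScript object with one built-in call.
const video = JSON.parse(restBody);
console.log(video.title);

// Compare the size of the two messages in characters.
console.log(restBody.length, soapBody.length);
```

The JSON body becomes a usable object with a single JSON.parse call, which is much of the appeal for JavaScript programmers. The XML version would first have to go through an XML parser (DOMParser in a browser, for instance) before its fields could be read.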

Who uses REST, and why do people use it? A lot of companies! All of Yahoo’s web services use REST, including Flickr. The del.icio.us API uses it, as do pubsub, Bloglines, and Technorati, and both eBay and Amazon offer REST web services. Here are the defining qualities that make REST stand out:

  • Lightweight – excludes much of the redundancy that accompanies SOAP.
  • Easy to build – no special toolkits required.

Now let’s get to the real business: how do you build a REST API? (I hope you are convinced by now that REST is easier to produce and use than SOAP.) I could write pages on how to write a REST API using WSDL (Web Services Description Language) or a server-side language like PHP or Ruby, but I would be reinventing the wheel by doing so. Instead, I have a list of tutorials that I believe people can benefit from, just as I did, to learn how to create their own REST APIs. When following these tutorials, take note of the helper libraries they use, like SoapServer in PHP. It is tedious, slow, and unwise to build your own library from the ground up without any helper libraries. Take a look at this list of resources on API creation and development:

Ok, that’s it, folks! Follow more than one of these tutorials (and make sure you watch the Google tech video; it’s priceless!).