Friday, September 5, 2008

A One-Line Code on How To Parse a URI

The Uniform Resource Identifier (URI) is perhaps synonymous with the Internet itself. As the name implies, it is used to identify something on the net. Strictly speaking, a URI is different from a URL (Uniform Resource Locator), but in most cases we use the two terms interchangeably.

Informally, URI = URL + URN, which means an entity can be identified either by its location (URL) or by its name (URN). See the relevant Wikipedia page for details.

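Every URI has the following syntax:
scheme:[//authority]path[?query][#fragment]
with query and fragment being optional. So in a URI like foo://example.com:8042/over/there?name=ferret#nose (the example given in RFC 3986), we have: scheme foo, authority example.com:8042, path /over/there, query name=ferret and fragment nose.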


A very common task is parsing a URI into its components. The most usual case is URL normalization, which is essential for search crawlers. During this process, a URI is transformed into an equivalent canonical form, such that two different canonical forms cannot refer to the same resource. For example, two URLs that differ only in the letter case of the host name are different strings but refer to the same location on the net. This helps spiders avoid retrieving pages they have already visited. You can find a good article about URL normalization here.
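As a rough illustration of the normalization idea, here is a minimal Python sketch. It only handles a few of the usual rules (lowercasing the scheme and host, dropping the default port, using "/" for an empty path) and assumes plain http/https URLs without user info or IPv6 hosts. The function name normalize and the DEFAULT_PORTS table are made up for this example; a real crawler would apply more rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of default ports, limited to the two schemes we assume.
DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    """Return a simplified canonical form of an http(s) URL (sketch only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    netloc = host
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"
    # An empty path and "/" refer to the same resource.
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

a = normalize("HTTP://Example.COM:80")
b = normalize("http://example.com/")
```

With these rules, both a and b end up as the same string, so a crawler keeping a set of visited canonical URLs would fetch the page only once.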

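Parsing a URI with a regular expression is really easy. In fact, the method is given by Tim Berners-Lee himself in the RFC for the URI standard (RFC 3986, Appendix B). Using Perl, if $uri holds the URI, then after applying the following match, $2 holds the scheme, $4 the authority, $5 the path, $7 the query and $9 the fragment: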
$uri =~ /^(([^:\/\?#]+):)?(\/\/([^\/\?#]*))?([^\?#]*)(\?([^#]*))?(#(.*))?/;


That's it! Most languages (C#, Java, etc.) have an almost identical regular expression syntax, so it is straightforward to port the expression, given that $n refers to the n-th capture group.
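To show what such a port might look like, here is a small Python sketch using the same expression from RFC 3986, Appendix B. The names URI_RE and parse_uri are just illustrative, and the sample URI is the one used in the RFC; groups 2, 4, 5, 7 and 9 are the scheme, authority, path, query and fragment.

```python
import re

# The regular expression from RFC 3986, Appendix B. Every part is optional,
# so it matches any input string.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def parse_uri(uri):
    """Split a URI into its five components; absent ones come back as None."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = parse_uri("foo://example.com:8042/over/there?name=ferret#nose")
```

Note that the expression only splits the string; it does not validate it. Components that are missing from the input (for example the query) are returned as None, while the path is always a string, possibly empty.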