Friday, September 5, 2008

A One-Line Code on How To Parse a URI

Uniform Resource Identifiers(URI) is perhaps synonymous to the Internet itself. As the name implies it is used to identify something on the net. Typically a URI is a different than a URL(Uniform Resource Location) but in most cases we use them interchangeably.

Mathematically URI=URL+URN, which means an entity can be identified either by its location(URL) or its name(URN). See the relevant Wikipedia page for details.

Every URI has the following syntax:
(scheme)://(authority)(/path)?(query)#(fragment)
with query and fragment being optional. So in a URI like http://www.example.com/comment?id=1#1123, we have:

scheme="http"
authority="www.example.com"
path="/comment"
query="id=1"
fragment="1123"

A very common task is when we have to parse a URI into its components. The most usual case is URL normalization, which is essential for search crawlers. During this process, a URI is trinsformed into an equivalent canonical form, such that 2 different canonical forms cannot refer to the same resource. For example the URL http://www.example.com and the URL http://www.example.com:80 , are different but refer to the same location on the net. This greatly helps spiders retrieving every time pages they have not visited before. You can find a good article about URL normalization here.

To parse a URI using a regular expression is really easy. In fact the method to do this is given by Tim Berner's Lee (photo) himself in the RFC for the URI standard. Using Perl, if $uri has the URI then by applying
$uri =~ /^(([^:\/\?#]+):)?(\/\/([^\/\?#]*))?([^\?#]*)(\?([^#]*))?(#(.*))?/;

Then:
scheme=$2
authority=$4
path=$5
query=$7
fragment=$9

That's it! All languages (C#, Java etc) has an almost identical syntax and it is straightforward to port it there given that $n refers to the n-th group.