Mathematically URI=URL+URN, which means an entity can be identified either by its location(URL) or its name(URN). See the relevant Wikipedia page for details.
Every URI has the following syntax:
(scheme)://(authority)(/path)?(query)#(fragment)with query and fragment being optional. So in a URI like http://www.example.com/comment?id=1#1123, we have:
A very common task is when we have to parse a URI into its components. The most usual case is URL normalization, which is essential for search crawlers. During this process, a URI is trinsformed into an equivalent canonical form, such that 2 different canonical forms cannot refer to the same resource. For example the URL http://www.example.com and the URL http://www.example.com:80 , are different but refer to the same location on the net. This greatly helps spiders retrieving every time pages they have not visited before. You can find a good article about URL normalization here.
To parse a URI using a regular expression is really easy. In fact the method to do this is given by Tim Berner's Lee (photo) himself in the RFC for the URI standard. Using Perl, if $uri has the URI then by applying
$uri =~ /^(([^:\/\?#]+):)?(\/\/([^\/\?#]*))?([^\?#]*)(\?([^#]*))?(#(.*))?/;
That's it! All languages (C#, Java etc) has an almost identical syntax and it is straightforward to port it there given that $n refers to the n-th group.