Thursday, September 25, 2008

Donald Knuth And The Complexity Of Songs

Donald Knuth is a living legend among computer scientists. His monumental work-of-life "The Art Of Computer Programming" (I was never able to fully follow it) is a standard reference/textbook/work of art for computer science. Mr. Knuth took a new, immature geek-only hobby and transformed it into a solid and complete scientific discipline.

Among his contributions, the systematic analysis of the complexity of algorithms, really stands out. As a concept complexity is very familiar to developers. It refers to how a specific algorithm or algorithmic solution to a problem scale to the 'problem size'.

If his work is a reason to admire him, his humor is a reason to love him. Recently, I had the most splendid time reading his hilarious paper titled "The Complexity Of Songs" ! Here is the story.

A song has some lyrics which we have to remember in order to sing it. Humans are trying to learn lyrics of length (space complexity) to sing a song of length n. You would expect mathematically that s ~ n, but Knuth investigates all human 'inventions' to reduce the song space complexity!!

After a series of 'theorems', Knuth proves the existence of songs with s= a.n, a<1, s="O(sqrt(n))," s="O(logn)">and finally.....s=O(1)!! Yes, you read well. 

Knuth argues the first step was the invention of the refrain, the repeated part of a song. If the song has m verses of length and the refrain has length R, then the song length is roughly n = V.m+R.m and the complexity s = m.V+R. Hence we have a reduction of V/V+R in song complexity. So, a refrain is also a tool to save some memory space to remember the song!

Remember the old song "Old Mc'Donald had a farm, Ei-gh,Ei-gh,Oh!" ? Well this has a complexity s = O(sqrt(n)). It is much like the Greek song "Οταν θα πας κυρα μου στο παζαρι...". In this pattern, each verse includes all previous verses. For verses n = o(m^2) and s = O(m), so s = O(sqrt(n)). This means that it is much easier to remember! That's why they are most suitable for children! A same pattern can yield a log complexity.

Now to the best part, Theorem 2 is the real killer! The introductory text goes as follows:"However, the advent of modern drugs has led to demands for still less memory, and the ultimate improvement of Theorem 1 has consequently just been announced:

Theorem 2 (Donald Knuth)
There exist arbitrarily long song of complexity O(1)

PROOF. Define U = 'uh huh','uh huh' and the k-th verse Vk = 'That's the way', U, 'I like it', U . This is a constant V. Then the song V^k, completes our proof!! Oh dear!

This last one really cracked me up!!!   :)

Friday, September 5, 2008

A One-Line Code on How To Parse a URI

Uniform Resource Identifiers(URI) is perhaps synonymous to the Internet itself. As the name implies it is used to identify something on the net. Typically a URI is a different than a URL(Uniform Resource Location) but in most cases we use them interchangeably.

Mathematically URI=URL+URN, which means an entity can be identified either by its location(URL) or its name(URN). See the relevant Wikipedia page for details.

Every URI has the following syntax:
(scheme)://(authority)(/path)?(query)#(fragment)
with query and fragment being optional. So in a URI like http://www.example.com/comment?id=1#1123, we have:

scheme="http"
authority="www.example.com"
path="/comment"
query="id=1"
fragment="1123"

A very common task is when we have to parse a URI into its components. The most usual case is URL normalization, which is essential for search crawlers. During this process, a URI is trinsformed into an equivalent canonical form, such that 2 different canonical forms cannot refer to the same resource. For example the URL http://www.example.com and the URL http://www.example.com:80 , are different but refer to the same location on the net. This greatly helps spiders retrieving every time pages they have not visited before. You can find a good article about URL normalization here.

To parse a URI using a regular expression is really easy. In fact the method to do this is given by Tim Berner's Lee (photo) himself in the RFC for the URI standard. Using Perl, if $uri has the URI then by applying
$uri =~ /^(([^:\/\?#]+):)?(\/\/([^\/\?#]*))?([^\?#]*)(\?([^#]*))?(#(.*))?/;

Then:
scheme=$2
authority=$4
path=$5
query=$7
fragment=$9

That's it! All languages (C#, Java etc) has an almost identical syntax and it is straightforward to port it there given that $n refers to the n-th group.

Tuesday, July 15, 2008

Wi-Not Mobile: A Different Pocket PC Freeware Application


[This article discusses a new freeware application for Windows Mobile devices, in the creation of which I feel excited to have participated. It is called Wi-Not Mobile and can be found in this site]

You may have noticed for some time the link on the right. It points to a new mobile platform we built, a limited number of talented professionals and me. Starting with silly little ideas, it quickly grew to integrate some research results of our personal projects. The result was Wi-Not Mobile, a mobile platform for information, communication and entertainment. Wi-Not Mobile is a freeware pocket pc application and you can find it here. It is on air for a week now and with limited or now advertising has far exceeded 1000 registered users and many more downloads. Here is a small tribute to this one-year full time effort and some technical details for those who are interested.

For a quick introduction here is a Youtube video, which tries to describe Wi-Not Mobile in 4 minutes:



We have received so much feedback, both good and bad and I feel really intrigued in trying to understand how others view this application and we are so enthusiastic we really see criticism as an opportunity. It may sound typical but it is true. Here is my understanding.

First of all, Wi-Not mobile is not an application. It is a platform and there is a perfect reason for that. The platform is a more general concept and solves much more difficult problems than an application. Wi-Not's platform defines a basic API which we later use to create all different functions that exist inside. Let's iterate through the various features to show how.

Wi-Not Mobile offers free web TV streams for your Pocket PC, but also the client to consume it. If we were limited to this (like tons of other popular TV playing programs) then Wi-Not would be an application. But we are not. TV streams in Wi-Not Mobile are updated automatically whenever there is an available new version of the TV channels. The same holds for the web radio streams found in Wi-Not Mobile. This alone does make an excuse for the term 'platform' but the story is not over yet.

The Instant Messaging module of Wi-Not is a clean and efficient way to exchange messages. Of course it was not built to compete with already popular mobile IM clients (MSN Mobile and others). But there is something special inside that is actually a function that the underlying platform can support. It enables for automatic translation. For example, a user can define that he writes in Spanish, his friend define that he reads in Italian, and still be able to communicate in their native languages, a feature that at least for now cannot be found in any other IM mobile client.

Moving on to entertainment portion, we arrive to what we tenderly call "Wi-Lol". There is another trick here. What it can do? You simply type the game you want to play, and the module (using the platform's middeware) will locate free and available flash games to play in your mobile. The difference with other similar functions is that we do not store a single byte of game content. So how is it done? The platform defines a search API, which can have many "flavors" (inheritance in programming words), and these can be a web search, game search, or a music search. Each of these kinds share some common properties (base class) but in the meantime have many differences. The game search module of the Wi-Not locates flash files and reports back available links. With the embedded flash player it is possible to play it on the fly!

There is also a great option for getting music on your pocket pc. We have put there a special media finder, which can accept keywords for your search and locate audio files. This is entirely new for a mobile application. Again, we do not store a single byte of audio content. With a hit rate of around 90% you will able to get the mp3 file you were looking for. Again the platform here plays the most crucial role. It is able to maintain places that are more probable to provide with an answer and also to dynamically find more.

I could really talk for hours about this stuff but I am sure it will bore most of the people. There are certainly many more posts coming about this effort. We are really excited about the outcome and ready to do more. There are also many hard lessons from releasing into public your creations and perhaps I should talk about it later.

Till then you are more than welcome to download Wi-Not Mobile from here and I would be grateful if you could provide with your thoughts about it. Have fun!

Monday, June 2, 2008

Catalan Numbers, Dyck Words and Robots Climbing Stairs

Repeating is the very nature of recursion, so this article continues over from the previous post over recursion and stack usage. If you don't remember the initial problem here it is again:

"Imagine a robot(say Asimo) that climbs up stairs and has one very simple function step() which succeeds with a probability p and climbs up one stair, and fails with a probability q and climbs down one stair, with p>q and p+q=1. Write a function step_up() guaranteeing that the robot will climb up exactly one stair."

It is easy to create a quick and dirty recursive solution:

void step_up() {

if(!step())
{

step_up();
step_up();

}

}

While this solution is not the best (a better one is provided also in the previous post with the while version), it is most interesting to calculate the number of times that step_up() will be called, in other words to calculate how large the stack size will be.

There were two solutions available: To calculate it analytically or use a recursive technique to do it. With the recursive technique we can set: If X is the number of average calls to step_up() then X = p.1+q.(1+2.X) which yields X = 1 / (p-q) . Both easy and really smart!

However, going analytically is much more difficult but it is most amusing! As I said in the previous post, going the analytical way would mean trouble. Trying it myself I left it unfinished when facing a difficult combinatorics sub-problem. By a really weird coincidence, today I stumbled upon a number of interesting combinatorics tools which could help solve it analytically once and for all!. Here is how it goes.

First, it is clear that is impossible that the step_up() function will be executed an even number of times, for each failed execution two more executions occur. We should wonder: What is the probability that the function will enter 1 time? Clearly it is p. What is the probability of entering 3 times? This would mean failing the first time and succeeding the other 2 times, a total probability of q.p.p = q.p^2. If we want to calculate how many times the function will be executed 5 times we will have to count all possible 'combinations'. We will follow the encoding X1-X2-X3-...-Xn , if we would like to denote a case where the function is executed n times where Xi is either F or S, meaning failure or success at step i respectively.

With this notation we have:
step_up() enters 1 time: Cases: S Combinations:1
step_up() enters 3 times: Cases: F-S-S Combinations:1
step_up() enters 5 times: Cases: F-F-S-S-S, F-S-F-S-S Combinations: 2
step_up() enters 7 times:
Cases: F-F-F-S-S-S-S, F-F-S-F-S-S-S, F-F-S-S-F-S-S, F-S-F-F-S-S-S, F-S-F-S-F-S-S
Combinations: 5

Stop here. There is one important property of every X-X-X-X.. string with n elements, which is that for any initial substring the number of S's is not greater than of the F's. Why? Because if this was the case , the robot would climb up one stair sooner than n steps! Cool! Now, one more thing: It is easy to establish that the last element of the every string will be S, which actually means that the last step of the robot will be to the upwards direction. Ok, this is clear. So let's formulate again the problem.

We have a string with two possible letters, F and S, of length 2.n, in which we have n F's and n S's and for every initial substring the number of S's does not exceed the number of F's. Now we want to calculate the number of all valid combinations for this case. But how to do that? Well, here comes the so-called Dyck words because what we described previously is what exactly how a Dyck word is defined! Now, there is this theorem that states that there exactly Cn Dyck words of length 2.n. What is Cn? But the Catalan numbers of course!

What are the Catalan numbers?? There are defined with the following equation
C_n = \frac{1}{n+1}{2n\choose n} = \frac{(2n)!}{(n+1)!\,n!} \qquad\mbox{ for }n\ge 0.
and they occur very often in combinatorics problems. In our problem they calculate all possible 'configurations' in which the step_up() function will be called exactly (2.n+1) times (Remember that the last step is always up, this means the last call is always successful). This means that the average calls of step_up() is calculated by, Now the generating function of the Catalan numbers in the following equation
c(x)=\sum_{n=0}^\infty C_n x^n.
is given by the equation
c(x) = \frac{1-\sqrt{1-4x}}{2x}
Using these equations in combination with the derivative of c(x) for the case with the n coefficient it is easy to arrive to to X = 1 / (p-q)
Wow! Recursion, probabilities, encoding, Dyck words, Catalan numbers, generating functions, all combined to arrive analytically to the complexity of the recursive function step_up()!! Now, what you will prefer, depends on you: Either an elegant and smart solution or a rigorous, brute but educating analytical approach! Your call!

Tuesday, April 15, 2008

"Time Is What Prevents Everything From Happening At Once.." - John Wheeler (1911-2008)

The quote continues: "Space is what prevents everything from happening to me!"
John Wheeler, has for a long time, been studying advanced concepts of theoretical physics, which for the most people are somewhat taken for granted, space,time, gravity and so on. He passed away in Sunday, 13th of April 2008, at the age of 97.

John Wheeler was one among the last of the 'big school' of theoretical physics. He has worked with legendary figures such as Niels Bohr, Oppenheimer and Albert Einstein(related photo) and he is the inventor of the popular term 'black hole', this strange 'creature' that is so big that even its own image cannot evade from its infinite gravity (visible light cannot reflect or be emitted) At some point in 1941 Wheeler joined the famous (or infamous) Manhattan project. Among his graduate students as a professor in Princeton, were Richard Feynman and Kip Thorne.

I suppose, everybody from a CS background has developed at some stage of his life a certain interest for astronomy and physics. I, my self, at a young age was a fan of such fields along with science fiction. I never actually understood what was going on with these strange quantum equations but I think I find my self equally astonished when confronted with concepts like 'black holes' or 'time bending'. Astonished at least.

You can find lots more information in the official press release from the Princeton University site.

Sunday, April 6, 2008

Why The Array Index Should Start From 0

I recently came across an article from Dijkstra(photo) explaining a fact that looked obvious to me before: Why should start indexing an array from 0. I was amused to find out that there is some clever thinking behind this.

All popular languages, like C/C++, Java or Perl start indexing an array from 0 while the last index is the array length minus 1. While this is usual to most developers, it is not a de facto situation for all programming languages. For example in Fortran, when you declare an array with 'integer a(10)', aka an int array having 10 elements, the index starts from 1 and ends at 10 (however you can override this behavior using a statement like, integer a(0:9), declaring an array with indices from 0 to 9, but anyway).

The discussion over why the array index should start at zero is not a trivial one and relates to interesting concepts from computer science. First of all, it has strong relation to language design. For example in C, the name of an array is essentially a pointer, a reference to a memory location, and so the expression array[n] refers to a memory location n-elements away from the starting element. This means that the index is used as an offset. The first element of the array is exactly contained in the memory location that array refers (0 elements away), so it should be denoted as array[0]. Most programming languages have been designed this way, so indexing from 0 is pretty much inherent to the language.

However, Dijkstra explains why we should index from 0. This is a problem on how to denote a subsequence of natural numbers, say for example 1,2,3,...,10. We have four solutions available:
a. 0<i<11
b. 1
<=i<11
c. 0
<i<=10
d. 1
<=i<=10

Dijkstra argues that the proper notation should be able to denote naturally the two following cases:
1. The subsequence includes the smallest natural number, 0
2. The subsequence is empty


Requirement 1. leaves out a. and c. since they would have the form -1<i which uses a number not lying in the natural number set (Dijkstra says this is ugly). So we are left with b. and d. Now requirement 2. leaves out d. since for a set including 0 that is shrunk to the empty one, d. takes the form 0<=i<=-1, which is a little...well, messed up! Subtracting the ranges in b. we also get the sequence length, which is another plus. Hence we are left with b. which is by far the most widely used notation in programming now.

Now you know. So, remember and take pride in the fact that each time you write something like

for(i=0;i
<N;i++)
{
sum+= a[i];
}
you are not just following the rules of language notation. You are also promoting mathematical beauty!

Thursday, April 3, 2008

Perl by Example: English Dictionary In 22 Lines

[Update #1] This post has been cited from Princeton's WordNet Project as a nice Perl extension. You can find lots of more of information about WordNet here.

[Update #2] Many thanks to my brother Costas, for buying me the ultimate Perl book: "Programming Perl, 3rd edition" by Larry Wall. Thanx bro! :)

WordNet is an online English dictionary from the Princeton University. It has some hundreds of thousands of words and it keeps growing. You have absolute access to the resource and there are interfaces in practically all programming languages. It is not tough to build one from your own, but I ceize the opportunity to make an introduction to a really special programming language: Perl. Of course a crappy blog post cannot introduce a such extensive language as Perl, but additional Perl articles are sure to follow!

For the history, Perl was designed as a so-called 'glue' language for Unix, simply for doing things that with other tools were somewhat difficult. Created in the mid-80's by Larry Wall, in a one-man project as a fast reporting tool, it is now spread to all machines and applications. I have never used Perl for large GUI-demanding projects, but it has been a killer for simple but heavy tasks.

Whatever the objections, you may find Perl great for one simple but rare thing: Perl has culture. Its creator and all people actively involved in Perl's development share a number of values: Freedom, innovation and a unique sense of humor. To explain, I have two Perl books in my library: "Learning Perl" and "Programming Perl", affectionately called by the Perl community as the "Alpaca Book" and the "Camel Book" for the animals on the cover (Camel is the Perl mascot as shown in the figure). For me they are the most easy-read programming books that I have read. Anything else I use as a reference because I find programming books too hard to follow for long. But these differ. The style of writing and the humor that they transmit make you read them in literally one breath.

Anyway, the best way to introduce a language is by example, so here is mine for Perl: To build an English dictionary based on Princeton's WordNet. Believe it or not, all in all it takes 22 lines to do it! You may see the interface with the program in the following video. When you are done you might want to check the source code with some comments for better understanding.


The Perl source code follows:

my $word; #declares a scalar variable.Perl has two basic types: scalars and plurals(lists,hashes)
our $message = "
\nPlease enter a word to look for:\n>";
print $message;
#prints our message
while($word = <>) #reads the word the user searches for
{
chomp($word);
#remove last character (the '\n', new line)
if($word eq "<") { system("cls"); #if user types "<" clear screen
print $message; }
else {
print "Connecting..";
my $url = "http://wordnet.princeton.edu/perl/webwn"."?"."s=".$word;
#WordNet URL
`wget -q -Oindex.html $url`; #download the results
open FILE,"<","index.html"; #open the result web page locally
print "\rOk.Results:";
while(<FILE>)
#for every line in file
{
if(/<\s*b>$word<\/b>[^\(]+\(([^\)]*)\)/)
#match a pattern, don't be scared!
{ print "\n#def#$1"; } #..and print the word definition !
}
close FILE;
#close file
print "\n>";
}
}



What might scare you off is the pattern matching line, especially if you have never come across regular expressions before. The logic is this: In WordNet result page, each word definition comes with original word in bold and the explanation in parentheses. For example if we search for 'developer' it may have the following format:
...developer...(a man who develops)...
...developer...(sometimes referring people creating software)...and so on
What the pattern matching does it to match if the line has the word we search for in bold and then extracts the sentence in parentheses.
When it comes to pattern matching, Perl is the king.

That's all. If you already have a Perl distribution, you can test it yourself. If you don't and you would like to give it try, for Windows there mainly two distributions. The one is from ActiveState and you can download a Windows Perl distribution named ActivePerl here (just click the download button). The other is called Strawberry Perl, it recently came out of beta and you can find it here. I haven't tried but it may well be good.

Needless to say, Perl is free and always will be! So ride the camel and enjoy!