Friday, November 19, 2010

Compressing the web

I just had a very pleasant time reading Frederic Filloux's post on his wish list for a future word processor. In a nutshell, what he is looking for is a context-aware spell checker, so that he does not need to constantly teach the dictionary new words and surnames. It goes even further: citing a famous example from Google Wave, the sentence "Icland is an icland" gets automatically corrected into "Iceland is an island", the most probable spelling.

We can easily get a glimpse of this feature by simply misspelling a word in the Google search engine: although the search is still performed, a correction is also automatically proposed.

Now just a word on how this correction is produced: it is no longer a lookup into a reference dictionary. No, Google has enough storage to download (almost) the whole web, study it, and extract meaningful statistics out of it. Its engine is not taught English lessons; it learns the language by itself, just by continuously reading English texts. And it does the same for any other language.
It does not even stop there: the engine can infer equivalences, correct spellings and grammatical rules from this material. Moreover, it can extract enough information from the current context to propose the proper correction out of several possibilities.
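
To make this more concrete, here is a toy sketch of such a statistics-driven corrector (in Python, with a hypothetical corpus.txt standing in for "the web"; every name here is illustrative, this is obviously not Google's actual pipeline). Candidates within one edit of the typo are ranked by how well they fit next to their left neighbour in the corpus:

```python
from collections import Counter
import re

# Hypothetical corpus: any large plain-text file stands in for "the web".
words = re.findall(r"[a-z]+", open("corpus.txt").read().lower())
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

def edits1(word):
    """Every string within one deletion, replacement or insertion of 'word'."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(prev, word):
    """Keep only candidates actually seen in the corpus, then prefer the one
    most frequently observed right after 'prev' (the context-aware part)."""
    candidates = {w for w in edits1(word) | {word} if w in unigrams}
    if not candidates:
        return word
    return max(candidates, key=lambda w: (bigrams[(prev, w)], unigrams[w]))

print(correct("an", "icland"))   # both "iceland" and "island" are one edit away
```

Note that both "iceland" and "island" sit at one edit from "icland"; with large enough statistics, the preceding word tips the scale one way or the other. The context does the disambiguation, not the dictionary.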

So where does that lead us?
Guessing the most probable next sequence from the current context and from what has been learned from past history: doesn't that sound like compression?
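
The link is in fact direct: information theory tells us that a symbol predicted with probability p costs -log2(p) bits with an optimal coder, and an arithmetic coder gets arbitrarily close to that bound. So the better the prediction, the smaller the output. A minimal illustration, reusing the toy bigram counters from the sketch above:

```python
import math

def cost_in_bits(sentence):
    """Ideal compressed size of a word sequence under the bigram model
    (add-one smoothing keeps unseen pairs at a small non-zero probability)."""
    ws = sentence.lower().split()
    vocab = len(unigrams)
    bits = 0.0
    for prev, cur in zip(ws, ws[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        bits += -math.log2(p)    # Shannon code length of this word
    return bits

print(cost_in_bits("iceland is an island"))   # frequent pairs -> few bits
print(cost_in_bits("icland is an icland"))    # rare pairs -> many bits
```

A plausible sentence gets a short encoding, a garbled one a long encoding: the corrector and the compressor are ranking the very same probabilities.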

It must be underlined here that "past history" refers to "the web" at large. Therefore, digging into such a massive amount of data and still producing an answer within a fraction of a second is quite a feat, and should inspire compression algorithms (or the other way round).

As stated in a previous post, searching deeply into a dictionary of size N requires 4xN or 8xN memory. Obviously you cannot afford that for a dictionary of "Internet" size, so a direct search is out of the question. Building a statistical structure out of this huge source is likely to produce a much more compact dataset, and as a consequence a usable and fast search algorithm.
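
To see why statistics scale better, compare the two structures on the same input. This is a rough Python sketch under simplifying assumptions (presumably the 4xN or 8xN figure corresponds to one 4- or 8-byte pointer per input position, and fixed-order byte contexts stand in for real models): an LZ-style index must record every position, while a context model keeps one counter per distinct (context, symbol) pair, a number that saturates however large the input grows.

```python
from collections import Counter

def positional_index(data):
    """LZ-style match finder: one entry per input position -> O(N) memory."""
    index = {}
    for i in range(len(data) - 3):
        index.setdefault(data[i:i+4], []).append(i)  # every position is recorded
    return index

def context_stats(data, order=2):
    """Statistical model: one counter per distinct (context, next byte) pair.
    Its size saturates once all useful contexts have been seen."""
    stats = Counter()
    for i in range(order, len(data)):
        stats[(data[i-order:i], data[i])] += 1
    return stats

data = open("corpus.txt", "rb").read()
idx = positional_index(data)
print(sum(len(p) for p in idx.values()))  # grows linearly with the input
print(len(context_stats(data)))           # bounded, whatever the input size
```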

Looking for compression algorithms working with statistics rather than a dictionary? Look no further: think of PPM or CCM.
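
For the curious, here is a drastically simplified flavour of the PPM idea: predict the next byte from its recent context, and fall back to a lower order when the context is of no help. Fair warning: this toy skips PPM's escape mechanism entirely (so its probabilities do not even form a coherent distribution) and has nothing to do with CCM's actual implementation; it only shows the predict-then-code principle.

```python
from collections import Counter, defaultdict
import math

class ToyPPM:
    """Order-2 byte model with a crude order-0 fallback.
    Real PPM adds escape symbols, higher orders and an arithmetic coder."""
    def __init__(self):
        self.o2 = defaultdict(Counter)   # statistics per 2-byte context
        self.o0 = Counter()              # global byte statistics

    def prob(self, ctx, byte):
        stats = self.o2.get(ctx)
        if stats and byte in stats:                 # context already seen
            return stats[byte] / sum(stats.values())
        total = sum(self.o0.values())
        return (self.o0[byte] + 1) / (total + 256)  # smoothed order-0 fallback

    def update(self, ctx, byte):
        self.o2[ctx][byte] += 1
        self.o0[byte] += 1

def ideal_size_bits(data):
    """Charge each byte -log2(p) bits, what an arithmetic coder would spend."""
    model, bits = ToyPPM(), 0.0
    for i in range(len(data)):
        ctx, byte = data[max(0, i - 2):i], data[i]
        bits += -math.log2(model.prob(ctx, byte))
        model.update(ctx, byte)          # adaptive: the model learns as it reads
    return bits

data = open("corpus.txt", "rb").read()
print(ideal_size_bits(data) / 8, "bytes (ideal) vs", len(data), "bytes (raw)")
```

The model learns as it reads, just like the web-scale learner above, only much smaller.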
