Saturday, February 24, 2007

Limits of Search

Hello everyone, this blog is intended to be about search and exploring how far it can be pushed. I strongly believe that the ultimate limits of search lie in the distributed approach. I find it fascinating that the distributed approach is largely neglected and not taken very seriously in modern search. There are, of course, honorable exceptions, but really only a few.

Since this is my first post on this blog, a little bit about me. My name is Borislav Agapiev and I am the founder of Vast.com, an online classifieds search engine based in San Francisco. Go check it out if you are in the market for a classifieds item or if you simply like browsing ...

I founded the company in 2000 in Portland, Oregon as Omni-Explorer Technologies. I worked on it for several years with my friends and local angel investors from Portland. In 2005 we showed our technology to Naval Ravikant, former founder and CEO of epinions.com. He liked our stuff, so he joined the company as CEO and helped us close our Series A financing. We then moved from Portland to SF, and the rest is (our) history :-)

Google, of course, is the standard of modern search these days. They truly are a great company and what they have built is pretty amazing.

But the question still remains whether it is even possible for a startup to compete with them. More and more very smart people are starting to say NO, for instance Bill Burnham, Peter Norvig (Google Head of Research) and Louis Monier (former CEO of AltaVista, now at Google).

The essence of their argument is that the resources required are prohibitive. In particular, one needs huge resources for:
  • crawling
  • query serving
  • indexing
This argument is of course true: to compete head-on with Google, one needs to employ such resources in huge amounts.

Nevertheless, I strongly believe the answer to the question is YES: it is possible to build a search engine capable not only of competing with Google but of actually FAR surpassing it.

The trick, as always, is not to compete with the established players on their turf and on their terms, but to change both.

The key feature of the modern search approach is that it is centralized, in the sense of relying on centralized server farm(s). Within Google, GFS (Google File System) does use a distributed approach, connecting a sea of dumb Linux machines through an internal network. But the system as a whole is still centralized, in the sense that these machines are very tightly coupled through this network.

BTW, if you are not familiar with GFS, I strongly urge you to check out the GFS paper, which is probably the best published account of how Google works.

In the following posts I will try to shed some light on how to build a truly scalable distributed search engine and show that its ultimate limits actually lie far beyond Google (yes, I know it sounds outrageous and out there :-) ).

These thoughts are not just concepts and ideas; I am actually working on a new project to show how to do it. We are currently testing an alpha version and should have something to show pretty soon.

If you are interested in or curious about search and its limits, or have any comments, thoughts, ideas, musings or whatever, feel free to contact me and join us in this journey ...


1 comment:

Ner said...

I'm currently writing a paper on Distributed Search, and one website I found that addresses it is distributedsearch.net, but I just don't understand how the client which is installed on a computer works... If you know anything about it, I would greatly appreciate a response as soon as you can. No pressure though... Thank you!