Sunday, October 12, 2008

Distributed Search Cloud

First, let me say a few words about the long hiatus since the last post - the main reason is that a great deal has happened since then, much of it related to my vision of distributed search :)

I am working on a new startup with a clear mission: advancing distributed search and leveraging user input as much as possible, in a massively scalable way.

We have assembled a great core team of really motivated and enthusiastic people, and we are also very happy that our investors share our passion for distributed computing and search.

We are still in stealth, so I cannot say much about the details yet, but I can promise you that pretty soon (within months) you will be able to see what we are up to. In the meantime, feel free to drop me an email at boris-at-edgios-dot-com if you would like to be notified of news, if you have comments, ideas, or suggestions, or if you are simply interested in a discussion.

In the following posts I will try to shed some light on the state of the art in search, including some perceptions that are not very well grounded in reality :)

Today, let us start with some facts about the scope of computing resources on (hundreds of) millions of user desktops and how these resources relate to the problem of search, which is one of the most complex problems on the Internet today.

Note that the size of the entire Web in terms of storage is not that large - assuming 10 kB/page of indexable information, stripped of unnecessary content (HTML formatting etc.) and possibly compressed, 40 billion pages (roughly the size of Google's index) would amount to 400 TB, which is not very much at all. One can fit 400 TB in 100 1U servers occupying three racks.
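To make the arithmetic explicit, here is a minimal sketch of that sizing in Python; the 10 kB/page and 40 billion page figures are the ones above, and the ~4 TB of usable disk per 1U server is just the number implied by "400 TB in 100 servers" (decimal units throughout for simplicity):

```python
# Back-of-the-envelope sizing of a full-Web corpus, using the post's assumptions:
# 10 kB of stripped, indexable text per page and 40 billion pages in the index.
PAGES = 40e9                  # pages in the index
BYTES_PER_PAGE = 10_000       # 10 kB of indexable text per page
TB = 1e12                     # decimal terabyte

corpus_tb = PAGES * BYTES_PER_PAGE / TB
print(f"corpus size: {corpus_tb:,.0f} TB")              # 400 TB

SERVER_TB = 4                 # assumed usable disk per 1U server
print(f"servers needed: {corpus_tb / SERVER_TB:,.0f}")  # 100
```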

Note that such a corpus, together with an accompanying index, would constitute a single search cluster capable of covering the entire Internet but able to serve only a few dozen qps (queries per second). The massive resource increase comes from the need to replicate such a cluster hundreds of times to support query loads of thousands of qps, with peaks over 10K (what Google serves).
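To get a rough feel for where "hundreds of times" comes from, here is a small sketch; the 40 qps per-cluster figure is my own assumption standing in for the "few dozen" above:

```python
import math

# How many full-corpus replicas does a Google-scale peak load require,
# if a single cluster only sustains a few dozen qps?
PEAK_QPS = 10_000     # peak query load cited above
CLUSTER_QPS = 40      # assumed sustainable throughput of one cluster (illustrative)

replicas = math.ceil(PEAK_QPS / CLUSTER_QPS)
print(f"replicas needed: {replicas}")   # 250 - i.e. hundreds of copies of the corpus
```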

Note that this replication is not uniform: you will have to replicate popular content much more than the rest, and you had better be right about the popularity distribution of the queries - otherwise your system will be brought to its knees, or, if you are really rich, you will have lots of machines sitting there doing nothing.
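To illustrate the non-uniform replication point (this is just a sketch of the idea, not a description of any particular system's policy), replicas can be allocated per index shard in proportion to the query load that shard is expected to see - so mis-estimating the popularity distribution translates directly into over- or under-provisioning. The shard names, loads, and per-replica throughput below are all hypothetical:

```python
from math import ceil

def replicas_per_shard(shard_qps, qps_per_replica=40):
    """Allocate replicas to index shards in proportion to expected query load.

    shard_qps: dict mapping shard name -> expected queries per second.
    qps_per_replica: assumed throughput of a single replica (illustrative figure).
    """
    return {shard: max(1, ceil(qps / qps_per_replica))
            for shard, qps in shard_qps.items()}

# Hypothetical popularity skew: the "head" shard receives most of the traffic.
load = {"head-terms": 8_000, "torso-terms": 1_500, "tail-terms": 500}
print(replicas_per_shard(load))
# {'head-terms': 200, 'torso-terms': 38, 'tail-terms': 13}
```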

The problems at Cuil's launch were, IMHO, due to under-provisioning for the expected query load, both statically (the total number of queries) and dynamically (not knowing which queries would be most popular).

There are several classes of resources available in abundance on user desktops (the quick calculation after the list pulls these numbers together):
  • Bandwidth - with wide adoption of broadband, the aggregate bandwidth out there is mind-boggling. For instance, consider crawling: a rough rule of thumb is that one can crawl 1 million pages a day with 1 Mbps of bandwidth. That means that a group of, say, 100,000 motivated users could theoretically crawl 100 billion pages a day - more than twice the size of Google's index. One might argue that 1 Mbps is a lot to ask for crawling, but think how much bandwidth is consumed every night during prime time on cable TV hawking the latest (extremely important :) ) entertainment wares (16 Mbps+ for digital TV). The rise of YouTube and video on the Internet is also driving rapid increases in bandwidth.
  • RAM - the amount of aggregate memory is truly staggering. The above group of 100,000 users, each running an application using 100 MB of RAM (not small, but also not a big deal these days, thank you MS Vista :) ), would have 10 TB of RAM!! Hint - you can fit a BIIIG index in there :) It is also very handy if you want to keep that index fresh and update it incrementally...
  • Disk - assuming 10GB/user, again not a big deal these days when media downloading is ubiquitous, the total storage would be 1000 TB - 1 Petabyte, enough for more than two copies of the entire Web.
  • CPU - 100,000 CPUs and many more cores bring a lot of compute cycles to the table.
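Pulling the per-user figures above into one place, here is a small sketch that scales them to an arbitrary cloud size. The per-user numbers (1 Mbps of bandwidth, 100 MB of RAM, 10 GB of disk) are the same ones used in the list; the larger user counts simply preview what a million- or ten-million-user cloud would look like:

```python
# Aggregate resources of a hypothetical desktop cloud, using the per-user
# figures from the list above.
def cloud_capacity(users):
    pages_per_day = users * 1_000_000      # rule of thumb: 1 Mbps ~ 1M pages/day
    ram_tb = users * 100e6 / 1e12          # 100 MB of RAM per user
    disk_tb = users * 10e9 / 1e12          # 10 GB of disk per user
    return pages_per_day, ram_tb, disk_tb

for n in (100_000, 1_000_000, 10_000_000):
    pages, ram, disk = cloud_capacity(n)
    print(f"{n:>10,} users: {pages/1e9:6.0f}B pages/day, "
          f"{ram:8,.0f} TB RAM, {disk:10,.0f} TB disk")
# 100,000 users yield 100B pages/day, 10 TB of RAM, and 1,000 TB (1 PB) of disk.
```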

So a natural question arises: could we do search effectively with such a cloud, and if so, how good could it be?

In short, the answer to the first question is YES, and to the second - VERY GOOD. These two answers are, in essence, what my latest startup is about :)

The number 100,000 is chosen simply for illustration; there is nothing special about it. In fact, such a group of users is not considered very large at all today. On the other hand, just try to imagine a cloud with a million or ten million users. Its capabilities would be staggering.

Our vision is to build such a cloud and I invite you to join us in this journey.