Wednesday, December 9, 2009

Google Launches Real-Time Search

Google has just announced a real-time search offering, which is quite an interesting development. I would like to comment on a few of the important issues in real-time search.

Ranking is a key issue in real-time search, just as it is in general search. Some believe that one can produce a quality real-time search experience simply by filtering real-time information streams against keywords. The deficiencies of such approaches become clear very quickly. There is so much noise that interesting results are drowned out, and in many cases a slightly older result is much better than a slightly newer one and should be ranked much higher.
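To make the point about older results concrete, here is a minimal sketch of a score that weighs quality against freshness instead of sorting by time alone. Everything in it is an assumption for illustration: the `Result` fields, the quality values, the one-hour half-life, and the exponential decay are hypothetical and do not describe how Wowd or Google rank results.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    quality: float      # hypothetical quality score in [0, 1]
    age_seconds: float  # time since the item was published


def score(result, half_life_seconds=3600.0):
    """Quality discounted by an exponential freshness decay (assumed form)."""
    freshness = 0.5 ** (result.age_seconds / half_life_seconds)
    return result.quality * freshness


def rank(results, half_life_seconds=3600.0):
    """Order results by combined score, best first."""
    return sorted(results, key=lambda r: score(r, half_life_seconds), reverse=True)


def newest_first(results):
    """The naive keyword-filter ordering: most recent item on top."""
    return sorted(results, key=lambda r: r.age_seconds)


results = [
    Result("example.com/noise", quality=0.1, age_seconds=60),
    Result("example.com/report", quality=0.9, age_seconds=600),
]

print([r.url for r in newest_first(results)])
print([r.url for r in rank(results)])
```

With these made-up numbers, the newest-first ordering puts the low-quality item on top, while the combined score puts the ten-minute-old, higher-quality item first.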

One of the key issues in ranking is how to include user 'editorial' input in a scalable way, meaning how to capture information about what people are looking at and scale it across the entire Web. Wowd uses EdgeRank, a distributed link-based algorithm that merges user input with the power of link-based algorithms to provide across-the-web ranking. User input is critical both for discovering new information and for finding the highest-quality results. The power of distributed link-based algorithms is needed to rank the enormous datasets on the Web. EdgeRank accomplishes both.
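The details of EdgeRank are not described in this post, so the following is not EdgeRank. It is only a generic, single-machine sketch of one well-known way to blend user input with link analysis: a PageRank-style power iteration whose teleport distribution is biased toward pages that users actually clicked. The function name, the `clicks` input, and the damping value are assumptions made for illustration.

```python
def attention_biased_rank(links, clicks, damping=0.85, iterations=50):
    """PageRank-style scores with a click-weighted teleport vector.

    links:  dict mapping a page to the list of pages it links to
    clicks: dict mapping a page to an observed click count (hypothetical input)
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts} | set(clicks))
    total_clicks = sum(clicks.get(p, 0) for p in pages)
    if total_clicks > 0:
        teleport = {p: clicks.get(p, 0) / total_clicks for p in pages}
    else:
        # No user input: fall back to the uniform teleport of plain PageRank.
        teleport = {p: 1.0 / len(pages) for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * teleport[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling page: spread its rank according to the teleport vector.
                for q in pages:
                    nxt[q] += damping * rank[p] * teleport[q]
        rank = nxt
    return rank


links = {"a": ["b"], "b": ["a"], "c": ["a"]}
without_clicks = attention_biased_rank(links, clicks={})
with_clicks = attention_biased_rank(links, clicks={"c": 10})
print(without_clicks["c"], with_clicks["c"])
```

In this toy graph nothing links to page `c`, so link structure alone gives it the minimum score. Once users click on it, its score rises, which is the sense in which user input helps surface new information that links have not yet caught up with.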

In addition, there is the issue of what to do with older results. They can be very important, either when there are not yet enough new results or when their quality means they should be displayed together with newer ones. It follows that older results should be retained for future use. In fact, a good real-time search engine that preserves older results starts converging toward a good general search engine. As a consequence, the resources required for real-time search are no smaller than those for general search. In fact, they are greater.

Wowd uses a completely scalable approach, in which the resources in the system grow with the number of users. In contrast to centralized systems, where an increased number of users degrades performance, in our distributed approach additional users increase performance, both by contributing additional resources and by expanding the attention frontier. By attention frontier we mean the collection of what real people find interesting on the Web at any given moment. This data is very important, as Twitter has clearly demonstrated. People worldwide are clicking on tens of billions of links daily, and what's in Twitter is only a fraction of a percent of that.

There is another important aspect of our distributed approach. One might say that no one can touch the power of Google's 50+ distributed data centers. But in a distributed system, the results are served from a neighborhood of nodes close to the requestor. As the network grows and the number of nodes increases, the results are served from immediate neighbors that are probably closer to the requesting user than any data center could be.

We are looking forward to Google's entrance into real-time search, as it is a step in the right direction toward improving an important aspect of search that has been missing. We would like to analyze their offering to see what sources beyond Twitter are being indexed in real time, and what fraction of their entire index is updated in real time.
