Wednesday, August 26, 2009

Sliding Window

This is the third post in a series on Real-Time Search. The previous posts were an introduction to Real-Time Search and RAM monster. If you are also interested in a good article on some recent insights about search, take a look at a discussion with the SurfCanyon CEO.

One prominent feature of current real-time search systems that has so far received little attention is the sliding-window nature of their indexing. By this we mean that only results within a fixed time interval are indexed; as new results appear, the oldest ones are discarded.
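To make the sliding-window behavior concrete, here is a minimal sketch of such an index. It is not taken from any of the systems discussed; the class name SlidingWindowIndex, the whitespace tokenization, and the assumption that documents arrive in non-decreasing timestamp order are all ours, for illustration only.

```python
from collections import defaultdict, deque


class SlidingWindowIndex:
    """Toy inverted index that keeps only documents inside a fixed time window.

    Hypothetical illustration: assumes timestamps (in seconds) arrive in
    non-decreasing order and that text is tokenized on whitespace.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.docs = deque()                # (timestamp, doc_id, terms), oldest first
        self.postings = defaultdict(set)   # term -> set of doc_ids

    def add(self, timestamp, doc_id, text):
        terms = set(text.lower().split())
        self.docs.append((timestamp, doc_id, terms))
        for term in terms:
            self.postings[term].add(doc_id)
        # Every insertion pushes the window forward and drops what fell out.
        self._evict(now=timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self.docs and self.docs[0][0] < cutoff:
            _, old_id, old_terms = self.docs.popleft()
            for term in old_terms:
                self.postings[term].discard(old_id)
                if not self.postings[term]:
                    del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


index = SlidingWindowIndex(window_seconds=60)
index.add(0, "a", "earthquake report")
index.add(30, "b", "earthquake update")
before = index.search("earthquake")   # both documents are still in the window
index.add(100, "c", "sports news")    # window is now [40, 100]
after = index.search("earthquake")    # older documents have been discarded
```

Once the window moves past a document, no query can reach it, however relevant it was.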

We strongly believe that this approach is fundamentally flawed, as it can return quality results only within a short, fixed time window and only for a limited fraction of content. There are many topics, events, discussions and quality results that fall outside of any fixed-size window. Indeed, we would argue that the span of most topics of interest falls outside any fixed-size window and that, as a result, the current approach to real-time search is essentially broken.

Preservation of older results is an essential feature of search engines in general, including real-time search. Of course, the ranking weight of older results decreases with time, but they must remain accessible, because for many queries there are no quality fresh results or, more specifically, no quality results inside any fixed sliding window.
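One way to picture a ranking weight that decreases with time without ever discarding a result is a decay factor applied to a base relevance score. The sketch below assumes exponential decay with a configurable half-life; the post does not prescribe any particular decay function, and the names decayed_score, rank and half_life_seconds are hypothetical.

```python
def decayed_score(relevance, age_seconds, half_life_seconds):
    """Halve the weight of a result for every half-life of age (assumed decay form)."""
    return relevance * 0.5 ** (age_seconds / half_life_seconds)


def rank(results, now, half_life_seconds=3600.0):
    """Order (doc_id, relevance, timestamp) tuples by time-decayed score.

    Nothing is dropped: old results sink in the ranking but stay retrievable.
    """
    scored = [
        (decayed_score(relevance, now - timestamp, half_life_seconds), doc_id)
        for doc_id, relevance, timestamp in results
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


now = 7200  # seconds; two half-lives after the oldest result
results = [
    ("old_strong", 8.0, 0),       # highly relevant, two hours old
    ("fresh_weak", 1.0, 7200),    # marginally relevant, brand new
]
ranking_without_fresh_quality = rank(results, now)
ranking_with_fresh_quality = rank(results + [("fresh_strong", 3.0, 7200)], now)
```

In this toy data the older result leads when the only fresh result is weak, and yields the top spot as soon as a fresh result of comparable quality appears.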

There has not been much discussion about the relationship between general and real-time search. We believe there is a very strong, even formal, relationship between them. In simple terms, a real-time search engine that preserves older results, with proper time-based biasing in rankings, would in time start to converge to a general search system. The resulting general search engine would be of the highest quality because of its unprecedented level of freshness.


Barry Engel said...

I'm seeing Twitter results now in Google search. Are you suggesting Google is only keeping those results for a limited time?

Borislav Agapiev said...

No, I am not suggesting it for Google; however, their coverage of Twitter is sporadic and they have to consider many other (legacy) things in their main search results. Even with Caffeine, Google does not even come close to Twitter et al. in recency and coverage in real-time search, IMHO.

I was principally referring to specialized real-time search engines such as TweetMeme, OnRiot, Topsy, Collecta, CrowdEye etc.