Wednesday, August 26, 2009

Sliding Window

This is the third post in a series on Real-Time Search. The previous ones were an introduction to Real-Time Search and RAM Monster. If you are also interested in some recent insights about search, take a look at a discussion with the SurfCanyon CEO.

One prominent feature of current real-time systems, one that so far has not received much attention, is the sliding-window nature of their indexing. By this we mean that only results within a specified, fixed time interval are indexed and, as new results appear, the old ones are discarded.
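
To make the notion concrete, here is a minimal sketch of a sliding-window index (in Python; the class and its details are illustrative assumptions, not a description of any particular engine). Documents older than the window are evicted as new ones arrive, so anything outside the window simply cannot be returned:

```python
import time
from collections import deque, defaultdict

class SlidingWindowIndex:
    """Toy sliding-window index: only documents newer than `window_seconds` are searchable."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.docs = deque()                  # (timestamp, doc_id, terms), oldest first
        self.termlists = defaultdict(set)    # term -> set of doc ids currently in the window

    def add(self, doc_id, terms, timestamp=None):
        now = timestamp if timestamp is not None else time.time()
        self.docs.append((now, doc_id, set(terms)))
        for term in terms:
            self.termlists[term].add(doc_id)
        self._evict(now)

    def _evict(self, now):
        # As new results appear, the old ones are discarded: documents that have
        # slid out of the window become unreachable, however relevant they may be.
        while self.docs and now - self.docs[0][0] > self.window_seconds:
            _, doc_id, terms = self.docs.popleft()
            for term in terms:
                self.termlists[term].discard(doc_id)

    def search(self, term):
        return self.termlists.get(term, set())
```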

We strongly believe that this approach is fundamentally flawed, as it can only return quality results within a short, fixed time window and only for a limited fraction of content. There are many topics, events, discussions and quality results that fall outside of any fixed-size window. Indeed, we would argue that the span of most topics of interest falls outside any fixed-size window, and that, as a result, the current approach to real-time search is essentially broken.

Preservation of older results is an essential feature of search engines in general, including real-time search. Of course, the ranking weight of older results decreases with time, but their accessibility is still significant for many queries where there is a lack of quality fresh results or, more specifically, a lack of fresh results from any kind of fixed sliding window.
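
As a sketch of what we mean by decreasing ranking weight over time, consider an exponential decay applied to a base relevance score (the half-life and the functional form below are illustrative assumptions, not a description of any particular engine):

```python
import time

def time_decayed_score(base_relevance, doc_timestamp, now=None, half_life_hours=24.0):
    """Down-weight older results smoothly instead of discarding them outright.

    `base_relevance` is whatever score the engine would otherwise assign;
    the 24-hour half-life is an arbitrary illustrative choice.
    """
    now = now if now is not None else time.time()
    age_hours = max(0.0, (now - doc_timestamp) / 3600.0)
    decay = 0.5 ** (age_hours / half_life_hours)   # halves every half_life_hours
    return base_relevance * decay
```

Under such a scheme an older, highly relevant result retains a nonzero score and can still win a query for which no fresh, high-quality results exist.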

There has not been much discussion about the relationship between general and real-time search. We believe there is a very strong, even formal relationship between them. In simple terms, a real-time search engine that preserves older results, with proper time-based biasing in its rankings, would in time converge to a general search system. The resulting general search engine would be of the highest quality because of its unprecedented level of freshness.

Sunday, August 16, 2009

RAM Monster

This is the second installment of a four-part post that I'm doing on the general topic of Real-Time Search. The first post was an introduction to Real-Time Search.

Efficient use of disk resources is of great importance in controlling the cost of indexing, yet the real-time search problem has almost opposite requirements. It should not be surprising, then, that real-time search is very resource-intensive and costly. The cost of indexing cannot be amortized at all, since the index is constantly changing and the system must support continuous updates.

The simplest solution is to avoid use of disk altogether and store the entire index in RAM. This approach is severely limited by the amount of RAM that can be deployed at a reasonable cost. The problem is exacerbated by reliance on existing indexing systems (e.g. Relational Database Management Systems, Lucene) that have been optimized for disk-based operation and do not function optimally in pure-RAM environments. This is why we believe that new index architectures have to be created to address the requirements of real-time search.
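
As a rough sketch (purely illustrative, not a proposal for such an architecture), this is what supporting continuous changes looks like when the entire index lives in RAM: each document becomes searchable the moment it is added, with no batch rebuild or segment merge step.

```python
from collections import defaultdict

class InMemoryIndex:
    """Toy RAM-resident index: postings are updated in place, with no batch rebuild."""

    def __init__(self):
        self.postings = defaultdict(list)   # term -> list of doc ids

    def add_document(self, doc_id, text):
        # The document is searchable as soon as this loop finishes.
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)

    def query(self, term):
        return self.postings.get(term.lower(), [])

index = InMemoryIndex()
index.add_document("doc-1", "real time search")
print(index.query("search"))   # -> ['doc-1'] immediately, with no indexing lag
```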

Real-time search is an instance of general search and as such is subject to the scale of the entire Web and the rate of change of content on the Web. This is why scalability is of paramount importance. Scalability will be covered in more detail in subsequent posts.

Tuesday, August 11, 2009

Real-Time Search

Recently, I authored a general interest whitepaper on the timely topic of “Real-Time Search: Discovering the Web in Real-Time”. This paper is the first of several – so stay tuned!

There is much to say in the area of ‘Real-Time Search’, so it makes sense to address some of the finer points I touch on in the paper. Today’s post will serve as the first of four installments, examining the four key areas covered in the paper itself, which will be made available in its entirety later this month. First, let’s summarize what we are talking about when we say “real-time search”.

The Web has become the principal medium through which users access an ever-growing wealth of information. The world is advancing at an accelerating rate, yet the contemporary means of search and discovery, search engines, remain stagnant: they are lacking in their ability to deliver the Web to users in a timely fashion. Now, real-time search has emerged as the next generation, reducing the lags and delays in accessing quality information from the entire web.

The problem with most real-time search mechanisms has to do with comprehensiveness: in order to achieve fast performance, existing real-time search systems are limited to an extremely small set of web sites. The tension between real-time response and comprehensive coverage requires a new way of thinking about the entire architecture of search.

Real-Time Search has emerged as a fascinating new trend in search. Twitter search, as well as the recent redesign of the Twitter home page, has really put a spotlight on several key points:
  • The power of users and human-powered actions - even though Twitter now has a very large number of users (tens of millions), the fraction who include (shortened) links in their tweets is small. It is quite amazing that even such a fraction can create such a powerful platform, with many interesting and useful properties such as a very small lag before new content shows up in tweets, as well as search signals such as the number of retweets.

Modern search and its key component, indexing, have been very much influenced by historical factors such as the huge difference in latency between mass storage media (hard disks) and RAM. Those differences drove many decisions in indexing architectures, especially in the area of indexing speed.

The current real-time search activity is exposing most of those decisions as sub-optimal and driving the development of new architectures. We are very early in the real-time search game and there are really no new indexing architectures yet. The main technique seems to be to buy as much RAM as possible and try to fit as large an index into it as possible. That, on the surface, seems to work well, but it clearly does not scale. To deal with Web scale, new ground will have to be broken.

There are other pieces of the real-time search puzzle, such as the tradeoffs between freshness and coverage in centralized and distributed architectures. Obvious questions arise, such as the preservation of older results - to keep them or not?

Some more background on the topic, which puts a spotlight on the issue of ‘indexing’:

Indexing is a core part of modern search and Information Retrieval (IR) systems. Current approaches are still dominated by techniques developed a long time ago, for systems with hardware constraints very different from today’s. Real-time search presents a completely different set of challenges, requiring a completely different approach to indexing. The challenges are especially difficult because the scope is still enormous – the entire web! – and users expect information to be indexed as fast as it appears on the web, with a lag measured in seconds or (fractions of) minutes.

The issue of scope, or the scale of the problem, is very important. It used to be the case that one could index a very large corpus of data with a significant lag and expense of index creation (e.g. traditional search engines), or index a small corpus of data really fast (e.g. online news), but not both.

Our goal is to index the data appearing on the entire web, in real-time, with no compromises in quality or user experience.

One of the principal data structures in an index is a term list. This is essentially a list of (occurrences or IDs of) the documents in which a given term (keyword) appears. Simply put, an index can be thought of as a very large collection of term lists, together with a fast and efficient method for their lookup and retrieval.
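
A toy sketch of that picture (ignoring positions, compression and any ranking data):

```python
# A toy picture of an index as a collection of term lists:
# each term maps to the IDs of the documents that contain it.
index = {
    "search":    [1, 4, 7, 12],
    "real-time": [4, 12, 31],
    "indexing":  [7, 31],
}

# Lookup and retrieval of a term list is then a single, fast key lookup.
print(index.get("real-time", []))   # -> [4, 12, 31]
```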

The principal hardware constraint that drove index design is the difference in latency between disks (rotating magnetic media) and Random Access Memory (RAM). Modern disks have access latencies in the range of 10 ms, while RAM access, even in the case of processor cache “misses”, is on the order of 10 ns. That’s a difference of 6 orders of magnitude!
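
Just to make the arithmetic explicit (a trivial back-of-the-envelope check using the round numbers above):

```python
disk_latency_ns = 10e6   # 10 ms expressed in nanoseconds
ram_latency_ns = 10.0    # ~10 ns for a RAM access, even on a cache miss
print(disk_latency_ns / ram_latency_ns)   # -> 1000000.0, i.e. six orders of magnitude
```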

Note that historically there has been an implicit assumption that it is not possible to fit the entire index in RAM. This assumption is increasingly untrue and we will discuss its important ramifications in later sections.

The slow access time of disks and the necessity of using them resulted in architectures optimized to avoid disk head seeks, which meant storing index data (term lists) in consecutive sequences of bytes that are as long as possible. Ideally, we would want every term list to be stored as a single consecutive sequence on disk.

The easiest way to store data under such a scheme was to wait for a corpus of documents to grow large enough and then create an index in which all term lists are consecutive by construction. This process of index construction was very expensive in terms of time, processing and data transfer, but since it could be done infrequently the cost could be amortized. This is how batch indexing was born.
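
A minimal sketch of batch index construction (illustrative Python, not any engine's actual pipeline): postings are accumulated for the whole corpus, sorted by term, and only then laid out, so every term list ends up in one consecutive run by construction.

```python
from collections import defaultdict

def build_batch_index(corpus):
    """corpus: iterable of (doc_id, text). Returns (flat posting array, term -> (offset, length))."""
    postings = defaultdict(list)
    for doc_id, text in corpus:                 # first, a pass over the *entire* corpus
        for term in set(text.lower().split()):
            postings[term].append(doc_id)

    flat, offsets = [], {}
    for term in sorted(postings):               # lay every term list out contiguously
        offsets[term] = (len(flat), len(postings[term]))
        flat.extend(sorted(postings[term]))
    return flat, offsets

flat, offsets = build_batch_index([(1, "real time search"), (2, "batch indexing of search")])
start, length = offsets["search"]
print(flat[start:start + length])               # -> [1, 2]
```

Adding a single new document to such a layout would, in the worst case, shift most of the term lists, which is exactly why the cost was paid infrequently and amortized over long update cycles.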

Note that the first search engines had update cycles on the order of months! Even Google, until about 2003, had update cycles on the order of weeks. They have improved the lag for some specific types of content, but the update cycle for the bulk of Google’s index is still too long.

Another reason for the high cost of indexing was the advent of link-based ranking techniques such as Google's PageRank. Such techniques have significant processing costs since they require computation over the entire (sparse) Web link graph.
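
For completeness, here is a small sketch of the kind of computation involved: a plain power-iteration PageRank over a toy link graph (the damping factor and convergence test are textbook defaults, not Google's production settings).

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping each page to the list of pages it links to.

    Toy power iteration; dangling pages (no outlinks) are not handled here.
    """
    pages = list(links)
    n = len(pages)
    incoming = {p: [q for q in pages if p in links[q]] for p in pages}
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        new_rank = {
            p: (1.0 - damping) / n
               + damping * sum(rank[q] / len(links[q]) for q in incoming[p])
            for p in pages
        }
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy_graph))
```

Even on this toy graph the computation is iterative and global; at web scale it has to be repeated over billions of nodes, which is part of why full index updates were so expensive.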