Recently, I authored a general interest whitepaper on the timely topic of “Real-Time Search: Discovering the Web in Real-Time”. This paper is the first of several – so stay tuned!
There is much to say in the area of ‘Real-Time Search’, so it makes sense to address some of the finer points I touch on in the paper. Today’s post will serve as the first of four installments, each covering one of the four key areas addressed in the paper itself, which will be made available in its entirety later this month. First, let’s summarize what we are talking about when we say “real-time search”.
The Web has become the principal medium through which users access an ever-growing wealth of information. The world is advancing at an accelerating rate, yet the contemporary means of search and discovery, search engines, remain stagnant and fall short in their ability to deliver the Web to users in a timely fashion. Now, real-time search has emerged as the next generation, reducing the lags and delays in accessing quality information from the entire web.
The problem with most real-time search mechanisms has to do with comprehensiveness: in order to achieve fast performance, existing real-time search systems are limited to an extremely small set of web sites. The tension between real-time response and comprehensive coverage requires a new way of thinking about the entire architecture of search.
Real-Time Search has emerged as a fascinating new trend in search.
Twitter search, as well as their recent redesign of the home page, has really put a spotlight on several key points:
- The power of users and human-powered actions: even though Twitter now has a very large number of users (tens of millions), only a small fraction of them include (shortened) links in their tweets. It is quite amazing that even that fraction can create such a powerful platform, with many interesting and useful pieces such as the very small lag before new content shows up in tweets, as well as search signals such as the number of retweets (see the scoring sketch below).
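To make that last point concrete, here is a minimal, purely illustrative sketch of how recency and retweet count might be combined into a single score. The decay function, half-life, and weights are my own assumptions for illustration, not Twitter's actual ranking formula.

```python
import math
import time

def tweet_score(retweets, created_at, now=None, half_life_seconds=3600.0):
    """Toy relevance score combining retweet count with recency.

    The exponential decay and log-scaled retweet boost are illustrative
    choices, not Twitter's actual ranking formula.
    """
    now = now if now is not None else time.time()
    age = max(now - created_at, 0.0)
    freshness = math.exp(-age * math.log(2) / half_life_seconds)  # halves every hour
    popularity = math.log1p(retweets)                             # diminishing returns
    return freshness * (1.0 + popularity)

# A fresh tweet with few retweets can outrank an older, widely retweeted one.
now = time.time()
print(tweet_score(retweets=2,   created_at=now - 60,    now=now))   # one minute old
print(tweet_score(retweets=500, created_at=now - 86400, now=now))   # one day old
```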
Modern search, and its key component, indexing, have been very much influenced by historical factors such as the huge difference in latency between mass storage media (hard disks) and RAM. That difference drove many decisions in indexing architectures, especially in the area of indexing speed.
The current real-time search activity is exposing most of those decisions as sub-optimal and driving the development of new architectures. We are very early in the real-time search game and there are really no new indexing architectures yet. The main technique seems to be to buy as much RAM as possible and try to fit as large an index into it as possible. That, on the surface, seems to work well, but it clearly does not scale. To deal with Web scale, new ground will have to be broken.
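As an illustration of that technique, here is a minimal sketch of a RAM-resident index that appends postings as documents arrive. The class and method names are hypothetical, and everything that makes this hard at web scale (persistence, compression, sharding) is deliberately left out.

```python
from collections import defaultdict

class InMemoryRealtimeIndex:
    """Minimal RAM-resident index: postings are appended as documents arrive,
    so new content is searchable immediately. No persistence, no compression,
    no sharding -- which is exactly why this approach alone does not scale
    to the whole web."""

    def __init__(self):
        self._postings = defaultdict(list)   # term -> list of document IDs
        self._docs = {}                      # document ID -> original text

    def add_document(self, doc_id, text):
        self._docs[doc_id] = text
        for term in set(text.lower().split()):
            self._postings[term].append(doc_id)

    def search(self, term):
        return self._postings.get(term.lower(), [])

index = InMemoryRealtimeIndex()
index.add_document(1, "real-time search on the web")
index.add_document(2, "indexing the web at scale")
print(index.search("web"))   # [1, 2] -- visible immediately, no batch rebuild
```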
There are other pieces of the real-time search puzzle, such as the tradeoffs between freshness and coverage in centralized and distributed architectures. Obvious questions arise, such as the preservation of older results: keep them or not?
Some more background on the topic, which puts a spotlight on the issue of ‘indexing’:
Indexing is a core part of modern search and Information Retrieval (IR) systems. Current approaches are still dominated by techniques developed a long time ago, for systems with hardware constraints very different from today’s. Real-time search presents a completely different set of challenges requiring a completely different approach to indexing. The challenges are especially difficult because the scope is still enormous – the entire web! – and users expect information to be indexed as fast as it appears on the web, with a lag measured in seconds and (fractions of) minutes.
The issue of scope, or the scale of the problem, is very important. It used to be the case that one could index a very large corpus of data with a significant lag and expense of index creation (e.g. traditional search engines), or index a small corpus of data really fast (e.g. online news), but not both.
Our goal is to index the data appearing on the entire web, in real-time, with no compromises in quality or user experience.
One of the principal data structures in an index is the term list. This is essentially a list of (occurrences or IDs of) documents in which a given term (keyword) appears. Simply put, an index can be thought of as a very large collection of term lists, with a fast and efficient method for their lookup and retrieval.
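A toy sketch of that structure, with made-up term lists and document IDs, shows how lookup works and how a multi-term query can be answered by intersecting sorted term lists:

```python
# Hypothetical term lists: each maps a term to a sorted list of document IDs.
term_lists = {
    "real":   [1, 4, 7, 9],
    "time":   [2, 4, 7, 8, 9],
    "search": [4, 5, 7, 12],
}

def lookup(term):
    """Retrieve the term list (posting list) for a single keyword."""
    return term_lists.get(term, [])

def intersect(a, b):
    """Merge two sorted term lists to find documents containing both terms."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Documents matching the query "real time search":
result = intersect(intersect(lookup("real"), lookup("time")), lookup("search"))
print(result)   # [4, 7]
```

Keeping each list sorted is what makes the merge-style intersection cheap; keeping each list contiguous, as discussed below, is what makes it cheap to read off disk.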
The principal hardware constraint that drove index design is the difference in latency between disks (rotating magnetic media) and Random Access Memory (RAM). Modern disks have access latencies in the range of 10 ms, while RAM access, even in the case of processor cache “misses”, is on the order of 10 ns. That is a difference of 6 orders of magnitude!
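Spelling out the arithmetic behind that figure:

$$\frac{10\ \mathrm{ms}}{10\ \mathrm{ns}} \;=\; \frac{10^{-2}\ \mathrm{s}}{10^{-8}\ \mathrm{s}} \;=\; 10^{6}$$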
Note that historically there has been an implicit assumption that it is not possible to fit the entire index in RAM. This assumption is increasingly untrue and we will discuss its important ramifications in later sections.
The slow access time of disks, and the necessity of using them, resulted in architectures optimized to avoid disk head seeks by storing index data (term lists) in consecutive sequences of bytes that are as long as possible. Ideally, every term list would be stored in a single consecutive sequence on disk.
The easiest way to store data under such a scheme was to wait for a corpus of documents to grow large enough and then create an index in which all term lists are consecutive by construction. This process of index construction was very expensive in terms of time, processing, and data transfer, but since it could be done infrequently, the cost could be amortized. This is how batch indexing was born.
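A minimal sketch of sort-based batch construction (done in memory here for brevity) shows why the resulting term lists come out contiguous by construction. The function name and the tiny corpus are made up for illustration; real batch indexers do the same sort-and-group step out of core, over the whole crawl.

```python
from collections import defaultdict
from itertools import groupby

def batch_build_index(corpus):
    """Classic sort-based batch construction (simplified sketch): emit
    (term, doc_id) pairs for the whole corpus, sort them by term, then group
    them so every term list ends up as one contiguous run -- the layout that
    minimizes disk seeks when the index is written out."""
    pairs = []
    for doc_id, text in corpus:
        for term in set(text.lower().split()):
            pairs.append((term, doc_id))
    pairs.sort()  # one big sort over the full corpus: the expensive batch step
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        index[term] = [doc_id for _, doc_id in group]  # contiguous by construction
    return index

corpus = [(1, "the web in real time"), (2, "batch indexing of the web")]
print(batch_build_index(corpus)["web"])   # [1, 2]
```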
Note that the first search engines had update cycles on the order of months! Even Google, until about 2003, had update cycles on the order of weeks. Google has since improved the lag for some specific types of content, but the update cycle for the bulk of its index is still too long.
Another reason for the large cost of indexing was the advent of link-based ranking techniques such as Google's PageRank. Such techniques have significant processing costs since they require computation over the entire (sparse) Web link-graph.
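For a sense of what that computation looks like, here is a small power-iteration sketch of PageRank over a toy link graph. The damping factor and iteration count are conventional illustrative choices, and this is in no way a description of Google's production system, which runs over billions of pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a small link graph given as
    {page: [pages it links to]}. The cost comes from repeatedly sweeping
    the whole (sparse) link graph -- which is what makes link-based
    ranking expensive to recompute at web scale."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling node: spread evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_graph))   # "c" accumulates the highest rank
```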