This is the second installment of a four-part post that I'm doing on the general topic of real-time search. The first post was an introduction to real-time search.
Efficient use of disk resources is central to controlling the cost of indexing, yet real-time search has almost the opposite requirements. It should not be surprising, then, that real-time search is very resource-intensive and costly. The cost of indexing cannot be amortized at all, since the index is constantly changing and the system must support continuous updates.
The simplest solution is to avoid disk altogether and store the entire index in RAM. This approach is severely limited by the amount of RAM that can be deployed at a reasonable cost. The problem is exacerbated by reliance on existing indexing systems (e.g., Relational Database Management Systems, Lucene) that have been optimized for disk-based operation and do not function optimally in pure RAM environments. This is why we believe that new index architectures have to be created to address the requirements of real-time search.
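To make the RAM-only approach concrete, here is a minimal sketch of an inverted index held entirely in memory. It is a toy illustration, not the new architecture argued for above and not how any particular product works; the class name `InMemoryIndex` and its methods are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index held entirely in RAM (hypothetical, illustrative only)."""

    def __init__(self):
        # term -> set of ids of the documents that contain the term
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # The document is searchable as soon as this call returns:
        # there is no batch build and no disk flush to wait for.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return the ids of documents containing every query term (AND semantics).
        terms = query.lower().split()
        if not terms:
            return set()
        # .get() avoids creating empty entries for unknown terms.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InMemoryIndex()
index.add(1, "real-time search is costly")
index.add(2, "disk based search index")
hits = index.search("search")  # both documents match
```

Every posting lives in RAM, so the size of the collection that can be served is bounded by the memory you can afford, which is exactly the cost limitation described above.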
Real-time search is an instance of general search and as such is subject to the scale of the entire Web and the rate of change of content on the Web. This is why scalability is of paramount importance. Scalability will be covered in more detail in subsequent posts.