Sunday, December 13, 2009

Human Premium

I have just read an excellent article on the rise of fast and cheap content by Michael Arrington over at TechCrunch. It starts with the current commotion over the decline of traditional media, an assessment I very much agree with. But then he goes further by pointing out that the next phase, already upon us, is the rise of vast amounts of largely worthless content that is produced very cheaply.

This point is very true IMHO, but it does not apply only to traditional media. At the heart of it is the fact that professionals and others have wised up to the basic weakness of automated content acquisition on the Internet, a.k.a. crawling: computers are very poor at judging basic things such as quality. One can measure many syntactic parameters, such as the appearance of keywords in headings and titles, the frequency with which documents appear, etc., but it is very easy to produce complete nonsense that satisfies all of them. A child can do an infinitely better job of judging content than even the best algorithms can dream of.

In addition to content, online commerce too has been overrun with SEO plays; indeed, SEO is the main leg of the consumer Internet in general these days.

So what is one to do in the face of this onslaught of garbage? I believe the result will be a (greatly) increased premium on human-based quality discovery. It will become harder and harder to sift through the reams of nonsense floating around, but the reward for those producing quality will be that much higher, as people stick with trusted producers.

For instance, there is an unbelievable amount of worthless financial commentary and outright disinformation floating around. It is indeed hard to penetrate through casual discovery, but with a little more digging true gems can be found. I rely heavily on a couple of financial blogs and a subscription site to get good information. Of course I watch the rest of it too, but with a big discount.

As a disclaimer, I founded and work for Wowd, a startup that leverages human-based discovery. I really believe that all these trends, including the one pointed out by Michael Arrington, are indicators of how the value of human input, coupled with smart automation, will greatly increase and become a key factor in the discovery of high quality content.

Wednesday, December 9, 2009

Google Launches Real-Time Search

Google has just announced a real-time search offering, which is quite an interesting development. I would like to make a few comments about some of the important issues in real-time search.

Ranking is a key issue in real-time search, the same way it is in general search. Some believe that one can produce a quality real-time search experience simply by filtering real-time information streams against keywords. The deficiencies of such approaches become clear very quickly: noise drowns out interesting results, and in many cases a slightly older result is much better than a slightly newer one and should be ranked much higher.

One of the key issues in ranking is how to include user 'editorial' input in a scalable way, meaning how to include information about what people are looking at and scale it across the entire Web. Wowd uses EdgeRank, a distributed link-based algorithm that merges user inputs with the power of link-based algorithms to provide across-the-web ranking. User input is critical in discovering new information as well as finding the highest quality results. The power of distributed link-based algorithms is needed to rank enormous datasets on the Web. EdgeRank accomplishes both.
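To make the idea concrete, here is a minimal sketch of how a link-based score might be blended with a user-attention signal. This is purely illustrative and is not the actual EdgeRank algorithm; the function name, the normalization and the mixing weight are assumptions of mine.

```python
# Illustrative sketch only, not Wowd's actual EdgeRank algorithm. It shows
# one simple way a link-based score could be blended with a user-attention
# signal; the normalization and the mixing weight `alpha` are assumptions.

def blended_score(link_score, attention_clicks, max_clicks, alpha=0.6):
    """Mix a link-based score in [0, 1] with a normalized attention score."""
    attention_score = attention_clicks / max_clicks if max_clicks else 0.0
    return alpha * link_score + (1 - alpha) * attention_score

# A page with a modest link score but heavy recent user attention:
print(blended_score(link_score=0.3, attention_clicks=800, max_clicks=1000))  # 0.5
```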

In addition, there is the issue of what to do with older results. They can be very important in cases where there are not enough new ones yet, or they may deserve to be displayed alongside newer results because of their quality. It becomes clear that older results should be retained for future use. In fact, a good real-time search engine that preserves older results starts converging toward a good general search engine. As a consequence, the resources required for real-time search are no smaller than those for general search; in fact, they are greater.

Wowd uses a completely scalable approach, where the resources in the system grow with the number of users. In contrast to centralized systems, where an increased number of users degrades performance, in our distributed approach additional users increase performance, both in terms of additional resources and an expanded attention frontier. By attention frontier we mean the collection of what real people find interesting on the Web at any given moment. This data is very important, as Twitter has clearly demonstrated. People worldwide are clicking on tens of billions of links daily, and what's in Twitter is only a fraction of a percent of that.

There is another important aspect of our distributed approach. One might say that no one can touch the power of Google's 50+ distributed data centers. But in a distributed system, the results are served from a neighborhood of nodes close to the requestor. As the network grows and the number of nodes increases, the results are served from immediate neighbors that are probably closer to the requesting user than any data center could be.

We are looking forward to Google's entrance into real-time search as it is a step in the right direction, toward improving an important aspect of search that has been missing. We would like to analyze their offering and see what additional sources are being indexed in real-time, not only Twitter, and what fraction of their entire index is updated in real-time.

Tuesday, October 20, 2009

Wowd Public Launch

It is a great, great pleasure for me to announce that Wowd has publicly launched today; you can go to wowd.com to download it and check it out. There are no passwords or restrictions of any kind.

Today marks the end of the first stage of a great journey for me, one that started more than three years ago, in the summer of 2006. In those early days, I was becoming fascinated with the power of distributed search. The vision was very clear, and many of the numbers behind the power of distributed systems (something I wrote about previously) were also clear.

We have come a long way since those days: we now have a technically solid product that works and shows the promise of the massively distributed approach. We also have a great team, with world-class technical and management foundations. And the support from our investors, DFJ and KPG Ventures, has been amazing.

Of course, this is just the first stage. The real journey is only beginning now: getting users onto Wowd and showing what it can do. Stay tuned, it should be great fun :)

Monday, September 28, 2009

Distributed Cloud

As promised, we have another white paper, this one on the topic of Distributed Clouds.

Clouds have become a fascinating topic. Of course, as with most very popular subjects, there is no clear definition of what they really are, and the concept is very broad. On the other hand, some things are starting to solidify.

Our take on clouds is a bit different. We focus on Distributed Clouds: more specifically, clouds spanning very wide area networks such as the Internet and composed of many independent users, in contrast to the prevalent view of many machines hosted and operated by a single entity.

The emergence of computing clouds has put a renewed emphasis on the issue of scale in computing. The enormous size of the Web, together with ever-more demanding requirements such as freshness (results in seconds, not weeks), means that massive resources are required to handle enormous datasets in a timely fashion. Datacenters are now considered to be the new units of computing power, e.g. Google's Warehouse-Scale Computer. The number of organizations able to deploy such resources is ever shrinking. Wowd aims to demonstrate that an even bigger scale of computing is possible: planetary-sized distributed clouds. Such clouds can be deployed by motivated collections of users, instead of a handful of gigantic organizations.

The definition of cloud is still not firmly established, so let us start with ours. We consider a cloud to be a collection of computing resources, where it is possible to allocate and provision additional resources in an incremental and seamless way, with no disruption to the deployed applications.

In this key respect, a cloud is not simply a group of servers co-located at some data center, since with such a collection it is neither simple nor obvious how to deploy additional machines for many tasks. Consider, for example, a server supporting a Relational Database Management System. A large increase in the number of records in the database cannot be handled simply by adding machines, since the underlying database needs to be partitioned so that all operations and queries perform satisfactorily across all of the machines. The solution typically requires significant re-engineering of the database application.
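A toy sketch illustrates the point: with naive modulo-based hash partitioning, simply growing a cluster from four machines to five reassigns the vast majority of records, which is exactly the kind of re-engineering burden described above. The partitioning scheme here is an assumption chosen for illustration, not a description of how any particular RDBMS shards data.

```python
# Toy illustration: with naive modulo hashing, growing a database cluster
# from 4 to 5 nodes reassigns most records, which is why "just add machines"
# does not work without re-partitioning. (The scheme is an assumption chosen
# for illustration, not how any particular RDBMS shards data.)

def node_for(key, num_nodes):
    return hash(key) % num_nodes

keys = [f"record-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
print(f"{moved / len(keys):.0%} of records would change nodes")  # roughly 80%
```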

Clouds are considered to be collections of machines where it is possible to dynamically scale and provision additional resources for the underlying application(s) with no change or disruption to their operation. Some, such as Google, consider the datacenters that underpin clouds to be a new form of "warehouse-scale computer" (source: "The Datacenter as a Computer", Google Inc., 2009). Clearly, the number of organizations capable of deploying such resources is small, and getting smaller, due to the prohibitive cost.

Consider, as an example, P2P networks. For the longest time, indeed since the very inception of P2P, these networks have been associated with a rather narrow scope of activities, principally the sharing of media content. The scale of computing occurring in such networks at every moment is truly staggering. However, there is a common (mis-)perception that such massive distributed systems are good only for a very limited set of activities, specifically the sharing of (often illicit) content. Our goal is to demonstrate that distributed networks can be the basis for tremendously powerful distributed clouds, quite literally of planetary scale. At that scale, the power provided by such a cloud actually dwarfs the power of even the biggest proprietary clouds.

I am posting the preceding part of our white paper as a preview; if you like it, you can read the rest at Wowd Distributed Cloud.

Wednesday, September 23, 2009

Attention Frontier Efficiency

I have discussed the power of the Attention Frontier in a previous post. It is interesting to compare some of the numbers and assumptions from that post to actual numbers from Twitter feeds, which are really the Twitter Attention Frontier (TAF).

First, there are, on average, about 2K URLs/min posted on Twitter in tweets containing (shortened) links. That results in about 3M links/day, before deduplication or any kind of quality, spam, or relevance analysis. This number should be considered in the context of the total number of Twitter users, which is not published, though we do know that there are now over 40 million unique visitors/month. Considering such a large number of users, it becomes pretty clear that the fraction of Twitter users who post links is actually quite small: an average Twitter user posts about 2-3 links/month.
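A quick back-of-envelope check of these numbers (all figures are the approximations quoted above):

```python
# Back-of-envelope check of the figures above (all numbers approximate).
urls_per_minute = 2_000
links_per_day = urls_per_minute * 60 * 24            # ~2.9M, i.e. roughly 3M/day
monthly_unique_visitors = 40_000_000                 # reported unique visitors/month
links_per_user_per_month = links_per_day * 30 / monthly_unique_visitors
print(round(links_per_day / 1e6, 1), "M links/day")            # 2.9 M links/day
print(round(links_per_user_per_month, 1), "links/user/month")  # ~2.2
```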

It is actually quite amazing that with such a low "yield" of links per user per month, the resulting feeds of updates are so informative. We at Wowd strongly believe that this is a very good indication of the power of the Attention Frontier concept.

Our vision of the Attention Frontier is designed to be (much) more efficient, requiring no explicit action by users. Instead, everything is derived transparently and implicitly from their natural browsing actions. The number of potential user actions (clicks) for an average user is in the thousands per month. Our system is designed to leverage these actions as efficiently as possible.

Another interesting question is the speed of propagation on the Attention Frontier. Twitter provides a great example, since the speed of propagation of the most popular trends is pretty amazing. This is actually not very surprising, since the whole point of real-time search is, by definition, the identification and analysis of fast-rising trends.

There is also another interesting question: how to rank new pages so they are not swamped by older pages that have had time to accumulate rank. This is a burning issue for many small publishers trying to break into the Google index while competing with already available material. There is even the well-known Google Sandbox, where new pages reportedly have to sit for a period of time (rumored to be months) no matter what, to prevent people from gaming the system by suddenly introducing enormous quantities of new content through some trick.

New pages can be assigned a meaningful rank in terms of their conventional relevance and quality scores, together with indirect factors such as the reputation of the author. Of course, the overwhelming ranking signal for new pages is freshness, i.e. how quickly they are discovered.
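As a rough illustration of how such signals could be combined, here is a sketch in which relevance, author reputation and an exponentially decaying freshness boost are mixed with assumed weights; neither the weights nor the half-life are taken from any real system.

```python
import time

# Illustrative sketch only: the signal names, weights and half-life below
# are assumptions for the sake of the example, not a description of any
# particular engine. A new page's score combines conventional relevance,
# author reputation and a freshness boost that decays exponentially.

def new_page_score(relevance, author_reputation, published_ts,
                   half_life_hours=6.0):
    age_hours = max(0.0, (time.time() - published_ts) / 3600.0)
    freshness = 0.5 ** (age_hours / half_life_hours)      # 1.0 when brand new
    return 0.5 * relevance + 0.2 * author_reputation + 0.3 * freshness

# A page published an hour ago with decent relevance and a reputable author:
print(new_page_score(relevance=0.7, author_reputation=0.8,
                     published_ts=time.time() - 3600))
```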

The key point is that real-time search is an instance of search, and as such it is of vital importance to pay close attention to the relevance of results, in addition to freshness. In this respect, the nascent real-time search industry has a ways to go, but the journey should be pretty exciting :)

Wednesday, August 26, 2009

Sliding Window

This is the third post in a series on Real-Time Search. The previous ones were an introduction to Real-Time Search and the RAM Monster. If you are also interested in a good article on some recent insights about search, take a look at a discussion with the SurfCanyon CEO.

One prominent feature of current real-time systems that has so far not received much attention is the sliding-window nature of their indexing. By this we mean that only results within a specified fixed time interval are indexed; as new results appear, the old ones are discarded.

We strongly believe that this approach is fundamentally flawed, as it is only able to return quality results within a short, fixed time window, and only for a limited fraction of content. There are many topics, events, discussions and quality results that fall outside of any fixed-size window. Indeed, we would argue that the span of most topics of interest falls outside any fixed-size window, and that as a result, the current approach to real-time search is essentially broken.

Preservation of older results is an essential feature of search engines in general, including real-time search. Of course, the ranking weight of older results decreases with time, but their accessibility is still significant for many queries where there is a lack of quality fresh results or, more specifically, a lack of fresh results from any fixed sliding window.
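To make the contrast explicit, here is a toy sketch of the two approaches: a fixed sliding window that evicts older results outright, versus keeping everything and merely down-weighting older results at ranking time. The window size and half-life are illustrative assumptions.

```python
import time
from collections import deque

# Toy contrast (illustrative assumptions only). A fixed sliding window
# discards anything older than `window_seconds`, so a high-quality older
# result can never be returned:

class SlidingWindowIndex:
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.items = deque()                     # (timestamp, doc), oldest first

    def add(self, doc):
        now = time.time()
        self.items.append((now, doc))
        while self.items and self.items[0][0] < now - self.window:
            self.items.popleft()                 # older results are lost for good

# The alternative argued for here: keep everything and only reduce the
# ranking weight of older results at query time.
def time_biased_score(base_score, timestamp, half_life_hours=24.0):
    age_hours = (time.time() - timestamp) / 3600.0
    return base_score * 0.5 ** (age_hours / half_life_hours)
```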

There has not been much discussion about the relationship between general and real-time search. We believe there is a very strong, even formal relationship between them. In simple terms, a real-time search engine that preserves older results, with proper time-based biasing in rankings would, in time, start to converge to a general search system. The resulting general search engine would be of the highest quality, because of its unprecedented level of freshness.

Sunday, August 16, 2009

RAM Monster

This is the second installment of a four-part post that I'm doing on the general topic of Real-Time Search. The first post was an introduction to Real-Time Search.

Efficient use of disk resources is of great importance in controlling the cost of indexing, yet the real-time search problem has almost the opposite requirements. It should not be surprising, then, that real-time search is very resource intensive and costly. The cost of indexing cannot be amortized at all, since the index is constantly changing and the system must support continuous updates.

The simplest solution is to avoid the use of disk altogether and store the entire index in RAM. This approach is severely limited by the amount of RAM that can be deployed at a reasonable cost. The problem is exacerbated by reliance on existing indexing systems (e.g. Relational Database Management Systems, Lucene) that have been optimized for disk and do not function optimally in pure-RAM environments. This is why we believe new index architectures have to be created to address the requirements of real-time search.

Real-time search is an instance of general search and as such is subject to the scale of the entire Web and the rate of change of content on the Web. This is why scalability is of paramount importance. Scalability will be covered in more detail in subsequent posts.

Tuesday, August 11, 2009

Real-Time Search

Recently, I authored a general interest whitepaper on the timely topic of “Real-Time Search: Discovering the Web in Real-Time”. This paper is the first of several – so stay tuned!

There is much to say in the area of ‘Real-Time Search’, so it makes sense to address some of the finer points I touch on in the paper. Today’s post will serve as the first of four installments, examining the four key areas covered in the paper itself, which will be made available in its entirety later this month. First, let’s summarize what we are talking about when we say “real-time search”.

The Web has become the principal medium through which users access an ever-growing wealth of information. The world is advancing at an accelerating rate, yet the contemporary means of search and discovery, the search engines, remain stagnant: they lack the ability to deliver the Web to users in a timely fashion. Real-time search has emerged as the next generation, reducing the lags and delays in accessing quality information from the entire Web.

The problem with most real-time search mechanisms has to do with comprehensiveness: in order to achieve fast performance, existing real-time search systems are limited to an extremely small set of web sites. The tension between real-time response and comprehensive coverage requires a new way of thinking about the entire architecture of search.

Real-Time Search has emerged as a fascinating new trend in search. Twitter search, as well as the recent redesign of the Twitter home page, has really put a spotlight on several key points:
  • The power of users and human-powered actions: even though Twitter now has a very large number of users (tens of millions), the fraction that include (shortened) links in their tweets is small. It is quite amazing that even such a fraction can create such a powerful platform, with interesting and useful properties such as a very small lag before new links show up in tweets, and search signals such as the number of retweets.

Modern search and its key component, indexing, have been very much influenced by historical factors such as the huge difference in latency between mass storage media (hard disks) and RAM. That difference drove many decisions in indexing architectures, especially around the speed of indexing.

The current real-time search activity is exposing most of those decisions as sub-optimal and driving the development of new architectures. We are very early in the real-time search game and there are really no new indexing architectures yet. The main technique seems to be to buy as much RAM as possible and try to fit as big an index into it as one can. That, on the surface, seems to work well, but it clearly does not scale. To deal with Web scale, new ground will have to be broken.

There are other pieces of the real-time search puzzle, such as the tradeoffs between freshness and coverage in centralized and distributed architectures. Obvious questions arise, such as the preservation of older results: to keep them or not?

Some more background on the topic, which puts a spotlight on the issue of ‘indexing’:

Indexing is a core part of modern search and Information Retrieval (IR) systems. Current approaches are still dominated by techniques developed a long time ago, for systems with hardware characteristics very different from today’s. Real-time search presents a completely different set of challenges requiring a completely different approach to indexing. The challenges are especially difficult because the scope is still enormous – the entire web! – and the user’s expectation is that information is indexed as fast as it appears on the web, with a lag measured in seconds or (fractions of) minutes.

The issue of scope, or the scale of the problem, is very important. It used to be the case that one could index a very large corpus of data with a significant lag and index-creation expense (e.g. traditional search engines), or index a small corpus of data really fast (e.g. online news), but not both.

Our goal is to index the data appearing on the entire web, in real-time, with no compromises in quality or user experience.

One of the principal data structures in an index is a termlist. This is essentially a list of (occurrences or IDs of) documents in which a given term (keyword) appears. Simply put, an index can be thought of as a very large collection of term lists, with a fast and efficient method for their lookup and retrieval.
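A toy example of a term-list-based index, purely for illustration:

```python
from collections import defaultdict

# Toy illustration of term lists: the index maps each term to the list of
# document IDs in which it appears, and a query is answered by looking up
# and intersecting the relevant lists.

index = defaultdict(list)                      # term -> list of doc IDs

def add_document(doc_id, text):
    for term in set(text.lower().split()):
        index[term].append(doc_id)

def lookup(*terms):
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

add_document(1, "real time search")
add_document(2, "distributed search engine")
print(lookup("search"))            # [1, 2]
print(lookup("real", "search"))    # [1]
```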

The principal hardware constraint that drove index design is the difference in latency between disks (rotating magnetic media) and Random Access Memory (RAM). Modern disks have access latencies on the order of 10 ms, while RAM access, even in the case of processor cache “misses”, is on the order of 10 ns. That’s a difference of 6 orders of magnitude!

Note that historically there has been an implicit assumption that it is not possible to fit the entire index in RAM. This assumption is increasingly untrue and we will discuss its important ramifications in later sections.

The slow access time of disks and the necessity of using them resulted in architectures optimized to avoid disk head seeks by storing index data (term lists) in consecutive byte sequences that are as long as possible. Ideally, every term list would be stored in a single consecutive sequence on disk.

The easiest way to store data under such a scheme was to wait for the corpus of documents to grow large enough and then create an index in which all term lists are consecutive by construction. This process of index construction was very expensive in terms of time, processing and data transfer, but since it could be done infrequently the cost could be amortized. This is how batch indexing was born.
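In sketch form, batch index construction looks roughly like this (a toy corpus and deliberately simplified tokenization):

```python
from itertools import groupby
from operator import itemgetter

# Toy sketch of batch index construction: gather all (term, doc_id) pairs
# from the corpus, sort them, and emit each term's postings as one
# consecutive run, the layout that minimizes disk seeks at query time.

corpus = {1: "real time search", 2: "distributed search", 3: "real time web"}

postings = sorted((term, doc_id)
                  for doc_id, text in corpus.items()
                  for term in set(text.split()))

index = {term: [doc_id for _, doc_id in group]
         for term, group in groupby(postings, key=itemgetter(0))}

print(index["search"])   # [1, 2], contiguous by construction
```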

Note that the first search engines had update cycles on the order of months! Even Google, until about 2003, had update cycles on the order of weeks. They have improved the lag for some specific types of content, but the update cycle for the bulk of Google’s index is still too long.

Another reason for the large cost of indexing was the advent of link-based ranking techniques such as Google's PageRank. Such techniques have significant processing costs since they require computation over the entire (sparse) Web link-graph.
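For reference, here is a minimal power-iteration sketch of a link-based rank in the spirit of PageRank; it makes the cost argument visible, since every iteration touches every link in the graph. The damping factor and iteration count are conventional choices, not tuned values.

```python
# Minimal power-iteration sketch of a link-based rank in the PageRank family,
# over a tiny toy graph. Every iteration walks the whole link graph.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
damping, n = 0.85, len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):                              # iterate toward convergence
    new_rank = {page: (1 - damping) / n for page in links}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print(rank)
```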

Tuesday, May 26, 2009

Attention Frontier

The search industry has been relying on crawling since its inception. The crawlers themselves have hardly changed since the very beginning, which is really amazing. One would surely expect a bit of progress since, say, 1994?

In essence, crawlers are really simple programs, endlessly repeating the same sequence of simple tasks: fetch web pages, extract all the links on them, eliminate duplicates, and put the new links in a queue to be crawled; repeat. There are additional complications, such as masking the latency of fetching individual pages by running many crawler instances in parallel, and making sure robots.txt constraints are satisfied across all instances at all times. But that is for the most part it: if you do all of these things well, you have a world-class crawler.
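In sketch form, the basic crawl loop looks like this; fetch_page and extract_links are placeholders for the network and parsing layers, and politeness (robots.txt), parallel fetching and error handling are deliberately left out.

```python
from collections import deque

# Stripped-down sketch of the crawl loop described above. fetch_page and
# extract_links are placeholders for the network and parsing layers;
# politeness (robots.txt), parallel fetching and error handling are omitted.

def crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    seen = set(seed_urls)          # eliminate duplicates
    queue = deque(seed_urls)       # links waiting to be crawled
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        page = fetch_page(url)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```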

So what, one might say. Crawling has been working just fine, why rock the boat?

Consider modern-day spammers, black-hat SEOs and the like; what is their most effective weapon? It is the capability to add, in an uncontrolled fashion, huge amounts of content (hundreds of millions, even billions, of pages) aimed at fooling ranking algorithms and artificially boosting their ranking on Google et al. Because the search industry likes to keep up its know-it-all aura, creating the impression that such actions do not matter and that it can handle them no matter what, its adversaries take this to the bank, creating ever bigger piles of useless stuff to fool it.

The Attention Frontier is the notion that all the stuff a group of well-meaning users looks at matters a lot. If such a group is large enough and active long enough, then the corpus of pages they liked and looked at is very valuable to others: it is timely, relevant and high quality, so a search through it yields very interesting results.

You may say that such an approach would have no chance of covering huge portions of the long tail but that is not really the case.
  • It would cover relevant and interesting pieces of the long tail users looked at, which are of great interest anyway.
  • The notion that absolutely everything is and should be crawled and indexed is very wrong – search engines know this very well and already index only a very small (and declining) fraction of the links which are available to them. This is an interesting subject in itself and I will discuss it in more detail in a future post.
  • There are ways of expanding the Attention Frontier in an automated or semi-automated way. One could say this amounts to (distributed) crawling, and sure, as long as we get good results out of it instead of piles of junk. If one is to do it, the key is to do it carefully, in a measured fashion.

To get an idea of how big the Attention Frontier could be, consider an average Internet user. It is not inconceivable that they click on hundreds of links daily and tens of thousands annually.

Let us look at the numbers for users more closely. There is an important distinction between the number of simultaneously active users at any given time and the number of users over a period, e.g. a day. The latter is important because all users are capable of contributing attention data, not only those currently active. The two numbers differ because users join and sign off throughout any period, so the number of simultaneously active users will be smaller.

The concept of counting currently active users was initially popularized by P2P companies, most notably Skype; more recently it has been pushed by Facebook and its apps as a better measure of app popularity than the number of downloads.

Consider a group of, say, 100,000 simultaneously active users; it would not be unreasonable to assume that the number of daily users would be 10x that, resulting in 1M. Further assuming an average contribution of 100 clicks/day, we get 100M clicks/day, or 36.5B clicks/year.
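Spelled out as a calculation (using the assumed ratios above):

```python
# The back-of-envelope arithmetic above, spelled out.
simultaneously_active = 100_000
daily_users = simultaneously_active * 10        # assumed 10x ratio
clicks_per_user_per_day = 100
clicks_per_day = daily_users * clicks_per_user_per_day
print(f"{clicks_per_day:,} clicks/day")           # 100,000,000
print(f"{clicks_per_day * 365:,} clicks/year")    # 36,500,000,000
```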

A corpus like that would be a great core for a pretty good search engine. Of course, it would be great for lots of other stuff too, including discovery, recommendation, ranking, queries and lots of other goodies. A corpus of 1M+ users would generate hundreds of millions of clicks daily and would reach into the billions in a matter of days and weeks.

There is the issue of duplicates, i.e. the most popular links will be clicked on by many people, reducing the number of new links found by the group. But far from being a drawback, this is actually a great advantage, since the duplicate count is a popularity measure and makes a great ranking signal.
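A tiny sketch of the idea: de-duplicating clicks while keeping the counts yields both the unique frontier and a popularity signal for free.

```python
from collections import Counter

# Tiny sketch: counting duplicate clicks yields both the de-duplicated set
# of links and a popularity signal that can feed directly into ranking.

clicks = ["example.com/a", "example.com/b", "example.com/a",
          "example.com/a", "example.com/c", "example.com/b"]

popularity = Counter(clicks)
unique_links = list(popularity)                 # the de-duplicated frontier
print(popularity.most_common())                 # [('example.com/a', 3), ...]
```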

The question is whether the tail of the distribution of links in the Attention Frontier would be long enough, and we think it definitely would be. As I said, there are ways of expanding it in a measured fashion, but I do not believe (semi-)automated expansion is critical.

The data in the Attention Frontier, user attention data, is extremely valuable. The search industry knows this and has been trying to collect and use this data for some time now. But it has not been transparent about it, instead adopting an attitude of "just give us your data, we will store it for you, don't worry about it".

Of course, privacy is at the center of the Attention Frontier. Users should be free to choose what to do with their attention data. We want to tell them openly about the value of the data and what can be done with it, and invite them to participate and share what they feel comfortable sharing. The entire process should be much more transparent and open than it is now.

The Attention Frontier is at the core of what we have been doing in search. As our launch approaches, you will be able to see how our vision of the Attention Frontier and scalable distributed search, discovery and recommendation works.

Discovery and recommendation are intrinsically related to search. Crawling is really an automated process of search discovery. Recommendation is linked to ranking and quality. If one has a sense of what is good, the jump to recommendation is not that big. We will talk more about discovery and recommendation soon, as they will be part of our offering from the start.