Tuesday, April 13, 2010

Latest Google real-time search updates


Google is continuing to make updates to its real-time search offering. The latest change, noticed by Steve Rubel: when users click on 'Updates' or 'Latest' for the most up-to-date results, in addition to the stream of updates from Twitter, FriendFeed and public Facebook streams, Google now also shows the most-cited 'top links' on the right-hand side.

This latest change, IMHO, raises several further issues with their real-time search offering:
  • There is one more element to add to the confusion in the format of the results - in addition to a dynamic stream of differently formatted updates on top of regular results, we now have yet another differently formatted set of results. What is the relationship between 'top links' and the results below them? Should regular results subsume the top links? Or are top links somehow inferior with respect to the ranking of the general results? Are they supposed to be just fresher?
  • How complete, authoritative and up-to-date are the updates, as well as the associated top links? After all, they come from a limited set of sources, and it is not clear whether all updates from Twitter, FriendFeed and Facebook are even included. What about the rest of the Web?
  • Why do we have all these formats, top links and dynamic fresh updates in the first place? Why not make sure that the regular search results are always up-to-date and contain the most authoritative and relevant information, including top links from these sources?
The main point I am trying to raise here is that these changes and tweaks to Google's real-time search results increasingly point to limitations of the underlying approach rather than to the best possible results.

An obvious, simple and elegant solution is to provide only regular search results that are always up-to-date and automatically include top links, with no special distinction. After all, this has been the spirit of Google search from the very beginning: keep it really simple and provide the best answers.

I believe the answer to these questions lies in the limitations of Google's methods of content acquisition (crawling), as well as the difficulty of updating their entire index at the speed required for real-time search - seconds and minutes.

Twitter, FriendFeed, Facebook and pretty much all of the social web clearly show the limitations of crawling, which is inherently slow and incomplete. To get all this content, it is necessary to rely on external feeds, which are all human powered. Google has chosen to display these results in a separate window that, IMHO, very much detracts from the simplicity and ease of use of their standard interface.

In addition, the fact that only a fraction of the results is dynamic and up-to-date naturally raises the question of how up-to-date and fresh all the other general results are.

I believe that their current interface is only an interim, and not very successful, attempt at unifying the social and real-time aspects of search with general search. That unification is the ultimate goal for them, as well as for the rest of the real-time search industry.

Monday, March 22, 2010

Real-Time Search APIs

Search APIs have always been of great interest to the developer community, as well as to other parties such as researchers. In addition, the huge resources required for modern search have given rise to the perception that big-time search is the province of only a few. Google used to have a rather limited search API, which has, sadly, been discontinued. Yahoo's BOSS effort has recently been the leader in the field of open search APIs: its rate limits for developers are rather generous, but still present, and Yahoo clearly states that it plans to introduce a fee-based structure for use of the BOSS APIs. Another example of a widely known search API is Twitter's.

It is interesting that rate limiting has featured so prominently in all search APIs to date. The main reason for this, we believe, is the perception that APIs are a cost item almost by definition; the rate limits are there to put some kind of bound on that cost. In the case of Twitter, the issue is further exacerbated by the real-time nature of the search API, which places additional constraints on computation and delivery. In simpler terms, it is considerably more expensive to operate a real-time search API than a more conventional one.
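To make the cost argument concrete, here is a minimal sketch of the kind of token-bucket rate limiter that typically sits in front of a search API. It is purely illustrative - the 150 requests/hour figure and the burst size are made-up parameters, not Twitter's or Yahoo's actual limits:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: allow bursts up to `capacity`
    requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# e.g. roughly 150 requests/hour per API key, with small bursts allowed
limiter = TokenBucket(rate=150 / 3600.0, capacity=10)
if limiter.allow():
    pass  # serve the search request
else:
    pass  # reject with a "rate limit exceeded" response
```

Every request that is let through costs real computation, which is exactly why these buckets exist in the first place.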

Twitter has been the leader in this field, and they have been quite accommodating with developers and other parties using their service. However, even they have clearly defined rate limits; for example, the number of parties with access to the full Twitter API (the so-called "firehose") is very limited.

A natural question arises: does it make sense to even think of an open search API with only a few, or no, resource limitations? What would such a service look like, and who would provide the resources required to operate it?

As great fans and practitioners of the art and science of search, we have been asking these questions ourselves, and we have come to the conclusion that it would be possible to create such an open RTS API, with very few resource limitations, by leveraging the unique nature of our P2P-based distributed cloud. We would like to engage the developer community, as well as other interested parties, in a discussion to hear what level of interest there would be in such a project, and which features and modes of delivery and operation would be most useful.

The Wowd distributed network is rather unique with respect to resource contribution, because users themselves can contribute resources. Our main idea is that by contributing to our distributed cloud, users should be able to access the API in proportion to their contribution. As an analogy, consider BitTorrent and its policies for controlling download speeds: it is well known that to greatly speed up download rates in BitTorrent, one needs to contribute back at a faster rate by increasing upload speed.

The nature of the Wowd system is very different from BitTorrent: Wowd is about discovering, sharing and searching new content across the entire Web, whereas BitTorrent is about sharing media content. However, in both cases the principal reason for using user contributions as the reciprocal measure for API usage is to ensure full scalability and availability of API resources, regardless of the level of load placed on them.

There are several ways such an API could be operated, and we are eager to hear your feedback! For instance, the level of contribution could be measured by the bandwidth, RAM, disk, etc. that are shared, as the sketch below illustrates. In simple terms, if one runs a bigger machine with more bandwidth and other resources, they should, by that very fact, be able to access the API at a higher rate.
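As a purely illustrative sketch of this proportional-access idea, here is one way contributed resources could be mapped to an hourly API quota. The weights, the baseline quota and the resource names are assumptions made up for the example, not Wowd's actual accounting:

```python
# Hypothetical weights turning shared resources into an API quota.
# All numbers below are illustrative assumptions, not Wowd's real formula.
WEIGHTS = {
    "bandwidth_mbps": 40.0,   # queries/hour granted per Mbps of shared bandwidth
    "ram_gb": 25.0,           # queries/hour granted per GB of shared RAM
    "disk_gb": 1.0,           # queries/hour granted per GB of shared disk
}
BASE_QUOTA = 100.0            # small baseline every participant gets

def api_quota(contribution: dict) -> float:
    """Return an hourly query quota proportional to contributed resources."""
    earned = sum(WEIGHTS[k] * contribution.get(k, 0.0) for k in WEIGHTS)
    return BASE_QUOTA + earned

# A node sharing 10 Mbps, 2 GB RAM and 50 GB disk would earn:
# 100 + 10*40 + 2*25 + 50*1 = 600 queries/hour
print(api_quota({"bandwidth_mbps": 10, "ram_gb": 2, "disk_gb": 50}))
```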

In addition to resource contributions, there is also the natural question of contributing user attention data to the Wowd Attention Frontier. Of course, a single user contributes a given amount of attention data no matter what kind of machine they are using, but we are particularly interested in cases where that user gets other users to contribute attention data too. Such a user should clearly be able to access the API at (much) higher rates. There are various ways such a mechanism could be implemented, e.g. by introducing "attention data keys" which would be used for labeling groups of users introduced by a single source.
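To illustrate how "attention data keys" might work, here is a minimal sketch in which a source is issued a key, distributes it to the users it brings in, and gets credited for the attention data contributed under that key. The identifiers and the crediting scheme are hypothetical; this is just one possible shape for the mechanism described above:

```python
import secrets
from collections import defaultdict

# Illustrative only: "attention data keys" label groups of users introduced
# by a single source, so that the source can be credited for their data.
referral_keys = {}                      # key -> source id that issued it
attention_credit = defaultdict(int)     # source id -> credited attention events

def issue_attention_key(source_id: str) -> str:
    """Give a source a key it can hand to the users it brings in."""
    key = secrets.token_urlsafe(16)
    referral_keys[key] = source_id
    return key

def record_attention_event(key: str, clicks: int) -> None:
    """Credit attention data contributed under a key back to its source."""
    source = referral_keys.get(key)
    if source is not None:
        attention_credit[source] += clicks

key = issue_attention_key("dev-42")     # hypothetical source id
record_attention_event(key, clicks=37)  # a referred user's contribution
# attention_credit["dev-42"] could then feed into that source's API rate limit.
```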

Furthermore, beyond increased API usage limits, there are additional benefits that could be provided, for instance by processing attention data to obtain affinity-group-specific rankings, tags, segments of the attention frontier, etc. These are just initial ideas; many more things could be devised on top of such a contribution scheme.

We are very eager to hear back from developers, potential partners and other interested parties about their level of interest, along with feedback and comments on the ideas above. We feel such a scheme would be a novel way of creating an almost limit-free search API. We would love to hear if you feel the same!

Sunday, December 13, 2009

Human Premium

I have just read an excellent article on the rise of fast and cheap content by Michael Arrington over at TechCrunch. It starts with the current commotion over the decline of traditional media, an observation I very much agree with. But then he goes further, pointing out that the next phase, already upon us, is the rise of a vast amount of largely worthless content that is produced very cheaply.

This point is very true, IMHO, but it does not apply only to traditional media. At the heart of it is the fact that professionals and others have wised up to the basic weakness of automated means of acquisition on the Internet, aka crawling - that weakness being that computers are very poor at understanding basic things such as quality. One can have many syntactic signals, such as the appearance of keywords in headings and titles, the frequency of document appearance, etc., but it is very easy to produce complete nonsense that satisfies all of them. A child can do an infinitely better job of judging content than even the best algorithms can dream of.

In addition to content, online commerce too has been overrun with SEO plays; indeed, SEO is one of the main legs of the consumer Internet these days.

So what is one to do in the face of this onslaught of vast amounts of garbage floating around? I believe the result will be a (greatly) increased premium on the value of human-based quality discovery. Indeed, it will become harder and harder to sift through the reams of nonsense floating around, but the reward for those producing quality will be that much higher, as people stick with trusted producers.

For instance, there is an unbelievable amount of worthless financial commentary and outright disinformation floating around. It is indeed hard to penetrate through casual discovery, but if one does a little more digging, true gems can be found. I rely heavily on a couple of financial blogs and a subscription site to get good information. Of course I watch the rest of it too, but with a big discount.

As a disclaimer, I founded, and work for, a startup, Wowd, that leverages human-based discovery. I really believe that all these trends, including the one pointed out by Michael Arrington, are indicators of how the value of human input, coupled with smart automation, will greatly increase and become a key factor in the discovery of high-quality content.

Wednesday, December 9, 2009

Google Launches Real-Time Search

Google has just announced a real-time search offering which is a quite interesting development. I would like to make a few comments about some of the important issues in real-time search.

Ranking is a key issue in real-time search, just as it is in general search. Some believe that one can produce a quality real-time search experience simply by filtering real-time information streams against keywords. The deficiencies of such approaches become clear very quickly: there is so much noise that interesting results get drowned out, and there are many cases in which a slightly older result is much better than a slightly newer one and should be ranked much higher.

One of the key issues in ranking is how to include user 'editorial' input in a scalable way, meaning how to incorporate information about what people are looking at and scale it across the entire Web. Wowd uses EdgeRank, a distributed link-based algorithm that merges user input with the power of link-based algorithms to provide across-the-Web ranking. User input is critical for discovering new information as well as for finding the highest-quality results. The power of distributed link-based algorithms is needed to rank the enormous datasets on the Web. EdgeRank accomplishes both.
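As a rough illustration of the general idea of merging a link-based score with a user-attention signal, here is a small sketch. To be clear, this is not the actual EdgeRank algorithm, which is distributed and far more involved; the blending weight, the log damping and the inputs are assumptions made for the example:

```python
import math

def blended_score(link_score: float, clicks: float, alpha: float = 0.5) -> float:
    """Illustrative blend of a link-based score (e.g. a PageRank-style value)
    with a user-attention signal (e.g. recent clicks). NOT the real EdgeRank
    formula, just a sketch of combining the two kinds of evidence."""
    attention = math.log1p(clicks)            # damp heavy-tailed click counts
    return alpha * link_score + (1 - alpha) * attention

# A weakly-linked page getting heavy real-time attention...
print(blended_score(link_score=2.0, clicks=500))   # ~4.1
# ...can outrank a well-linked page nobody is clicking on right now.
print(blended_score(link_score=5.0, clicks=0))     # 2.5
```

The point of the sketch is simply that fresh user attention and accumulated link evidence both need a seat at the table.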

In addition, there is the issue of what to do with older results. They can be very important in cases where there are not enough new results yet, or they may deserve to be displayed alongside newer results because of their quality. It becomes clear that older results should be retained for future use. In fact, a good real-time search engine that preserves older results starts converging toward a good general search engine. As a consequence, the resources required for real-time search are not smaller than for general search; in fact, they are greater.

Wowd uses a completely scalable approach, where the resources in the system grow with the number of users. In contrast to centralized systems, where an increased number of users degrades performance, in our distributed approach additional users increase performance, both in terms of additional resources and an expanded attention frontier. By attention frontier we mean a collection of what real people find interesting on the Web at any given moment. This data is very important, as Twitter has clearly demonstrated. People worldwide click on tens of billions of links daily, and what's in Twitter is only a fraction of a percent of that.

There is another important aspect of our distributed approach. One might say that no one can touch the power of Google's 50+ distributed data centers. But in a distributed system, results are served from a neighborhood of nodes close to the requester. As the network grows and the number of nodes increases, results are served from immediate neighbors that are probably closer to the requesting user than any data center could be.
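A tiny sketch of what "serving from the neighborhood" means in practice: given measured latencies to nearby peers, a query simply goes to the closest ones. The peer names and numbers below are invented for the example:

```python
# Hypothetical round-trip latencies (ms) from the requesting node to its peers.
peers = {
    "peer-a": 12.0,
    "peer-b": 48.0,
    "peer-c": 7.5,
    "peer-d": 95.0,
}

def pick_result_servers(latencies: dict, k: int = 2) -> list:
    """Pick the k peers with the lowest measured latency to serve results."""
    return sorted(latencies, key=latencies.get)[:k]

print(pick_result_servers(peers))   # ['peer-c', 'peer-a']
```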

We are looking forward to Google's entrance into real-time search, as it is a step in the right direction, toward improving an important aspect of search that has been missing. We would like to analyze their offering and see which additional sources, beyond Twitter, are being indexed in real time, and what fraction of their entire index is updated in real time.

Tuesday, October 20, 2009

Wowd Public Launch

It is a great, great pleasure for me to announce that Wowd has publicly launched today; you can go to wowd.com to download it and check it out. There are no passwords or restrictions of any kind.

Today marks the end of the first stage of a great journey for me, one that started more than three years ago, in the summer of 2006. In those early days, I was becoming fascinated with the power of distributed search. The vision was very clear, and many of the numbers behind the power of distributed systems (something I wrote about previously) were also clear.

We have come a long way since those days. We now have a technically solid product that works and shows the promise of the massively distributed approach. We also have a great team, with world-class technical and management foundations. And the support from our investors, DFJ and KPG Ventures, has been amazing.

Of course, this is just the first stage. The real journey is only beginning now: getting users onto Wowd and showing what it can do. Stay tuned, it should be great fun :)

Monday, September 28, 2009

Distributed Cloud

As promised, we have another white paper, this time on the topic of Distributed Clouds.

Clouds have become a fascinating topic. Of course, as with most very popular subjects, there is no clear definition of what they really are, and the concept is very broad. On the other hand, some things are starting to solidify.

Our take on clouds is a bit different. We focus on Distributed Clouds - more specifically, clouds that span very wide area networks such as the Internet and are comprised of many independent users, in contrast to the prevalent view of clouds as many machines hosted and operated by a single entity.

The emergence of computing clouds has put a renewed emphasis on the issue of scale in computing. The enormous size of the Web, together with ever-more demanding requirements such as freshness (results in seconds, not weeks), means that massive resources are required to handle enormous datasets in a timely fashion. Datacenters are now considered to be the new units of computing power, e.g. Google's warehouse-scale computer. The number of organizations able to deploy such resources is ever shrinking. Wowd aims to demonstrate that there is an even bigger scale of computing than yet imagined: planetary-sized distributed clouds. Such clouds can be deployed by motivated collections of users, instead of a handful of gigantic organizations.

The definition of cloud is still not firmly established, so let us start with ours. We consider a cloud to be a collection of computing resources, where it is possible to allocate and provision additional resources in an incremental and seamless way, with no disruption to the deployed applications.

In this key respect, a cloud is not simply a group of servers co-located at some data center, since with such a collection it is not simple, nor even clear, how to deploy additional machines for many tasks. Consider, for example, a server supporting a relational database management system. A large increase in the number of records cannot be handled simply by adding machines, since the underlying database needs to be partitioned so that all operations and queries perform satisfactorily across all of the machines. The solution in this situation requires significant re-engineering of the database application, as the small sketch below illustrates.
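A small sketch of why simply adding machines does not help here: with naive hash-based partitioning, growing a cluster from 4 to 5 machines forces the vast majority of records to move to a different machine, which is exactly the kind of repartitioning work referred to above. The record keys below are made up for the example:

```python
import hashlib

def shard(key: str, n_shards: int) -> int:
    """Naive hash-partitioning: record `key` lives on shard hash(key) mod n."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % n_shards

keys = [f"record-{i}" for i in range(10_000)]

# Going from 4 to 5 machines with modulo partitioning moves most records,
# so the data layer has to be re-engineered rather than just extended.
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of records change shards")   # roughly 80%
```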

Clouds, by contrast, are collections of machines where it is possible to dynamically scale and provision additional resources for the underlying application(s) with no change to, or disruption of, their operation. Some, such as Google, consider the datacenters that form the basis of clouds to be a new form of "warehouse-scale computer" (source: "The Datacenter as a Computer", Google Inc., 2009). Clearly, the number of organizations capable of deploying such resources is small, and getting smaller, due to the prohibitive cost.

Consider, as an example, P2P networks. For the longest time - indeed, since the very inception of P2P - these networks have been associated with a rather narrow scope of activities, principally the sharing of media content. The scale of computing occurring in such networks at every moment is truly staggering. However, there is a common (mis)perception that such massive distributed systems are good only for a very limited set of activities, specifically the sharing of (often illicit) content. Our goal is to demonstrate that distributed networks can be the basis for tremendously powerful distributed clouds, quite literally of planetary scale. At that scale, the power provided by such a cloud actually dwarfs the power of even the biggest proprietary clouds.

I am posting the preceding part of our white paper as a preview; if you like it, you can read the rest at Wowd Distributed Cloud.

Wednesday, September 23, 2009

Attention Frontier Efficiency

I discussed the power of the Attention Frontier in a previous post. It is interesting to compare some of the numbers and assumptions from that post to actual numbers from Twitter feeds, which are really the Twitter Attention Frontier (TAF).

First, there are, on average, about 2K URLs/min posted on Twitter in posts containing (shortened) links. That results in about 3M links/day, before deduplication or any kind of quality, spam or relevance analysis. This number should be considered in the context of the total number of Twitter users, which is not published, though we do know that there are now over 40 million unique visitors/month. Given such a large number of users, it becomes pretty clear that the fraction of Twitter users who post links is actually quite small - an average Twitter user posts about 2-3 links/month.
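A quick back-of-the-envelope check of these figures, using only the numbers quoted above (2K URLs/min and 40M unique visitors/month):

```python
# Back-of-the-envelope check of the Twitter numbers quoted above.
urls_per_minute = 2_000
links_per_day = urls_per_minute * 60 * 24            # = 2,880,000, i.e. ~3M/day
monthly_visitors = 40_000_000
links_per_user_per_month = links_per_day * 30 / monthly_visitors
print(links_per_day)                          # 2880000
print(round(links_per_user_per_month, 1))     # ~2.2 links/month per user
```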

It is actually quite amazing that, with such a low "yield" of links per user per month, the resulting feeds of updates are so informative. We at Wowd strongly believe this is a very good indication of the power of the Attention Frontier concept.

Our vision of the Attention Frontier is designed to be (much) more efficient, requiring no explicit action by users. Instead, everything is derived transparently and implicitly from their natural browsing actions. The number of potential user actions (clicks) for an average user is in the thousands per month, and our system is designed to leverage these actions as efficiently as possible.

Another interesting question is the speed of propagation on the Attention Frontier. Twitter provides a great example, since the speed with which the most popular trends propagate is pretty amazing. This is actually not very surprising, since the whole point of real-time search is, by definition, the identification and analysis of fast-rising trends.

There is also another interesting question: how to rank new pages so they are not swamped by older pages that have had time to accumulate rank. This is a burning issue for many small publishers - how to break into the Google index when competing with already available material. There is even the well-known Google Sandbox, where new pages have to sit for a period of time (rumored to be months) no matter what, to prevent people from gaming the system by suddenly introducing enormous quantities of new content through some trick.

New pages can be assigned a meaningful rank in terms of their conventional relevance and quality scores, together with other indirect factors such as the reputation of the author. Of course, the overwhelming ranking signal for new pages is freshness, i.e. how quickly they are discovered; a small sketch of how these signals might combine follows below.
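Here is a minimal sketch of one way such signals could be combined: a conventional base score from relevance, quality and author reputation, multiplied by a freshness factor that decays with the age of the page. The weights and the one-hour half-life are assumptions for illustration, not a description of any production ranker (ours included):

```python
import math

def new_page_score(relevance: float, quality: float, author_rep: float,
                   age_minutes: float, half_life_minutes: float = 60.0) -> float:
    """Illustrative scoring for a freshly discovered page: conventional
    relevance/quality plus author reputation, boosted by a freshness factor
    that decays with age. All weights are assumptions for the sketch."""
    base = 0.5 * relevance + 0.3 * quality + 0.2 * author_rep
    freshness = math.exp(-math.log(2) * age_minutes / half_life_minutes)
    return base * (1.0 + freshness)   # fresh pages get up to a 2x boost

# The same page from a reputable author, five minutes old vs. a day old:
print(new_page_score(0.8, 0.7, 0.9, age_minutes=5))      # ~1.54
print(new_page_score(0.8, 0.7, 0.9, age_minutes=1440))   # ~0.79
```

The decay keeps brand-new pages visible without letting freshness alone override relevance and quality forever.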

The key point is that real-time search is an instance of search, and as such it is vitally important to pay close attention to the relevance of results, in addition to freshness. In this respect, the nascent real-time search industry still has a ways to go, but the journey should be pretty exciting :)