Tuesday, April 13, 2010

Latest Google real-time search updates


Google is continuing to make updates to its real-time search offering, the latest one noticed by Steve Rubel: when users click on 'updates' or 'latest' for the most up-to-date results, in addition to the stream of updates from Twitter, FriendFeed and public Facebook streams, Google now also shows the most-cited 'top links' on the right-hand side.

This latest change IMHO raises several further issues with their real-time search offering:
  • There is one more element adding to the confusion in the format of the results: on top of a dynamic stream of differently formatted updates above the regular results, we now have yet another differently formatted set of results. What is the relationship between 'top links' and the results below them? Should regular results subsume top links? Are top links somehow inferior with respect to the ranking of the general results? Are they supposed to be just fresher?
  • How complete, authoritative and up-to-date are the updates, as well as the associated top links? After all, they come from a limited set of sources, and it is unclear whether all updates from Twitter, FriendFeed and Facebook are even included. What about the rest of the Web?
  • Why do we have all these formats, top links and dynamic fresh updates in the first place? Why not make sure that regular search results are always up-to-date and contain the most authoritative and relevant information, including top links from these sources?
The main point I am trying to make here is that these changes and tweaks to Google real-time search results increasingly point to the limitations of the underlying system rather than to the best possible results.

An obvious, simple and elegant solution is to provide only regular search results that are always up-to-date and automatically include top links, with no special distinction. After all, this has been the spirit of Google search from the very beginning: keep it really simple and provide the best answers.

I believe the answer to these questions lies in the limitations of Google's methods of content acquisition (crawling), as well as in the difficulty of updating their entire index at the speed real-time search requires - seconds and minutes.

Twitter, FriendFeed, Facebook and pretty much all of the social web clearly expose the limitations of crawling, which is inherently slow and incomplete. To get all of this content, it is necessary to rely on external feeds, which are all human-powered. Google has chosen to display these results in a separate window that IMHO very much detracts from the simplicity and ease of use of their standard interface.

In addition, the fact that only a fraction of these results is dynamic and up-to-date naturally raises the question of how fresh all the other general results are.

I believe their current interface is only an interim, and not very successful, attempt at unifying the social and real-time aspects into general search. That unification is the ultimate goal for them, as well as for the rest of the real-time search industry.

Monday, March 22, 2010

Real-Time Search APIs

Search APIs have always been of great interest to the developer community, as well as to other parties such as researchers. In addition, the huge resources required for modern search have given rise to the perception that big-time search is the province of only a few. Google used to have a rather limited API, which has sadly been discontinued. Yahoo's BOSS effort has recently been the leader in the field of open search APIs; its rate limits for developers are rather generous, but still present, and Yahoo has clearly stated that it plans to introduce a fee-based structure for the use of BOSS APIs. Another widely known search API is Twitter's.

It is interesting that rate limiting has featured prominently in every search API to date. The main reason, we believe, is the perception that APIs are a cost item almost by definition; rate limits are there to put some kind of bound on that cost. In the case of Twitter, the issue is further exacerbated by the real-time nature of the search API, which places additional constraints on its computation and delivery. In simpler terms, it is considerably more expensive to operate a real-time search API than a more conventional one.
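To make the cost-bounding idea concrete, here is a minimal sketch of the kind of token-bucket throttling commonly used to enforce per-key rate limits. This is a generic illustration, not any particular provider's implementation, and the rate and burst numbers are made up.

```python
import time

class TokenBucket:
    """Minimal per-key token-bucket rate limiter; numbers are illustrative."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec          # tokens refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Credit tokens accrued since the last call, up to capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                   # serve this request
        return False                      # over limit: caller should back off

# A hypothetical free tier: 1 query/sec on average, bursts of up to 10.
bucket = TokenBucket(rate_per_sec=1.0, capacity=10.0)
if bucket.allow():
    pass  # handle the search query
```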

Twitter has been the leader in this field, and they have been quite accommodating with developers and other parties using their service. Even so, they have clearly defined rate limits; for example, the number of parties with access to the full Twitter stream (the so-called "firehose") is very limited.

A natural question arises: does it make sense to even think of an open search API with only a few, or no, resource limitations? What would such a service look like, and who would provide the resources required to operate it?

As great fans and practitioners of the art and science of search, we have been asking ourselves these questions, and we have concluded that it would be possible to create such an open RTS API, with very few resource limitations, by leveraging the unique nature of our P2P-based distributed cloud. We would like to engage the developer community, as well as other interested parties, in a discussion to hear what kind of interest there would be in such a project, and which features and modes of delivery and operation would be most valuable.

The Wowd distributed network is rather unique with respect to resources because users themselves contribute them. Our main idea is that by contributing to our distributed cloud, users should be able to access the API in proportion to their contribution. As an analogy, consider BitTorrent and its policies for controlling download speeds: it is well known that to greatly speed up download rates in BitTorrent, one needs to contribute back at a faster rate by increasing upload speeds.

The nature of the Wowd system is very different from BitTorrent's: Wowd is about discovering, sharing and searching new content across the entire Web, whereas BitTorrent is about sharing media content. In both cases, however, the principal reason for using user contributions as the reciprocal measure for API usage is to ensure full scalability and availability of API resources, regardless of the load placed on them.

There are several ways such an API could be operated, and we are eager to hear your feedback! For instance, the level of contribution could be measured by the bandwidth, RAM, disk, etc. that are shared. In simple terms, if someone runs a bigger machine with more bandwidth and other resources, they should by that very fact be able to access the API at a higher rate, as the sketch below illustrates.
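Here is a rough sketch of how contribution-proportional access might be computed. The weights and baseline below are purely hypothetical, not a committed Wowd policy:

```python
# Hypothetical contribution-proportional quotas; weights and baseline
# are made-up values for illustration, not actual Wowd policy.

BASELINE_QPS = 0.5  # minimal access granted to any participating node

WEIGHTS = {
    "upload_mbps": 2.0,   # shared bandwidth, in Mbit/s
    "ram_gb": 1.0,        # RAM devoted to the distributed cloud, in GB
    "disk_gb": 0.05,      # disk devoted to the index, in GB
}

def api_rate(contribution):
    """Allowed queries per second, proportional to shared resources."""
    score = sum(WEIGHTS[k] * contribution.get(k, 0.0) for k in WEIGHTS)
    return BASELINE_QPS + score

# A node sharing 10 Mbit/s upload, 2 GB RAM and 100 GB disk:
print(api_rate({"upload_mbps": 10, "ram_gb": 2, "disk_gb": 100}))
# -> 27.5 queries/sec (0.5 + 20.0 + 2.0 + 5.0)
```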

In addition to resource contributions, there is also the natural question of contributing user attention data to the Wowd Attention Frontier. Of course, a single user contributes a given amount of attention data no matter what kind of machine they use, but we are particularly interested in cases where that user gets other users to contribute attention data too. Such a user should clearly be able to access the API at (much) higher rates. There are various ways such a mechanism could be implemented, e.g. by introducing "attention data keys" that would be used to label groups of users introduced by a single source.
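Purely as a sketch, one way such "attention data keys" might work is below; the names and the bonus formula are our assumptions for illustration, not a Wowd specification:

```python
# Sketch of "attention data keys": a source hands out a key, and users
# who contribute attention data under that key are credited toward the
# source's API quota. Names and the bonus formula are hypothetical.

from collections import defaultdict

# key -> set of user ids who contributed attention data under that key
users_by_key = defaultdict(set)

def record_attention(key, user_id):
    """Label an incoming attention-data contribution with its source key."""
    users_by_key[key].add(user_id)

def key_bonus_qps(key, per_user_qps=0.1):
    """Extra API rate earned by a key's owner for the users they brought in."""
    return per_user_qps * len(users_by_key[key])

record_attention("team-alpha", "u1")
record_attention("team-alpha", "u2")
print(key_bonus_qps("team-alpha"))  # -> 0.2 extra queries/sec
```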

Furthermore, in addition to increased API usage limits, there are other benefits that could be provided, for instance by processing attention data to obtain affinity-group-specific rankings, tags, segments of the attention frontier, and so on. These are just initial ideas; many more things could be devised on top of such a contribution scheme.
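As a toy example of what an affinity-group-specific ranking could mean, one might boost results by the attention a group has paid to each URL; the data shapes and scoring rule below are illustrative assumptions only:

```python
# Toy affinity-group re-ranking: boost results by how much attention
# members of a group have paid to each URL. Data shapes and the scoring
# rule are illustrative assumptions, not a Wowd algorithm.

def rerank(results, group_attention, weight=0.5):
    """results: list of (url, base_score); group_attention: url -> visits."""
    def score(item):
        url, base = item
        return base + weight * group_attention.get(url, 0)
    return sorted(results, key=score, reverse=True)

results = [("example.com/a", 1.0), ("example.com/b", 0.9)]
attention = {"example.com/b": 3}   # the group visited /b often
print(rerank(results, attention))  # /b now ranks first (0.9 + 1.5 > 1.0)
```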

We are very eager to hear back from developers, potential partners and other interested parties about their level of interest, along with feedback and comments on the ideas above. We feel such a scheme would be a novel way of creating an almost limit-free search API. We would love to hear if you feel the same!