- There is one more element to add to the confusion in the format of the results - in addition to a dynamic stream of differently formatted updates on top of regular results, we now have yet another differently formatted set of results. One question is what is the relationship between 'top links' and results below? Should regular results subsume top links? Or are top links somehow inferior w.r.t. ranking of the general results? Are they supposed to be just fresher?
- How complete, authoritative and up-to-date are the updates, as well as associated top links? After all, they are coming from a limited set of sources and also the question is if all updates from Twitter, FriendFeed and Facebook are in there? What about the rest of the Web?
- Why do we have all these formats, top links, dynamic fresh updates in the first place? Why not make sure that regular search results are always up-to-date and contain most authoritative and relevant information, including top links from these sources?
Tuesday, April 13, 2010
Monday, March 22, 2010
Search APIs have always been of great interest to the developer community as well as other parties, such as researchers. In addition, the issue of huge resources required for modern search gave rise to the perception that big-time search is a province of only a few. Google used to have rather limited API which has been, sadly, discontinued. Yahoo BOSS effort has been recently the leader in the field of open search APIs. Its rate limits for developers are rather generous, but still present. On the other hand, Yahoo clearly states that they plan to introduce a fee based structure for use of BOSS APIs. Another example of a widely known search API is Twitter.
It is rather interesting that rate limiting has been rather prominently featured in all search APIs to date. The main reason for this requirement, we believe, is the perception that APIs are a cost item, almost by definition. The rate limits are hence present, to put some kind of bounds on this cost. In the case of Twitter, the issue is further exacerbated by the real-time nature of the search API which places further constraints on its computation and delivery. In simpler terms, it is considerably more expensive to operate a real-time search API than a more conventional one.
Twitter has been the leader in this field and they have been quite accomodative with developers and other parties using their service. However, even they have clearly defined rate limits e.g. the number of parties using full twitter API (so called "firehose") is very limited.
A natural question arises - does it make sense to even think of an open search API with only a few, or no, resource limitations? And how would such a service look like and who would provide for resources required to operate it?
As great fans and practitioners of search art & science, we have been asking these questions ourselves and we have come to a conclusion that it would be possible to create such and open RTS API, with very few resource limitations, by leveraging unique nature of our P2P-based distributed cloud. We would like to engage the developer community as well as other interested parties in a discussion where we would like to hear what kind of interest there would be for such a project as well as what features and modes of delivery and operation would be of most interest.
Wowd distributed network is rather unique with respect to resource contribution because users themselves are able to contribute resources. Our main idea is that by contributing to our distributed cloud users should be able to access the API in a proportional way. As an example, consider BitTorrent and their policies for controlling download speeds. It is well known that if one wants to greatly speed up download rates in BitTorrent, they need to allow faster rate of contributing back by increasing upload speeds.
The nature of Wowd system is very different from BitTorrent as Wowd is about discovering, sharing and searching new content across the entire Web as opposed to BitTorrent which is about sharing media content. However, in both cases the principal reason for using user based contributions as reciprocal measures for API usage is to ensure full scalability and availability of API resources , regardless of the levels of load placed on it.
There are several ways such an API could be operated and we are eager to hear your feedback! For instance, the level of contributions can be measured by bandwidth, RAM, disk etc that are shared. In simple terms, if one is using a bigger machine with more bandwidth and other resources, by that very fact they should be able to access the API at a higher rate.
In addition to resource contributions, there is also a natural question of contributing user attention data to Wowd Attention Frontier. Of course, a single user would contribute a given amount of attention data no matter what kind of machine they are using but we are particularly interested in cases where said user would get other users to participate attention data too. Such an user should clearly be able to access the API at (much) higher rates. There are various ways such a mechanism could be implemented e.g. by introducing "attention data keys" which would be used for labeling groups of users introduced by a source.
Furthermore, in addition to increased limits of API usage, there are additional benefits that could be provided by, for instance, processing attention data to obtain affinity group specific rankings, tags, segments of attention frontier etc. These are just initial ideas, there could be many things that could be devised based on such a contribution scheme.
We are very eager to hear back from the developers, potential partners, as well as other interested parties, about their level of interest, feedback, comments about the above ideas. We feel such a scheme would be a novel way of creating an almost limit-free search API. We would love to hear if you feel the same!
Sunday, December 13, 2009
Wednesday, December 9, 2009
Ranking is a key issue in real-time search, the same way it is in general search. Some believe that one can produce a quality real-time search experience simply by filtering real-time information streams against keywords. The deficiencies of such approaches become clear very quickly -- there is so much noise that drowns out interesting results and there are many cases in which a slightly older result is much better than a slightly newer one and it should be ranked much higher.
One of the key issues in ranking is how to include user 'editorial' input in a scalable way, meaning how to include information about what people are looking at and scale it across the entire Web. Wowd uses EdgeRank, a distributed link-based algorithm that merges user inputs with power of link-based algorithms to provide across-the-web ranking. User input is critical in discovering new information as well as finding the highest quality results. The power of distributed link-based algorithms is needed to rank enormous datasets on the Web. EdgeRank accomplishes both.
In addition, there is the issue of what to do with older results. They can be very important, in cases where there are not enough new ones yet, or they should be displayed together with newer results, because of their quality. It becomes clear that older results should be retained for future use. In fact, a good real-time search engine that preserves older results starts converging toward a good general search engine. As a consequence, the resources required for real-time search are not lesser than for general search, in fact they are greater.
Wowd uses a completely scalable approach, where the resources in the system grow with the number of users. In contrast to centralized systems, where increased number of users degrades performance, in our distributed approach additional users increase performance, both in terms of additional resources as well as increased attention frontier. By attention frontier we mean a collection of what real people find interesting on the Web at any given moment. This data is very important, as Twitter has clearly demonstrated. People worldwide are clicking on tens of billions of links daily, and what's in Twitter is only a fraction of a percent of that.
There is another important aspect of our distributed approach. One might say that no one can touch the power of Google's 50+ distributed centers. But in a distributed system, the results are served from a neighborhood of nodes close to the requestor. As the network grows and number of nodes increases, the results are served from immediate neighbors that are probably closer to the requesting user than any data center could be.
We are looking forward to Google's entrance into real-time search as it is a step in the right direction, toward improving an important aspect of search that has been missing. We would like to analyze their offering and see what additional sources are being indexed in real-time, not only Twitter, and what fraction of their entire index is updated in real-time.
Tuesday, October 20, 2009
Today marks an end of the first stage of a great journey for me, which started more than three years ago, in summer of 2006. In those early days, I was getting fascinated with the power of distributed search. The vision was very clear, and many of the numbers behind power of distributed systems ( something I wrote about previously) were also clear.
We have come a long way since those days, now we have a technically solid product that works and shows the promise of the massively distributed approach. We also have a great team, with world-class technical and management foundations. And the support from our investors, DFJ and KPG Ventures has been amazing.
Of course, this is just the first stage. The real journey is only beginning now, with getting users to use Wowd and show what it can do. Stay tuned, should be great fun :)
Monday, September 28, 2009
Clouds have become a fascinating topic. Of course, as with most very popular subjects, there is no clear definition what they really are and the concept is very broad. On the other hand, some things are starting to solidify.
Our take on clouds is a bit different. We focus on Distributed Clouds, more specifically clouds across very wide area networks such as the Internet and clouds which are comprised of many independent users in contrast to the prevalent view where there are many machines which are hosted and operated by a single entity.
The emergence of computing clouds has put a renewed emphasis on the issue of scale in computing. The enormous size of the Web, together with ever-more demanding requirements such as freshness (results in seconds, not weeks) means that massive resources are required to handle enormous datasets in a timely fashion. Datacenters are now considered to be the new units of computer power, e.g. Google's Warehouse-Scale Computer. The number of organizations able to deploy such resources is ever shrinking. Wowd aims to demonstrate that there is an even bigger scale of computing than that yet imagined -- specifically -- planetary-sized distributed clouds. Such clouds can be deployed by motivated collections of users, instead of a handful of gigantic organizations.
The definition of cloud is still not firmly established, so let us start with ours. We consider a cloud to be a collection of computing resources, where it is possible to allocate and provision additional resources in an incremental and seamless way, with no disruption to the deployed applications.
In this key respect, a cloud is not simply a group of servers co-located at some data center since with such a collection it is not simple, nor very clear, how to deploy additional machines for many tasks. Consider, for example, the task of a server supporting a Relational Database Management System. A large increase in the number of records in the database cannot be simply handled only by adding additional machines since the underlying database needs to be partitioned such that all underlying operations and queries perform in a satisfactory fashion across all of the machines. The solution in this situation requires significant re-engineering of the database application.
Clouds are considered to be collections of machines where it is possible to dynamically scale and provision additional resources for underlying application(s) with no change nor disruption to the operation. Some, such as Google, consider datacenters which are basis for clouds, to be a new form of "warehouse-scale computer" (source: "The Datacenter as a Computer", Google Inc. 2009) Clearly, the number of organizations capable of deploying such resources is small, and getting smaller, due to prohibitive cost.
Consider, as an example, P2P networks. For the longest time, indeed, since the very inception of P2P, these networks have been asssociated with a rather narrow scope of activities – principally, sharing of media content. The scale of computing occurring in such networks every moment is truly staggering. However, there is a common (mis-)perception that such massive distributed systems are good only for a very limited set of activities, specifically, the sharing of (often illicit) content. Our goal is to demonstrate that distributed networks can be a basis for tremendously powerful distributed clouds, quite literally of planetary-scale. At that scale, the power provided by such a cloud actually dwarfs the power of even the biggest proprietary clouds.
I am posting the preceding part of our white paper as a preview, if you like it, you can read the rest at Wowd Distributed Cloud .
Wednesday, September 23, 2009
First, there are, on average, about 2K urls/min posted on Twitter, in posts containing (shortened) links. That results in about 3M links/day, before deduplication, or any kind of quality, spamming or relevance analysis. This number should be considered in the context of total number of Twitter users, which is not published, though we do know that there are now over 40 million unique visitors/month. Considering such a large number of users, it becomes pretty clear that the fraction of Twitter users who are posting links is actually quite small - one can say that an average Twitter user posts about 2-3 links/month.
It is actually quite amazing that with such low "yield" of links/month/user the resulting feeds of updates are so informative. We at Wowd strongly believe that this is a very good indication of the power of the concept of Attention Frontier.
Our vision of Attention Frontier is designed to be (much) more efficient, requiring no explicit action by users. Instead, everything is derived transparently and implicitly, from their natural browsing actions. The number of potential user actions (clicks) for an average user ranges in thousands/month. Our system is designed to be most efficient in leveraging these actions.
Another interesting question is the speed of propagation on the Attention Frontier. Twitter provides a great example since the speed of propagation of the most popular trends is pretty amazing. This is actually not very surprising, since the whole point of real-time search is identification and analysis of fast-rising trends, by definition.
There is also another interesting question, of how to rank new pages so they are not swamped by older pages which had time to accumulate rank. This is actually a burning issue for many small publishers, of how to break into Google index when competing with already available material. There is even a well-known Google Sandbox, where new pages have to sit for a period of time (rumored in months) no-matter-what, to prevent people who are able to , through some trick, game the system by introducing suddenly enormous quantities of new content.