Monday, September 28, 2009

Distributed Cloud

As promised, we have another white paper on the topic of Distributed Clouds.

Clouds have become a fascinating topic. Of course, as with most very popular subjects, there is no clear definition of what they really are, and the concept is very broad. On the other hand, some things are starting to solidify.

Our take on clouds is a bit different. We focus on Distributed Clouds: more specifically, clouds spread across very wide area networks such as the Internet and composed of many independent users, in contrast to the prevalent view of many machines hosted and operated by a single entity.

The emergence of computing clouds has put renewed emphasis on the issue of scale in computing. The enormous size of the Web, together with ever more demanding requirements such as freshness (results in seconds, not weeks), means that massive resources are required to handle enormous datasets in a timely fashion. Datacenters are now considered the new units of computing power, e.g. Google's Warehouse-Scale Computer. The number of organizations able to deploy such resources keeps shrinking. Wowd aims to demonstrate that there is an even bigger scale of computing than yet imagined: planetary-sized distributed clouds. Such clouds can be deployed by motivated collections of users, instead of a handful of gigantic organizations.

The definition of cloud is still not firmly established, so let us start with ours. We consider a cloud to be a collection of computing resources, where it is possible to allocate and provision additional resources in an incremental and seamless way, with no disruption to the deployed applications.

In this key respect, a cloud is not simply a group of servers co-located at some data center, since with such a collection it is neither simple nor clear how to deploy additional machines for many tasks. Consider, for example, a server supporting a Relational Database Management System. A large increase in the number of records cannot be handled simply by adding machines: the underlying database needs to be partitioned so that all operations and queries perform satisfactorily across all of the machines. The solution in this situation requires significant re-engineering of the database application, as the sketch below illustrates.
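To make the point concrete, here is a minimal Python sketch of naive hash-based partitioning; the names are hypothetical and purely illustrative, not any particular database's mechanism. Note how merely adding one server changes where most records belong, which is exactly the kind of disruption the application has to absorb.

# Minimal sketch of naive hash partitioning across database servers.
# All names here are hypothetical, for illustration only.

def shard_for(record_id: int, num_servers: int) -> int:
    """Map a record to a server by hashing its id."""
    return hash(record_id) % num_servers

# Count how many records change their home shard when a 5th server is added.
moved = sum(1 for rid in range(100_000)
            if shard_for(rid, 4) != shard_for(rid, 5))

# Most records relocate, which is why scaling an RDBMS is never
# "just add a machine": the data itself has to be redistributed.
print(f"records relocated by adding one server: {moved} of 100000")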

Clouds are considered to be collections of machines where it is possible to dynamically scale and provision additional resources for the underlying application(s) with no change or disruption to their operation. Some, such as Google, consider the datacenters that form the basis for clouds to be a new form of "warehouse-scale computer" (source: "The Datacenter as a Computer", Google Inc., 2009). Clearly, the number of organizations capable of deploying such resources is small, and getting smaller, due to prohibitive cost.

Consider, as an example, P2P networks. Ever since the very inception of P2P, these networks have been associated with a rather narrow scope of activities, principally the sharing of media content. The scale of computing occurring in such networks at every moment is truly staggering, yet there is a common (mis-)perception that such massive distributed systems are good only for a very limited set of activities, specifically the sharing of (often illicit) content. Our goal is to demonstrate that distributed networks can be the basis for tremendously powerful distributed clouds, quite literally of planetary scale. At that scale, the power provided by such a cloud actually dwarfs that of even the biggest proprietary clouds.

I am posting the preceding part of our white paper as a preview; if you like it, you can read the rest at Wowd Distributed Cloud.

Wednesday, September 23, 2009

Attention Frontier Efficiency

I discussed the power of the Attention Frontier in a previous post. It is interesting to compare some of the numbers and assumptions from that post to actual numbers from Twitter feeds, which are really the Twitter Attention Frontier (TAF).

First, there are, on average, about 2K URLs/min posted on Twitter in posts containing (shortened) links. That results in about 3M links/day, before deduplication or any kind of quality, spam or relevance analysis. This number should be considered in the context of the total number of Twitter users, which is not published, though we do know that there are now over 40 million unique visitors/month. Considering such a large number of users, it becomes pretty clear that the fraction of Twitter users who are posting links is actually quite small: an average Twitter user posts about 2-3 links/month.
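A quick back-of-the-envelope check of these figures, using the rounded numbers above (a sketch, not exact Twitter statistics):

# Back-of-the-envelope check of the Twitter link numbers quoted above.
urls_per_minute = 2_000                      # ~2K URLs/min posted with links
links_per_day = urls_per_minute * 60 * 24    # ~2.9M links/day
links_per_month = links_per_day * 30         # ~86M links/month
monthly_visitors = 40_000_000                # >40 million unique visitors/month

links_per_user_per_month = links_per_month / monthly_visitors
print(f"~{links_per_day / 1e6:.1f}M links/day")
print(f"~{links_per_user_per_month:.1f} links per user per month")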

It is actually quite amazing that with such a low "yield" of links per user per month the resulting feeds of updates are so informative. We at Wowd strongly believe this is a very good indication of the power of the Attention Frontier concept.

Our vision of the Attention Frontier is designed to be (much) more efficient, requiring no explicit action by users. Instead, everything is derived transparently and implicitly from their natural browsing actions. The number of potential user actions (clicks) for an average user runs in the thousands per month, and our system is designed to leverage these actions as efficiently as possible.
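As a rough illustration of the efficiency gap, here is the same arithmetic with an assumed round figure of 2,000 clicks per user per month (the "thousands" mentioned above) compared against the 2-3 explicit links per month estimated for Twitter; these are estimates, not measurements:

# Rough comparison of implicit vs. explicit signal per user per month.
# The click count is an assumed round figure, not a measured one.
explicit_links_per_month = 2.5      # average Twitter user, from the estimate above
implicit_clicks_per_month = 2_000   # "thousands" of clicks, assumed value

ratio = implicit_clicks_per_month / explicit_links_per_month
print(f"implicit browsing yields roughly {ratio:.0f}x more signals per user")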

Another interesting question is the speed of propagation on the Attention Frontier. Twitter provides a great example, since the propagation of the most popular trends is remarkably fast. This is not very surprising: the whole point of real-time search is, by definition, the identification and analysis of fast-rising trends.

There is also another interesting question: how to rank new pages so they are not swamped by older pages that have had time to accumulate rank. This is a burning issue for many small publishers, who struggle to break into Google's index when competing with already available material. There is even the well-known Google Sandbox, where new pages have to sit for a period of time (rumored to be months) no matter what, to prevent people from gaming the system by suddenly introducing enormous quantities of new content through some trick.

New pages can be assigned a meaningful rank from their conventional relevancy and quality scores, together with indirect factors such as the reputation of the author. Of course, the overwhelming ranking signal for new pages is freshness, i.e. how quickly they are discovered.
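One simple way to blend these signals, shown here as a sketch rather than Wowd's actual ranking function, is an exponential freshness decay combined with the conventional scores; the weights and half-life are assumed values:

import time

# Sketch of a freshness-aware score; the formula and weights are assumptions,
# not an actual production ranking function.
def new_page_score(relevancy: float, author_reputation: float,
                   discovered_at: float, half_life_hours: float = 6.0) -> float:
    """Blend conventional relevancy/quality with an exponential freshness decay.

    relevancy         -- conventional relevancy/quality score in [0, 1]
    author_reputation -- indirect signal in [0, 1]
    discovered_at     -- Unix timestamp when the page was discovered
    half_life_hours   -- freshness halves every N hours (assumed value)
    """
    age_hours = (time.time() - discovered_at) / 3600.0
    freshness = 0.5 ** (age_hours / half_life_hours)
    # Freshness dominates for newly discovered pages, as described above.
    return 0.6 * freshness + 0.3 * relevancy + 0.1 * author_reputation

# Example: a page discovered two hours ago with decent relevancy.
print(new_page_score(relevancy=0.7, author_reputation=0.4,
                     discovered_at=time.time() - 2 * 3600))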

The key point is that real-time search is still an instance of search, and as such it is vitally important to pay close attention to the relevancy of results, in addition to freshness. In this respect, the nascent real-time search industry has a ways to go, but the journey should be pretty exciting :)