Sunday, December 13, 2009

Human Premium

I have just read an excellent article on the rise of fast and cheap content by Michael Arrington over at TechCrunch. It starts with the current commotion over the decline of traditional media, an assessment I very much agree with. But then he goes further by pointing out that the next phase, already upon us, is the rise of a vast amount of largely worthless content that is produced very cheaply.

This point is very true IMHO, but it does not apply only to traditional media. At the heart of it is the fact that professionals and others have wised up to the basic weakness of automated content acquisition on the Internet, a.k.a. crawling: computers are very poor at judging basic things such as quality. One can measure many syntactic parameters, such as the appearance of keywords in headings and titles, document frequency, and so on, but it is very easy to produce complete nonsense that satisfies all of them. A child can do an infinitely better job of judging content than even the best algorithms can dream of.
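As a toy illustration of how easy purely syntactic signals are to game (a made-up scorer, not any search engine's actual formula), consider a ranker that only counts keyword hits in titles, headings and body text:

    # Toy, purely syntactic "quality" score: counts keyword hits in the title,
    # headings and body of a page. This is a made-up formula, useful only to show
    # how easily such signals are satisfied by machine-generated nonsense.
    def syntactic_score(title, headings, body, keywords):
        score = 0.0
        for kw in keywords:
            kw = kw.lower()
            score += 3.0 * title.lower().count(kw)                     # title matches weigh most
            score += 2.0 * sum(h.lower().count(kw) for h in headings)  # heading matches
            score += 1.0 * body.lower().count(kw)                      # raw keyword frequency
        return score

    keywords = ["mortgage", "refinance"]
    # Keyword-stuffed gibberish...
    spam = syntactic_score("Mortgage refinance mortgage deals",
                           ["Refinance your mortgage today"],
                           "mortgage refinance " * 50, keywords)
    # ...easily outscores a genuinely useful article that uses the terms sparingly.
    real = syntactic_score("Should you refinance?",
                           ["When a new mortgage makes sense"],
                           "A careful look at closing costs, rates and break-even time.",
                           keywords)
    print(spam > real)  # True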

In addition to content, online commerce too has been overrun with SEO plays; indeed, SEO is one of the main legs of the consumer Internet these days.

So what is one to do in the face of this onslaught of vast amounts of garbage floating around? I believe the result will be a (greatly) increased premium on human-based quality discovery. It will become harder and harder to sift through the reams of nonsense floating around, but the reward for those producing quality will be that much higher, as people stick with trusted producers.

For instance, there is an unbelievable amount of worthless financial commentary and outright disinformation floating around. It is indeed hard to penetrate through casual discovery, but with a little more digging true gems can be found. I rely heavily on a couple of financial blogs and a subscription site to get good information. Of course I watch the rest of it too, but with a big discount.

As a disclaimer, I founded, and work for, a startup, Wowd, that leverages human-based discovery. I really believe that all these trends, including the one pointed out by Michael Arrington, are indicators of how the value of human input, coupled with smart automation, will greatly increase and become a key factor in the discovery of high-quality content.

Wednesday, December 9, 2009

Google Launches Real-Time Search

Google has just announced a real-time search offering, which is quite an interesting development. I would like to make a few comments about some of the important issues in real-time search.

Ranking is a key issue in real-time search, the same way it is in general search. Some believe that one can produce a quality real-time search experience simply by filtering real-time information streams against keywords. The deficiencies of such approaches become clear very quickly: there is so much noise that interesting results get drowned out, and there are many cases in which a slightly older result is much better than a slightly newer one and should be ranked much higher.
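To make that concrete, here is a minimal sketch, with invented weights and half-life, of blending a quality signal with a freshness decay instead of sorting by timestamp alone; it is purely illustrative and not Google's or Wowd's actual ranking:

    import math

    # Illustrative only: combine a quality score with an exponential freshness decay
    # instead of ranking purely by recency. The half-life and weights are invented.
    def realtime_rank(quality, age_seconds, half_life=3600.0):
        freshness = math.exp(-age_seconds * math.log(2) / half_life)
        return 0.7 * quality + 0.3 * freshness

    # A high-quality result from 20 minutes ago...
    older = realtime_rank(quality=0.9, age_seconds=20 * 60)
    # ...outranks a noisy result posted seconds ago.
    newer = realtime_rank(quality=0.2, age_seconds=30)
    print(older > newer)  # True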

One of the key issues in ranking is how to include user 'editorial' input in a scalable way, meaning how to take information about what people are looking at and scale it across the entire Web. Wowd uses EdgeRank, a distributed link-based algorithm that merges user inputs with the power of link-based algorithms to provide across-the-web ranking. User input is critical in discovering new information as well as finding the highest-quality results. The power of distributed link-based algorithms is needed to rank enormous datasets on the Web. EdgeRank accomplishes both.
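For illustration only, and emphatically not Wowd's actual EdgeRank, a naive blend of a link-based score with normalized click counts might look like this:

    # Hypothetical blend of a link-based score with a normalized user-click signal.
    # This is NOT Wowd's EdgeRank; it only illustrates merging editorial input with
    # link analysis. Scores, weights and page names are invented.
    def blended_score(link_score, clicks, total_clicks, alpha=0.5):
        click_share = clicks / total_clicks if total_clicks else 0.0
        return alpha * link_score + (1 - alpha) * click_share

    pages = {
        "established-page": {"link_score": 0.8, "clicks": 10},
        "fresh-page":       {"link_score": 0.1, "clicks": 400},  # newly popular with users
    }
    total = sum(p["clicks"] for p in pages.values())
    ranked = sorted(pages,
                    key=lambda u: blended_score(pages[u]["link_score"], pages[u]["clicks"], total),
                    reverse=True)
    print(ranked)  # ['fresh-page', 'established-page'] -- user clicks surface the new page

The weight alpha governs how much link authority matters relative to what people are clicking on right now.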

In addition, there is the issue of what to do with older results. They can be very important in cases where there are not enough new ones yet, or when they should be displayed together with newer results because of their quality. It becomes clear that older results should be retained for future use. In fact, a good real-time search engine that preserves older results starts converging toward a good general search engine. As a consequence, the resources required for real-time search are no smaller than for general search; in fact, they are greater.
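A hypothetical sketch of that retention idea, with an invented policy and thresholds, might backfill a thin set of fresh results from an archive of older, high-quality ones:

    # Hypothetical retention policy: keep older results indexed and backfill with
    # them when a query returns too few fresh matches. Names and thresholds are
    # invented for illustration.
    def merge_results(fresh, archive, min_results=10):
        results = list(fresh)
        if len(results) < min_results:
            needed = min_results - len(results)
            # Backfill with the highest-quality older results rather than a thin page.
            results.extend(sorted(archive, key=lambda r: r["quality"], reverse=True)[:needed])
        return results

    fresh = [{"url": "breaking-story", "quality": 0.4}]
    archive = [{"url": "in-depth-analysis", "quality": 0.9},
               {"url": "old-forum-post", "quality": 0.2}]
    print([r["url"] for r in merge_results(fresh, archive, min_results=2)])
    # ['breaking-story', 'in-depth-analysis']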

Wowd uses a completely scalable approach, where the resources in the system grow with the number of users. In contrast to centralized systems, where an increased number of users degrades performance, in our distributed approach additional users increase performance, both in terms of additional resources and an expanded attention frontier. By attention frontier we mean the collection of what real people find interesting on the Web at any given moment. This data is very important, as Twitter has clearly demonstrated. People worldwide are clicking on tens of billions of links daily, and what's in Twitter is only a fraction of a percent of that.
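As a rough sketch, and not a description of Wowd's implementation, an attention frontier could be modeled as a rolling window over recent user clicks:

    import time
    from collections import deque

    # Rough sketch of an "attention frontier": a rolling window over the links real
    # users have clicked recently. The window length and structure are assumptions,
    # not a description of Wowd's implementation.
    class AttentionFrontier:
        def __init__(self, window_seconds=600):
            self.window = window_seconds
            self.clicks = deque()  # (timestamp, url) pairs, oldest first

        def record_click(self, url):
            now = time.time()
            self.clicks.append((now, url))
            # Drop clicks that have fallen outside the window.
            while self.clicks and now - self.clicks[0][0] > self.window:
                self.clicks.popleft()

        def current(self):
            return [url for _, url in self.clicks]

    frontier = AttentionFrontier()
    frontier.record_click("http://example.com/breaking-news")
    print(frontier.current())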

There is another important aspect of our distributed approach. One might say that no one can touch the power of Google's 50+ distributed data centers. But in a distributed system, results are served from a neighborhood of nodes close to the requester. As the network grows and the number of nodes increases, results are served from immediate neighbors that are probably closer to the requesting user than any data center could be.
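As a toy illustration with made-up latencies, node selection in such a system simply prefers the lowest-latency peers:

    # Toy illustration with invented latencies: serve results from the lowest-latency
    # peers instead of a distant data center.
    def closest_nodes(latency_ms_by_node, k=3):
        return sorted(latency_ms_by_node, key=latency_ms_by_node.get)[:k]

    latencies = {"peer-a": 12, "peer-b": 45, "remote-data-center": 80, "peer-c": 20}
    print(closest_nodes(latencies, k=2))  # ['peer-a', 'peer-c']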

We are looking forward to Google's entry into real-time search, as it is a step in the right direction, toward improving an important aspect of search that has been missing. We would like to analyze their offering and see what additional sources are being indexed in real time, beyond Twitter, and what fraction of their entire index is updated in real time.