In essence, crawlers are really simple programs that endlessly repeat the same sequence of simple tasks: fetch web pages, extract all the links on them, eliminate duplicates, and put all the new links in a queue to be crawled; repeat. There are additional complications: masking the latency of fetching individual pages by running many crawler instances in parallel, and making sure robots.txt constraints are satisfied across all instances at all times. But that is for the most part it; if you do all of these things well, you have a world-class crawler.
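The core loop really is that simple. Here is a minimal single-threaded sketch in Python, using a toy in-memory link graph with hypothetical page names in place of real HTTP fetching (and ignoring the parallelism and robots.txt concerns mentioned above):

```python
from collections import deque

def crawl(seed, get_links):
    """Breadth-first crawl: fetch a page, extract its links,
    eliminate duplicates, enqueue the new ones; repeat."""
    seen = {seed}          # de-duplication set
    queue = deque([seed])  # frontier of links waiting to be crawled
    order = []             # pages fetched, in crawl order
    while queue:
        url = queue.popleft()
        order.append(url)              # "fetch" the page
        for link in get_links(url):    # extract its links
            if link not in seen:       # skip duplicates
                seen.add(link)
                queue.append(link)     # enqueue only new links
    return order

# Toy link graph standing in for the real web (hypothetical pages)
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": [],
    "d": ["a"],
}
print(crawl("a", lambda u: graph.get(u, [])))  # → ['a', 'b', 'c', 'd']
```

A production crawler replaces `get_links` with an HTTP fetch plus HTML parsing, and shards the queue across many machines, but the fetch/extract/de-duplicate/enqueue cycle stays the same.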
So what, one might say. Crawling has been working just fine, why rock the boat?
Consider modern-day spammers, black-hat SEOs and the like; what is their most effective weapon? It is the ability to add, in uncontrollable fashion, huge amounts of content (hundreds of millions, even billions, of pages) aimed at fooling ranking algorithms and artificially boosting their rankings on Google et al. Because the search industry likes to project an aura of omniscience, creating the impression that such actions do not matter and that it can handle them no matter what, its adversaries take this to the bank, creating ever bigger piles of useless stuff to fool it.
The Attention Frontier is the notion that all the stuff a group of well-meaning users looks at matters a lot. If such a group is large enough and they do it long enough, then the corpus of pages they liked and looked at is very valuable to others, in that it is timely, relevant and high quality, so a search through it yields very interesting results.
You may say that such an approach would have no chance of covering huge portions of the long tail but that is not really the case.
- It would cover relevant and interesting pieces of the long tail users looked at, which are of great interest anyway.
- The notion that absolutely everything is and should be crawled and indexed is very wrong – search engines know this very well and already index only a very small (and declining) fraction of the links which are available to them. This is an interesting subject in itself and I will discuss it in more detail in a future post.
- There are ways of expanding the Attention Frontier in an automated, or semi-automated way. One could say it would be (distributed) crawling, sure, as long as we get good results out of it instead of piles of junk. If one is to do it, the key is to do it carefully, in a measured fashion.
To get an idea of how big the Attention Frontier could be, consider an average Internet user. It is not inconceivable that they click on hundreds of links daily and tens of thousands annually.
Let us look at the user numbers more closely. There is an important distinction between the number of simultaneously active users at any given time and the number of users over a period, e.g., a day. The latter is what matters here, because all users are capable of contributing attention data, not only the ones currently active. These two numbers differ because users join and sign off throughout any period, so the number of simultaneously active users will be smaller.
The concept of measuring currently active users was initially popularized by P2P companies, most notably Skype; more recently it has been pushed by Facebook and its apps as a better measure of an app's popularity than the number of downloads.
Consider a group of, say, 100,000 simultaneously active users; it would not be unreasonable to assume that the number of daily users would be 10x that, i.e., 1M. Further assuming an average contribution of 100 clicks/day, we get 100M clicks/day = 36.5B clicks/year.
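The back-of-envelope arithmetic above can be checked in a few lines (the 10x daily-to-simultaneous ratio and the 100 clicks/day figure are the assumptions stated in the text):

```python
simultaneous_users = 100_000
daily_users = simultaneous_users * 10          # assumed 10x daily/simultaneous ratio
clicks_per_user_per_day = 100                  # assumed average contribution

daily_clicks = daily_users * clicks_per_user_per_day
yearly_clicks = daily_clicks * 365

print(f"{daily_clicks:,} clicks/day")   # → 100,000,000 clicks/day
print(f"{yearly_clicks:,} clicks/year") # → 36,500,000,000 clicks/year
```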
A corpus like that would be a great core for a pretty good search engine. Of course, it would be great for lots of other things, including discovery, recommendation, ranking, queries and other goodies. A corpus of 1M+ users would generate hundreds of millions of clicks daily and would reach into the billions within days or weeks.
There is the issue of duplicates, i.e., the most popular links will be clicked by many people, reducing the number of new links found by the group. But far from being a drawback, this is actually a great advantage: the number of duplicates is itself a popularity measure and would make a great ranking signal.
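One way to picture this: de-duplicating the click stream and counting how often each link repeats gives you the frontier and the popularity signal in one pass. A small sketch over a hypothetical stream of (user, clicked URL) events:

```python
from collections import Counter

# Hypothetical stream of (user, clicked_url) attention events
clicks = [
    ("u1", "example.com/hot"),
    ("u2", "example.com/hot"),
    ("u3", "example.com/hot"),
    ("u1", "example.com/niche"),
    ("u2", "example.com/other"),
]

popularity = Counter(url for _, url in clicks)  # duplicate clicks become the signal
frontier = list(popularity)                     # de-duplicated set of links
ranked = popularity.most_common()               # links ordered by click count

print(ranked[0])      # → ('example.com/hot', 3)
print(len(frontier))  # → 3
```

Five clicks collapse to three unique links, and the "lost" duplicates come back as a ranking: the most-clicked link sits at the top.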
The issue is whether the tail of the distribution of links in the Attention Frontier would be long enough and we think it definitely would. As I said, there are ways of expanding it, in a measured fashion, but I do not believe (semi-)automated expansion is critical.
The data in the Attention Frontier, user attention data, is extremely valuable. The search industry knows this and has been trying to collect and use this data for some time now. But it has not been transparent about it, instead adopting an attitude of "just give us your data, we will store it for you, don't worry about it."
Of course, privacy is at the center of the Attention Frontier. Users should be free to choose what to do with their attention data. We want to tell them openly about the value of that data and what can be done with it, and invite them to participate and share whatever they feel comfortable sharing. The entire process should be much more transparent and open than it is now.
The Attention Frontier is at the core of what we have been doing in search. As our launch approaches, you will be able to see how our vision of the Attention Frontier and scalable distributed search, discovery and recommendation works.
Discovery and recommendation are intrinsically related to search. Crawling is really an automated process of search discovery, and recommendation is linked to ranking and quality: if one has a sense of what is good, the jump to recommendation is not that big. We will talk more about discovery and recommendation soon, as they will be part of our offering from the start.