‘I do not seek. I find.’
Pablo Picasso
What if Match.com, the well-known dating site, worked like Google?
Absurd? Consider.
On February 12 of this year at 12:17 p.m. in Santa Clara, California, I sat down at my computer and typed the phrase ‘Roman Architecture’ into the Google search box. When the results were returned I clicked on the ‘image’ button and Google’s familiar thumbnail display presented 18 results for my consideration.
The first result was a picture of the Piazza dei Miracoli in Pisa. 1000 years too late and, as beautiful as it is, the Leaning Tower is completely inappropriate to my search. In other words, the very first result that Google returned, the one with the highest Google-relevance, was neither useful nor even related.
The second result was a car rental site that happened to put a picture of the Pantheon on their web site. Not useful.
The third result was a blog by a nice family which had spent the day at the Roman Baths in Paris. So far, so good: one useful picture.
The first actually useful result was number 13 which linked to an elaborate and discursive site on Lepcis Magna (a Roman city) and which was sponsored by a major university.
Time spent to evaluate the first page of results: 20 minutes. What was my harvest? The Leaning Tower, one blog, one car-rental site, five pictures of the Pantheon and one useful site. Here’s a question that I’d like to see answered by Google: in an internet world in which there is a rich abundance of sites offering picture archives, discussions, and scholarly literature on Roman Architecture (and a myriad of other topics) why does a car rental site or a blog come way ahead of Harvard University? The flip side is even more depressing. If you page through 10 or so additional pages, looking at every result, you’ll see some very good and relevant sites listed. But they’re so far back and behind so much garbage that only the most intrepid will ever see them. They might as well not even be there at all. It’s counterintuitive that Google’s vaunted ‘algorithm’ for search not only produces garbage in response to a plain request; it also takes the relevant sites and hides them.
The fact is that Google’s search results are, at best, only marginally useful.
As an ‘index to the web’ Google is a failure.
This is not a condemnation or even a criticism of Google. None of the other popular search sites are any better. And things could be worse; at one time they were worse. Before Google came along in 1997 I well remember that, no matter what your search was, when you used one of the other search engines popular at the time the first five or six results were always local real estate agents. When Google first made real web search a possibility everyone breathed a sigh of relief. The real question is why haven’t we made any progress? Why are search results still so awful?
Full disclosure. I have a web site, squinchpix.com, which consists of about 3000 pictures of Greek and Roman architectural ruins, archaeological sites, art-works, and so forth. The pictures are of very high quality, come in a variety of sizes, and are fully tagged and searchable so that if you want to see pictures of a fornace you only have to search on the word. This is a standard idea for a site; there are other sites like this and some of them are much larger. It’s very specialized and even if it were to be publicized in the New York Times it would never garner more than a few hits a day, if that. It’s partly a labor of love and a personal challenge to see how good a web site I can design. But those few people who did come to the site would find its offerings very useful and offer useful contributions in return. It would be fun to e-mail other such enthusiasts. But, forced to rely on Google, they’ll never know about squinchpix. The Google spider crawls my web-site constantly but ‘squinchpix’ will never come up above page 100 in any search of the type I described above. Why not?
The answer is that it’s up to me to prove to Google that my web site is relevant to ‘Roman architecture’ and the means by which this is done are exceedingly indirect and time-consuming. Note that. It’s up to me to prove the relevancy of my own site. Google says that they evaluate your site with their algorithm and decide what search words the site is relevant to. But that’s an illusion. The reality is precisely the reverse. It’s up to you to design your web site in a way that caters to Google’s algorithm and hope that eventually you’ll be noticed enough times so that Google will present your web-site name to searchers at a higher and higher ranking. This takes a long time and is maddening because Google doesn’t own the Internet. But our Internet experience is severely impacted by Google’s shortcomings. If you’re already established and well-known then Google loves you. If you’re the product of a large corporation with billions to spend on getting its message out then Google has no trouble finding you. But start a web-site on your own and you’re doomed to anonymity forever. In fact ‘net neutrality’, about which there has been so much anguished discussion, has already vanished. Google and search sites like it are the perpetrators.
Here’s another example. The web site sacred-destinations.com is a large site rather like mine with thousands of high quality pictures of every aspect of Roman, Greek, and Medieval archaeology and art not just from Europe but around the world. It deserves very high visibility but Google rates it exactly five out of ten for relevancy (I shouted with laughter when I saw this; sacred-destinations should be at least 9 out of 10 for all of its meta tags) and so you don’t often see it even when you deliberately use the search phrase ‘Roman Architecture’. Google’s ‘ranking’ for sacred-destinations is like the statement that 5.4456 angels can dance on the head of a pin. It’s precise but it’s meaningless.
It is just exactly what we’d expect from an AI algorithm that can never have any awareness of human concerns or purposes.
In the paragraph above I explained the concept of my site and, while you’d probably want to see it before answering, if I asked you whether the site was relevant to ‘Roman architecture’ it would take anybody with at least a sixth-grade education seconds to respond ‘Yes’. That is, it would take you seconds to perform a chore which Google’s algorithm cannot perform at all. Google’s primary means of establishing your site’s relevance is to evaluate the links that come into your site. Superficially this makes sense. If well-known and established sites from universities and various organs of the Italian archaeological service link to your site then it seems as though your site would be relevant to what they’re relevant to. But since, because of Google, they cannot know about you then they can never link to you. The average new site is caught in this almost unresolvable contradiction and so this forces the web-master to game the system. It would not surprise you, dear reader, to learn that much of a web-master’s time is expended in the activity of soliciting other web-masters (there are often forms for this), one by one, to please, pretty please, link to my site from yours so that the Google spider will increase my relevancy ranking. Thus is the Internet turned into an old-boys club. It becomes stodgy and hostile to new ideas; all because of Google’s completely useless way of creating the ‘index’.
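The chicken-and-egg problem described above can be made concrete with a toy example. The sketch below is not Google's actual algorithm, which is not public in detail; it simply counts inbound links, and the site names and the function `inbound_link_counts` are invented for illustration.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many other sites link to each site.

    `links` maps a site name to the set of sites it links out to.
    This is a deliberately crude stand-in for link-based relevance.
    """
    counts = Counter({site: 0 for site in links})
    for source, targets in links.items():
        for target in targets:
            if target != source:  # a site linking to itself earns nothing
                counts[target] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sites: two established ones link to each other;
    # the newcomer links out to both but nobody links back.
    links = {
        "university.example": {"museum.example"},
        "museum.example": {"university.example"},
        "newsite.example": {"university.example", "museum.example"},
    }
    for site, score in inbound_link_counts(links).most_common():
        print(site, score)
```

However generous the newcomer is with its own outbound links, its score stays at zero until an established site links to it, which is exactly the step that its invisibility prevents.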
Of course there’s one way of increasing your site’s ranking and of making it visible right away. That way is to buy ads for your site on, you guessed it, Google. If I did that then my ad would come up on page one of search results for roman art, architecture, or archaeology.
Is it any surprise, then, that it’s so hard for a new web site to become known? It may seem astounding but when you set up your web site for the first time it will be 90 to 120 days before your web site is considered by Google as a return result for a search on your key words. Up until then you might as well be creating web sites on Mars. No one outside Google is quite certain why this is; the concept is referred to as Google’s ‘sandbox’. ‘I’m still stuck in the sand-box’ is a common plaint among web masters. By what authority does Google delay the Internet by 4 months?
The plain fact is that mechanized relevancy ranking doesn’t work (probably deliberately) unless there’s an ad involved. Without a human being to evaluate a site’s relevance there seems to be no mathematical model that can reliably use incoming links, internal links, site maps, semantic analysis, or even keywords, to infer the actual human significance of a web site. Google is, in fact, a huge artificial intelligence project and one with about the same degree of success as other AI projects, which is: hardly any.
There’s another bothersome aspect to all of this. On my website the user can search over about 2000 distinct terms. These terms relate to European architecture and art; they include place-names and materials. You’ll see one or more relevant pictures if you search on any of these terms. The relevance of my site to any of these terms is 0 (zero) according to Google. Fine. Perhaps it’s so. But if that’s so, if my site is identically useless and irrelevant to any of these terms, then why does Google constantly crawl my site, expand all my search terms, and (presumably) copy all my pictures to its server? In the hour or so that I’ve been writing this note the Googlebot has crawled my site 45 times with different search terms. This process has been going on for months. But if my site is useless for any purpose then why does Google copy all of its contents? Answer that, Google.
“Don’t be evil.” Sure.
Consider an alternative model. Take an ordinary useful website like Match.com. Match is a spectacularly successful web-site that matches people as potential partners. And although Match has a sugaring of self-help and dating tips pages its core is entirely created by millions of people seeking partners. In order to use Match you give a few personal details, specify your likes and dislikes, lie about your age, upload a flattering picture and you’re in business. If someone’s looking for a ‘male, 50’s, within 50 miles of Santa Clara’ then my picture appears the same day. There’s no delay; no involved rumination about ‘relevance’. If a woman on Match is looking for a man in his 50’s then my picture can appear in her results immediately.
Because it’s the users who create Match’s database Match is, in fact, a kind of ‘wiki’, that is, a web site created in the main by its users. And for this reason Match (unlike Google) doesn’t search: it finds. Imagine, as I suggested above, that Match used the Google ‘search’ model. It would have a humongous search ‘algorithm’ and would constantly be trolling the entire web, especially MySpace, Friendster and personal blog pages looking for entries relevant to ‘men’, ‘women’, ‘bi-curious’, etc. When you used this hypothetical Match you’d get lots of results that were of little use. If I, a man, were searching for a woman in her fifties, I’d be presented with results that included men and women of all ages but who seemed ‘relevant’ for my search based on friends’ links to their blogs or MySpace pages. If I tried to get into Match as a potential date myself I would have to ‘optimize’ my MySpace page and hope that the Match.com spider found me and it would still be months before anybody would see my Match entry and even then I would only be slightly relevant (perhaps 1 out of 10) for ‘Man seeking woman’. I’d have to elevate my ‘man’ ranking by deleting my links to Borders and FlowersByWire and begging Smith and Wesson and Harley-Davidson Motorcycles to link to my personal page. The reverse would also be true. Even if I were in the happiest of relationships, and without having done anything, I might suddenly find myself on Match advertised as ‘man seeking women’ (or, perhaps, ‘woman seeking man’) because something in my MySpace page had tripped the Match algorithm into thinking that I was dating.
For the activity of finding a partner this would be senseless, absurd; as it is absurd for every other topic of human interest.
The real distinction between Google and Match is this: Google provides a ‘search’. Match, being user-created, provides a ‘find’.
The most incriminating thing of all for Google? Have you ever wondered why almost always the first result in a search of any kind is a Wikipedia article (or worse, the damnable and unusable “JSTOR”)? It seems that almost always Wikipedia heads the list of search results. Why is that? How is it that a third-rate pile of unedited and impressionistic garbage that calls itself an ‘encyclopaedia’ nearly always heads up Google’s search results? How can a Wikipedia article nearly always be the most relevant result for almost any search? It’s because, since Google’s algorithm is useless for finding relevant results, it has to piggyback on other sites which ARE organized. Like Wikipedia, which is arranged alphabetically.
We’ve been here before. In the days before computers all sorts of complicated compendia were put together by human beings using human judgment and these products were far superior to the automated junk that Google and other search engines spew out. We used to call those compendia ‘encyclopedias’, ‘dictionaries’, ‘thesauruses’, ‘yellow pages’, and even Migne’s Patrologia Latina.
It’s time to bring back those times in one respect at least. We need to find a way to use computers to create accurate matchings between URLs and search phrases but to inject human intelligence into the process so that sites can be accurately evaluated for their relevance to keywords and in a responsive time frame - days, not months. How can that be done?
I call for the creation of a ‘find’ wiki. In this wiki actual human contributors would connect human-examined web sites with appropriate key-word phrases. Like ‘Wikipedia’, thousands or hundreds of thousands of contributors could submit simple forms that match site addresses with key words. For the most part web-masters would submit their own sites. Perfect. They could log on to our wiki and fill out a form that matches their URL with whatever search terms they’d like hits for. They could be limited to 5 such phrases of which, perhaps, one or two would be unchangeable by outsiders. Or, perhaps, each search phrase for the URL would be paired with a confidence number from 0 to 100 and that number could be modified by outsiders. Its current value could be the average of the last 1000 modifications (or some such scheme). In addition such a wiki would provide a ‘find’ engine for users. It would work just like Google’s ‘search’ engine except that it would actually provide useful results and it would be updated in real time. The database of URL-to-keyword matches would appear on a site that the entire internet community could access and update, just like Wikipedia. It wouldn’t be immune to the pressures of self-interest but those pressures would be on a much smaller scale than Google’s and would be further minimized by the updating of other anonymous contributors.
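To show that the bookkeeping involved is modest, here is a minimal in-memory sketch of the scheme just described: at most 5 phrases per URL, a confidence from 0 to 100 for each pairing, and a current value equal to the average of the last 1000 modifications. The names `FindWiki`, `Match`, `submit`, `rate` and `find` are my own inventions for illustration, and a real wiki would also need accounts, storage and some defense against abuse, none of which is attempted here.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_PHRASES_PER_URL = 5  # the limit suggested in the text
WINDOW = 1000            # average over the last 1000 modifications


@dataclass
class Match:
    """One URL-to-phrase pairing and its recent confidence ratings."""
    votes: deque = field(default_factory=lambda: deque(maxlen=WINDOW))

    def vote(self, value):
        if not 0 <= value <= 100:
            raise ValueError("confidence must be between 0 and 100")
        self.votes.append(value)  # the oldest rating drops out when full

    @property
    def confidence(self):
        return sum(self.votes) / len(self.votes) if self.votes else 0.0


class FindWiki:
    """Hypothetical store of human-supplied URL-to-phrase matches."""

    def __init__(self):
        self.by_url = {}  # url -> {normalized phrase -> Match}

    @staticmethod
    def _normalize(phrase):
        return " ".join(phrase.lower().split())

    def submit(self, url, phrase, confidence=100):
        """A web-master pairs a URL with a search phrase."""
        key = self._normalize(phrase)
        phrases = self.by_url.setdefault(url, {})
        if key not in phrases and len(phrases) >= MAX_PHRASES_PER_URL:
            raise ValueError(f"{url} already has {MAX_PHRASES_PER_URL} phrases")
        phrases.setdefault(key, Match()).vote(confidence)

    def rate(self, url, phrase, confidence):
        """An outsider adjusts the confidence of an existing pairing."""
        self.by_url[url][self._normalize(phrase)].vote(confidence)

    def find(self, phrase):
        """Return (url, confidence) pairs for a phrase, best first."""
        key = self._normalize(phrase)
        hits = [(url, phrases[key].confidence)
                for url, phrases in self.by_url.items() if key in phrases]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    wiki = FindWiki()
    wiki.submit("squinchpix.com", "Roman architecture", 90)
    wiki.rate("squinchpix.com", "roman architecture", 70)
    print(wiki.find("Roman Architecture"))  # [('squinchpix.com', 80.0)]
```

A submitted site is findable the moment the form is processed, and the rolling window means that a burst of self-serving ratings is eventually diluted by later contributors rather than fixed forever.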
For whoever undertakes the creation of such a find-wiki I’ll happily buy the domain names.
Robert H. Consoli
rconsoli@yahoo.com