Sunday 9 November 2008

Where Internet Search Fails - Outsmart Google

You know, the thing about being a geek is that it is a character trait. Something inborn that you can't change. You can become a financier, a businessman, an artist or a comedian, but it's when you retire to your room, in your warm comfy bed, free of society and alone with your thoughts, that who you truly are surfaces, when no one is looking.
I am a lot of things, but, when all is said and done, I am a geek, and a techie geek (a geek, by definition, is someone so passionate about something that it defines them; it's not restricted to technology: you can be a photography geek, or a stamp geek...). I always wanted to be a techie, because I always was a geek. I do it for the joy. It's a geek's world, and a techie heaven, and I always wanted to be the guy who makes people look silly when it comes to computers. Now I am one, and I may be into a lot of things, like running the London Bridge Festival club, and being into finance and business, but the moment someone so much as mentions anything remotely techie, I drift away on techie tangents and the internals of how it works, yada yada yada... you can't run away from who you are.

This is, by far, a geek post, with a capital G.

Anyway, we were talking the other day about search engine optimization, and I instantly broke it down in my mind, following my usual habit of coming up with an innovative idea, building a business around it, and running it into the ground on its own flaws within 30 minutes. (Bad, bad habit. Come on, bright idea, please come! I'm poor!)

I am sure you have all heard of this small company called Google. Well, they are big on internet search. I believe a large part of why Google is so successful is that they have an empty, basic homepage with just the search bar. It's human psychology: simplicity is the way to keep people coming back. Now, behind that search bar sits the famous PageRank algorithm, plus a pile of other ranking signals, where Google calculates how high a page should rank by computing all these different factors, such as inbound links, outbound links, keywords, meta tags, and the works. The idea is to come up with a number associated with each page for a given search query, and to rank the pages accordingly.
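
To make that concrete, here is a minimal Python sketch of the link-voting part of the idea (PageRank proper): pages pass their score along their outbound links, and you iterate until the numbers settle. The link graph below is completely invented, and the real formula folds in many more signals and is, as I said, not public.

    # Toy PageRank: each page splits its score among the pages it links to,
    # and we iterate until the scores settle. The link graph is made up.
    links = {
        "a.com": ["b.com", "c.com"],
        "b.com": ["c.com"],
        "c.com": ["a.com"],
    }

    def toy_pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
            rank = new_rank
        return rank

    print(toy_pagerank(links))  # pages with more inbound "votes" score higher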

In addition, they have something called an index, which is something like a giant lookup table to which each and every page (and I mean each, well, almost) on the internet maps. So, when you have that index and you score each page with PageRank and friends, you get your search engine results, ranked by the closest match between your query and the entries in that index. The complete formula is not released by Google, or by any of the other search engines on the web, but the idea is that they try to "approximate" which entries in that index are most relevant to your query.
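
For the curious, here is a rough Python sketch of the lookup idea: map every word to the set of pages that contain it, then intersect those sets for the words in the query. Real indexes are vastly bigger and cleverer, and the page contents here are obviously made up.

    # Build a toy index: word -> set of pages containing that word.
    pages = {
        "page1": "the quick brown fox",
        "page2": "the lazy dog sleeps",
        "page3": "quick dog tricks",
    }

    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)

    def lookup(query):
        # Keep only the pages that contain every word of the query.
        sets = [index.get(word, set()) for word in query.split()]
        return set.intersection(*sets) if sets else set()

    print(lookup("quick dog"))  # {'page3'}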

Trick: if you want a rough estimate of the size of Google's index, just enter a random large string of characters like "fdjadkmfjakjfdakjsangfkjgsafsa" into Google. You will get one or zero actual results, but it will say something like "Results 1 - 10 of about 19,000,000,000 pages", from which you can estimate the number of pages in Google's index. Lately, though, it seems Google has noticed that, and when you enter such a random string it no longer spits out the index size; it just rejects the query. This is probably because of the index-size debate, with Yahoo claiming to have 20 billion pages in its index, Google claiming to have three times that, and the new company Cuil (pronounced "cool", a company which is going nowhere) claiming to have twice the index of Google. (Funny that their website crashed the day it was launched.)

This is all nitty gritty and nice, except there is an upper bound on what internet search can do. Your search results are limited to the text the user enters. No matter which way you want to put it, and no matter how many web crawlers and algorithms the engines apply, they just can't get into a user's mind. You will only find what you need or want on the internet if you know what you are looking for. That is, if you are clueless about what you are searching for, say "that movie where the guy says 'bloody as hell'", and you just enter "bloody as hell", you probably won't get very far. Entering "bloody as hell movie script" will probably have Pulp Fiction in your first few hits and you are there. But the point here is that your brain did most of the narrowing down of the set you want to search. It's not very difficult for a search engine to have algorithms that figure out that when you enter the word "movie" after some text, it should look first in areas related to movies. You made the job straightforward by adding the word "script". Piece of cake. In simplified terms, the engine would do this: movies subset of index -> scripts -> sentence -> "bloody as hell". (Now play drums.)
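
Roughly like this, in a toy Python version. The documents, fields, and categories are invented for illustration; a real engine is far subtler, but the narrowing step your brain performed is the same.

    # Extra words like "movie" and "script" let the engine shrink the
    # candidate set before matching the phrase. All data here is made up.
    documents = [
        {"category": "movies", "type": "script", "title": "Pulp Fiction",
         "text": "... bloody as hell ..."},
        {"category": "news", "type": "article", "title": "Weather report",
         "text": "rain, bloody awful rain"},
    ]

    def narrow_then_match(phrase, category=None, doc_type=None):
        candidates = documents
        if category:                  # "movie" -> restrict to the movies subset
            candidates = [d for d in candidates if d["category"] == category]
        if doc_type:                  # "script" -> restrict further
            candidates = [d for d in candidates if d["type"] == doc_type]
        return [d["title"] for d in candidates if phrase in d["text"]]

    print(narrow_then_match("bloody as hell", category="movies", doc_type="script"))
    # ['Pulp Fiction']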

The great Alan Turing set the upper bound for what computers can do with his Turing machine, showing that there are problems no computer can ever solve, and his Turing test remains the yardstick for whether a machine can be said to think, a bar no program has cleared. A similar bound applies to search engines: they can't get into your head.

These days, with the growth of the internet and the immense computing power behind a massive search engine like Google, they could maybe index each and every word, but a better move is to harness user data: keep logs of the search queries, mine the data, and make a better educated guess. If most people who enter "sex for free" end up on the same website, even if its PageRank is shite, Google will notice that and will "bump up" its rank in the search results, using guesstimates based on visits. It's the same with social networks and getting users to create Google accounts and all that stuff. Google doesn't care about reading your emails, or knowing which pages you visited, to spy on you or find out whether you are a pervert. What they really want is your data as a "user", to give them a better search engine. (At least I like to think so.)
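
Here is a hedged Python sketch of that "bump-up" idea: blend a page's link-based score with how often past users who typed the same query picked it. The site names, weights, and click counts are all invented; nobody outside Google knows the real recipe.

    # Pretend link-based scores and a mined query-click log (all made up).
    base_score = {"site-a.com": 0.9, "site-b.com": 0.2}
    click_log = {("sex for free", "site-b.com"): 9000,
                 ("sex for free", "site-a.com"): 10}

    def rerank(query, pages, click_weight=0.5):
        def score(page):
            clicks = click_log.get((query, page), 0)
            total = sum(click_log.get((query, p), 0) for p in pages) or 1
            return (1 - click_weight) * base_score[page] + click_weight * clicks / total
        return sorted(pages, key=score, reverse=True)

    print(rerank("sex for free", ["site-a.com", "site-b.com"]))
    # site-b.com wins despite its poor link score, because users keep picking it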

But these are just variations and optimizations of the same thing: trying to figure out what the user is thinking by comparing it to what most people were thinking when they entered the same thing. That is just a guess, though, and there is always the X variable that can negate it. I may be a weird dude "looking where everyone has looked before, but thinking what no one has thought before"! (paraphrased from Albert Szent-Györgyi)

You want a trick to break Google? Try to find out what "the" is in English grammar. Google drops any occurrence of "the" because it's way too common. Entering "definition: the" helps a bit, but not that much either. But by using "definition:" you already proved my point about humanly limiting the search universe :)
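
You can see why in a tiny Python sketch of stop-word stripping (the stop-word list here is a guess; real engines keep their own).

    # Common "stop words" are typically dropped before the index is consulted.
    STOP_WORDS = {"the", "a", "an", "of", "and", "to"}

    def clean_query(query):
        return [w for w in query.lower().split() if w not in STOP_WORDS]

    print(clean_query("the"))                # [] -> nothing left to search for
    print(clean_query("definition of the"))  # ['definition'] -> you did the narrowing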

They will mine my website and guess my trick and stop it, and I will find another one, because...... I am a geek who drinks too much coffee.
