Google search engine programming
Howdy,
I want to know how Google, WebCrawler, etc. search engines really work, as I am learning PHP programming and want to write a search engine.
I have read around 10 websites, found on Google, about “how search engines work”, and not a single one of them makes it clear whether it is the spider, the index, or the search software that does the ranking according to its ranking algorithm.
All they ever say is that a search engine has 3 pieces of software:
a) the spider
b) the index
c) the search system (search-box, template, etc.)
The spiders crawl the web collecting webpages and forward them to the index; the search software then searches the index for the sought keywords/phrases.
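In PHP terms, I picture the flow something like this (just my mental model; the file names are made up and I assume the index/ folder exists):

<?php
// My mental picture of the 3 parts:

// 1. The spider: fetch a page's HTML
$html = file_get_contents('http://example.com/');

// 2. The index: keep the raw HTML on the search engine's own server
file_put_contents('index/example.com.html', $html);

// 3. The search system: scan what was stored for the sought keyword
$stored = file_get_contents('index/example.com.html');
if (stripos($stored, 'juice') !== false) {
    echo "Match: http://example.com/\n";
}
?>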
Also, some say that the spiders copy the whole website into the index. So, in other words, there are two copies of a website: one residing on the website owner's web server and the other residing in the search engine's index.
So now, from all this I can only assume 3 possibilities for how a search engine works:
1.
The spider does not do the ranking according to any algorithm.
All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index.
The index is nothing but a big text file (.txt, .html) on the search engine's web server that keeps a full copy (the HTML code) of each website.
The search system, when searching and finding links (in the index), assigns the ranking according to the search engine's ranking algorithm.
This means neither the spider nor the index is responsible for the ranking, because these 2 parts of the search engine are not taught the ranking algorithm.
OR
2.
The spider does the ranking according to the search engine's ranking algorithm.
It visits a website, grabs all its HTML code (copies the website) and then finally dumps the HTML code into the index. When it dumps the copies of websites, it ranks them according to the search engine's algorithm.
The index is nothing but a big text file (.txt, .html) on the search engine's web server that keeps a full copy (the HTML code) of each website.
The search system, when searching and finding links (in the index), does not assign the ranking according to the search engine's ranking algorithm, because that has already been done by the spider when dumping the data into the index.
This means the spider is responsible for the ranking; neither the index nor the search system is responsible for it, because these 2 parts of the search engine are not taught the ranking algorithm.
OR
3.
The spider does not do the ranking according to any algorithm.
All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index.
The index is not only a big text file (.txt, .html) on the search engine's web server that keeps a full copy (the HTML code) of each website, but also the system that does the ranking.
When it receives data from the spider, it ranks the links in its database according to the search engine's ranking algorithm.
The search system, when searching and finding links (in the index), does not assign the ranking according to the search engine's ranking algorithm.
Frankly, all it does is output a copy of certain parts of the index onto the searcher's screen.
This means neither the spider nor the search system is responsible for the ranking, because these 2 parts of the search engine are not taught the ranking algorithm.
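To make my 3 assumptions concrete, here is a rough PHP sketch of where the ranking algorithm could sit (the scoring here is a dummy I invented, not any real algorithm):

<?php
// Pretend the spider already fetched these pages:
$pages = [
    'http://example.com/a' => '<html>...apple juice recipes...</html>',
    'http://example.com/b' => '<html>...orange juice facts...</html>',
];

// Assumption 2: the SPIDER ranks while dumping pages into the index.
// Assumption 3: the INDEX ranks as it receives pages from the spider.
// Either way, a score would be stored next to each page:
$index = [];
foreach ($pages as $url => $html) {
    $index[$url] = ['html' => $html, 'score' => strlen($html)]; // dummy "algorithm"
}

// Assumption 1: the SEARCH SYSTEM ranks at query time instead,
// ignoring any stored score and computing its own:
$query = 'juice';
$results = [];
foreach ($index as $url => $entry) {
    if (stripos($entry['html'], $query) !== false) {
        $results[$url] = substr_count(strtolower($entry['html']), $query); // dummy "algorithm"
    }
}
arsort($results); // highest score first
print_r(array_keys($results));
?>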
So, which of the 3 assumptions above is correct?
OK, I am not thinking of competing with Google, but you should understand that I want to run a search engine; it should have a spider, an index and a search facility, and I should be able to teach it ranking algorithms.
The ready-made web scripts out there do not let the admin teach the search engine (that runs on these scripts) their own ranking algorithm.
The web-script development company built the ranking algorithms, and we admins cannot change them.
The major search engines can change their ranking algorithms from time to time when they find out that webmasters have guessed their ranking algorithms and are abusing them to get their irrelevant websites ranked high under every keyword under the sky.
E.g.:
I run a search engine using a ready-made web script. My search engine one day gets popular. Now, you decide to get traffic to your website from it.
You check what ready-made web script I am using, buy that script, experiment on it and figure out the ranking algorithm.
Now, you falsely optimise your website so it ranks high under every keyword on my search engine, even keywords that are not really related to your website. Sooner or later, people dump my search engine. My venture comes to a dead end.
To avoid all this, I must be able to change my ranking algorithm when I find out that webmasters have discovered it and are abusing it.
Typically, these ready-made search-engine web scripts do not let the admin change the ranking algorithm or create their own algorithms.
Also, what is a peer-to-peer search engine?
CptAwesome posted this at 09:20 — 18th December 2004.
Ok, let me clarify a few things:
The spider goes and pulls the pages; if you have all the info on your own server, you don't need a spider.
The index is a store of all the raw data; if you have all the info on your own server, this is less important.
What does the brunt of the work is the keyword system. To search the web, a lot of information has to be processed, and pages are ranked in a large database.
When the front end puts the query to the database, it goes "Oh, here we go" and sends you the link(s).
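For example, with a keyword table in a database, the front end's query boils down to something like this (the table layout, connection details and score column are all made up for illustration):

<?php
// Hypothetical schema: keywords(keyword VARCHAR, url VARCHAR, score INT)
$db = new PDO('mysql:host=localhost;dbname=engine', 'user', 'pass');

$stmt = $db->prepare(
    'SELECT url FROM keywords WHERE keyword = ? ORDER BY score DESC LIMIT 10'
);
$stmt->execute(['juice']);

// The database goes "Oh, here we go" and hands back the links:
foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $url) {
    echo $url, "\n";
}
?>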
On a much smaller scale, you can simply have all your information in a text file, separated in a logical fashion; the end user queries it, and it gets processed.
If you know UNIX/Linux, it's like grep. On php.net, look up preg_grep and ereg; if you do anything with searching, you'll want to know preg and ereg.
Regular expression resources:
http://regexlib.com/
and preg (Perl-compatible) is similar to, but has key differences from, the POSIX-style regular expressions that ereg uses.
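A minimal sketch of that grep-style approach with preg_grep, assuming a made-up "url|page text" record on each line of the index file:

<?php
// One record per line in the index file, e.g. "http://example.com/a|apple juice recipes"
$lines = file('index.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// preg_grep returns every line matching the pattern ('i' = case-insensitive)
$matches = preg_grep('/juice/i', $lines);

foreach ($matches as $line) {
    list($url, $text) = explode('|', $line, 2);
    echo $url, "\n";
}
?>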
The second way in which a search engine can work is solely from keywords. This is probably a faster, but not necessarily better, solution.
After the spider has gone and done its thing, you now have, let's say, 1000 pages from the web. Each page can be broken down into its key phrases (say, any word longer than 4 characters?). Then there is a database, which has this huge list of keywords, and with each keyword an address to a page. So when the end user, using the front-end script, requests "juice", any of the addresses with the keyword "juice" can be put into an array, and you can format that back to the end user.
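In code, building that keyword list and answering the "juice" request could look roughly like this (the pages and the 4-character cut-off are just the example above):

<?php
// Pretend the spider already fetched these pages:
$pages = [
    'http://example.com/a' => 'Fresh apple juice recipes for summer',
    'http://example.com/b' => 'Orange juice nutrition facts',
    'http://example.com/c' => 'Car repair tips',
];

// Break each page into keywords (any word longer than 4 characters)
// and map keyword => list of page addresses:
$keywords = [];
foreach ($pages as $url => $text) {
    foreach (str_word_count(strtolower($text), 1) as $word) {
        if (strlen($word) > 4) {
            $keywords[$word][] = $url;
        }
    }
}

// End user requests "juice": collect the matching addresses into an array
$request = 'juice';
$addresses = isset($keywords[$request]) ? array_unique($keywords[$request]) : [];
print_r($addresses); // format this back to the end user
?>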
The Google PageRank system will just up the priority of certain addresses, based on its own ideals.
So in that case it is all the index, but the index is a lot more complex than you might first have imagined.
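That priority boost is just an extra per-address weight applied when ordering the results; a sketch (the numbers are invented, not Google's actual PageRank):

<?php
// Hypothetical per-address priorities the engine maintains, "based on its own ideals":
$priority = [
    'http://example.com/a' => 3.5,
    'http://example.com/b' => 1.0,
];

// Addresses that matched the keyword, each with a raw keyword score:
$matches = ['http://example.com/a' => 2, 'http://example.com/b' => 5];

// Final order = keyword score weighted by the page's priority
$ranked = [];
foreach ($matches as $url => $score) {
    $ranked[$url] = $score * ($priority[$url] ?? 1.0);
}
arsort($ranked);
print_r(array_keys($ranked));
?>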