how search engines work...
I'm developing a project that requires a site search engine. Basically, I want the whole site to be searchable: content, news, forum posts, chat logs, etc.
I've looked into the way a few forums' search engines work, like vBulletin's, and the index-every-word method is extremely limited; for instance, you can't search for a phrase, because each word is indexed separately. Not to mention it seems inefficient.
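To illustrate what I mean, here's a rough Python sketch (purely illustrative, not vBulletin's actual code) of a word-level index. If you also store each word's position, phrase search becomes possible, which the per-word approach I've seen doesn't allow:

```python
from collections import defaultdict

# Positional inverted index: word -> {doc_id: [positions]}
# (a hypothetical sketch, not how any real forum stores its index)
index = defaultdict(lambda: defaultdict(list))

def add_document(doc_id, text):
    for pos, word in enumerate(text.lower().split()):
        index[word][doc_id].append(pos)

def phrase_search(phrase):
    """Return doc_ids where the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    # Narrow down to documents that contain every word at all
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = set()
    for doc in candidates:
        # A phrase match is a run of consecutive positions
        for start in index[words[0]][doc]:
            if all(start + i in index[w][doc] for i, w in enumerate(words)):
                hits.add(doc)
                break
    return hits

add_document(1, "the quick brown fox")
add_document(2, "brown the quick fox")
print(phrase_search("quick brown"))  # only doc 1 has the words adjacent
```

Both documents contain "quick" and "brown", but only doc 1 has them next to each other, so a plain word-level index would match both while the positional check matches just one.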
Now I look at something like Google, where billions of entire pages are searched and sorted in a fraction of a second. Obviously their resources are vastly greater than mine, but I only want to search one website (my own), not all of them. Surely my web server could handle indexing and searching its own content.
So, I'm looking for insight into how the big search engines get this done: how they store and index their data, and how billions of web pages can be searched in mere milliseconds.
Peter J. Boettcher posted this at 19:52 — 19th August 2002.
They have: 812 posts
Joined: Feb 2000
I'm not a search engine expert, but I would imagine what the big search engines do isn't terribly different from what you or I would do to search our own sites, just taken to the next level. All the data would be stored in efficiently indexed tables (system level) in database farms, and the data would be extracted by efficient search algorithms.
They might do stuff like ignore common words (the, and, etc.) to improve search speed. They also might not re-execute similar searches: if I logged into Google and searched for "Thai Food", then you logged in a few hours later and searched for "Thai Food", you wouldn't be searching the whole database again, just getting the results I already fetched (as HTML). It would only re-execute the query if the cached copy were older than a set amount of time (say, a day).
As for set-up, that really depends on what you're storing, but a basic set-up could be:
SearchMain table:
SearchMainID (PK Identity)
PageName varchar(50)
PageTitle varchar(50)
PageIndexDate smalldatetime
SearchItem table:
SearchItemID (PK Identity)
ItemMainID (from SearchMain table)
PageContent (text)
I separated it into 2 tables to improve performance. There may be times when only the main table needs to be used; by separating the PageContent info into another table, performance is kept high (by avoiding the cursor having to scan through a slow text field when it doesn't have to).
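Here's that two-table layout as a runnable sketch using Python's sqlite3 (column types adapted from the SQL Server ones above, since SQLite has no smalldatetime; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SearchMain (
        SearchMainID  INTEGER PRIMARY KEY AUTOINCREMENT,
        PageName      VARCHAR(50),
        PageTitle     VARCHAR(50),
        PageIndexDate TEXT
    );
    CREATE TABLE SearchItem (
        SearchItemID  INTEGER PRIMARY KEY AUTOINCREMENT,
        ItemMainID    INTEGER REFERENCES SearchMain(SearchMainID),
        PageContent   TEXT
    );
""")
conn.execute("INSERT INTO SearchMain (PageName, PageTitle, PageIndexDate) "
             "VALUES ('index.html', 'Home', '2002-08-19')")
conn.execute("INSERT INTO SearchItem (ItemMainID, PageContent) "
             "VALUES (1, 'Welcome to the site')")

# Cheap query: titles only, never touches the big text column
titles = conn.execute("SELECT PageTitle FROM SearchMain").fetchall()

# Full-content search joins in SearchItem only when needed
rows = conn.execute("""
    SELECT m.PageName
    FROM SearchMain m
    JOIN SearchItem i ON i.ItemMainID = m.SearchMainID
    WHERE i.PageContent LIKE '%Welcome%'
""").fetchall()
print(titles, rows)
```

Queries that only need titles or index dates stay in the small SearchMain table; the slow text column is only scanned when a content search actually joins in SearchItem.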
Have fun!
PJ | Are we there yet?
pjboettcher.com
ROB posted this at 17:13 — 4th September 2002.
They have: 447 posts
Joined: Oct 1999
Hey, thanks for the reply Peter, and sorry I didn't acknowledge it sooner.
I ran across this purely by chance last night: The Anatomy of a Large-Scale Hypertextual Web Search Engine, which was written by the founders of Google. I found it fascinating, and am now enlightened.
Mark Hensler posted this at 18:08 — 4th September 2002.
He has: 4,048 posts
Joined: Aug 2000
Looks like I have something to read for my boring 1-4PM class today.
Mark Hensler posted this at 18:11 — 4th September 2002.
He has: 4,048 posts
Joined: Aug 2000
Woh, neato!
The ID of this TWF thread is 19021. The URL of that doc includes 1921.