how search engines work...
I'm developing a project that requires a site search engine. Basically, I want the whole site to be searchable: content, news, forum posts, chat logs, etc.
I've looked into the way a few forums' search engines work, like vBulletin's, and the index-every-word method is extremely limited; for instance, you can't search for a phrase, because each word is indexed separately. Not to mention it seems inefficient.
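To illustrate what I mean, here's a rough Python sketch (purely illustrative, not vBulletin's actual code) of a word-level index. If you also store each word's position, phrase search becomes possible, which the per-word approach I've seen doesn't allow:

```python
from collections import defaultdict

# Positional inverted index: word -> {doc_id: [positions]}
# (a hypothetical sketch, not how any real forum stores its index)
index = defaultdict(lambda: defaultdict(list))

def add_document(doc_id, text):
    for pos, word in enumerate(text.lower().split()):
        index[word][doc_id].append(pos)

def phrase_search(phrase):
    """Return doc_ids where the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    # Narrow down to documents that contain every word at all
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = set()
    for doc in candidates:
        # A phrase match is a run of consecutive positions
        for start in index[words[0]][doc]:
            if all(start + i in index[w][doc] for i, w in enumerate(words)):
                hits.add(doc)
                break
    return hits

add_document(1, "the quick brown fox")
add_document(2, "brown the quick fox")
print(phrase_search("quick brown"))  # only doc 1 has the words adjacent
```

Both documents contain "quick" and "brown", but only doc 1 has them next to each other, so a plain word-level index would match both while the positional check matches just one.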
Now I look at something like Google, where billions of entire pages are searched and sorted in a fraction of a second. Obviously their resources are vastly greater than mine, but I only want to search one website (my own), not all of them. Surely my web server could handle indexing and searching its own content.
So, I'm looking for insight into how the big search engines get this done: how they store and index their data, and how billions of web pages can be searched in mere milliseconds.
Peter J. Boettcher posted this at 19:52 — 19th August 2002.
They have: 812 posts
Joined: Feb 2000
I'm not a search engine expert, but I would imagine what the big search engines do isn't terribly different from what you or I would do to search our own sites, just taken to the next level. All the data would be stored in efficiently indexed tables (system level) in database farms, and the data would be extracted by efficient search algorithms.
They might do stuff like ignore common words (the, and, etc.) to improve search speed. They also might not re-execute similar searches: if I logged into Google and searched for "Thai Food", then you logged in a few hours later and searched for "Thai Food", you wouldn't be searching the whole database again, just getting the results I already fetched (as HTML). It would only re-execute the query if the cached copy were older than a set amount of time (say, a day).
As for set-up, that really depends on what you're storing, but a basic set-up could be:
SearchMain table:
SearchMainID (PK Identity)
PageName varchar(50)
PageTitle varchar(50)
PageIndexDate smalldatetime
SearchItem table:
SearchItemID (PK Identity)
ItemMainID (from SearchMain table)
PageContent (text)
I separated it into 2 tables to improve performance. There may be times when only the main table needs to be used; by separating the PageContent info into another table, performance is kept high (by avoiding the cursor having to scan through a slow text field when it doesn't have to).
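Here's that two-table layout as a runnable sketch using Python's sqlite3 (column types adapted from the SQL Server ones above, since SQLite has no smalldatetime; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SearchMain (
        SearchMainID  INTEGER PRIMARY KEY AUTOINCREMENT,
        PageName      VARCHAR(50),
        PageTitle     VARCHAR(50),
        PageIndexDate TEXT
    );
    CREATE TABLE SearchItem (
        SearchItemID  INTEGER PRIMARY KEY AUTOINCREMENT,
        ItemMainID    INTEGER REFERENCES SearchMain(SearchMainID),
        PageContent   TEXT
    );
""")
conn.execute("INSERT INTO SearchMain (PageName, PageTitle, PageIndexDate) "
             "VALUES ('index.html', 'Home', '2002-08-19')")
conn.execute("INSERT INTO SearchItem (ItemMainID, PageContent) "
             "VALUES (1, 'Welcome to the site')")

# Cheap query: titles only, never touches the big text column
titles = conn.execute("SELECT PageTitle FROM SearchMain").fetchall()

# Full-content search joins in SearchItem only when needed
rows = conn.execute("""
    SELECT m.PageName
    FROM SearchMain m
    JOIN SearchItem i ON i.ItemMainID = m.SearchMainID
    WHERE i.PageContent LIKE '%Welcome%'
""").fetchall()
print(titles, rows)
```

Queries that only need titles or index dates stay in the small SearchMain table; the slow text column is only scanned when a content search actually joins in SearchItem.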
Have fun!
PJ | Are we there yet?
pjboettcher.com
ROB posted this at 17:13 — 4th September 2002.
They have: 447 posts
Joined: Oct 1999
Hey, thanks for the reply Peter, and sorry I didn't acknowledge it sooner.
I ran across this purely by chance last night: The Anatomy of a Large-Scale Hypertextual Web Search Engine, which was written by the founders of Google. I found it fascinating, and am now enlightened.
Mark Hensler posted this at 18:08 — 4th September 2002.
He has: 4,048 posts
Joined: Aug 2000
Looks like I have something to read for my boring 1-4PM class today.
Mark Hensler posted this at 18:11 — 4th September 2002.
He has: 4,048 posts
Joined: Aug 2000
Woh, neato!
The ID of this TWF thread is 19021. The URL of that doc includes 1921.