Best way to search

They have: 105 posts

Joined: Jan 2002

The next phase of building my essay website (www.essayswap.com) is implementing a faster search engine on the site. Does anyone have any suggestions on the easiest way to search through 50,000 pages of text?

It doesn't seem that complicated, but for some reason the site hangs on occasion.

I'm using PHP and MySQL

CPRhosting.com - CPR Hosting...Giving life to the web
Custom-made Hosting Plans starting at $2

He has: 1,380 posts

Joined: Feb 2002

I tried it... it doesn't seem that slow to me, and I'm using a computer on a school LAN right now... lol

You are using a good combo. The only thing I could suggest would be to make the server faster, or maybe figure out a way to reorganize the DB so it can be searched faster. I don't know how to do this, as I am nowhere near being a MySQL practitioner, but I believe that configuration does have something to do with it. When in doubt, upgrade the server.

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

I'm guessing you're not searching HTML files, as 50,000 files would take an hour to search. So, my question is: how is your data stored/structured?

They have: 105 posts

Joined: Jan 2002

No, it's only text that's stored in a database. Right now the 50,000 essays aren't categorized (on my 'to do' list for the next 20 years) so the search has to go through each of the records. I've looked into things like htdig, but I really don't understand it.

The essays can be lengthy at times. Uh...not the ones I write, but the ones site visitors send in, and so I'm trying to minimize the time it takes to search the records.

Suzanne's picture

She has: 5,507 posts

Joined: Feb 2000

I would set up a keyword field, with a list of all the relevant keywords, and search only that field during the search. It will take you a bit more time when entering the essay, but save you a whole lot of time in the long run.

mairving's picture

They have: 2,256 posts

Joined: Feb 2001

A keyword field would be the way to go. What you could do is search based on keywords first and, if no results are found, then search the text. That way your most relevant results would come first.
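A minimal sketch of that keywords-first, text-second search. It is written in Python with an in-memory SQLite database purely for illustration (the site itself uses PHP and MySQL), and the `essays` table, its column names, and the `search_essays` helper are all assumptions, not the site's real schema.

```python
import sqlite3

def search_essays(conn, term):
    """Search the keywords column first; fall back to the essay body if nothing matches."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT essay_id, title FROM essays WHERE keywords LIKE ? ORDER BY essay_id",
        (pattern,),
    ).fetchall()
    if rows:
        return rows
    # No keyword match: run the slower search over the full text.
    return conn.execute(
        "SELECT essay_id, title FROM essays WHERE body LIKE ? ORDER BY essay_id",
        (pattern,),
    ).fetchall()

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE essays (essay_id INTEGER PRIMARY KEY, title TEXT, keywords TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO essays (title, keywords, body) VALUES (?, ?, ?)",
    [
        ("Inca roads", "inca, peru, empire", "I researched the Inca road network on the internet."),
        ("How routers work", "internet, networking", "Packets hop between routers."),
    ],
)
keyword_hits = search_essays(conn, "internet")  # found in the keywords column
fallback_hits = search_essays(conn, "road")     # no keyword match, so the body is searched
```

A `LIKE` pattern with a leading wildcard still has to look at every row, so any speed-up here comes from scanning a short keyword field rather than the whole essay.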

Mark Irving
I have a mind like a steel trap; it is rusty and illegal in 47 states

They have: 105 posts

Joined: Jan 2002

Suggestion needed: Should I create a script that extracts common words (a, the, and, or, for, etc) from the essays and then store the remaining words as keywords, or require visitors to enter keywords?

If you were a site visitor, would entering your own keywords be a bother?
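For reference, the script option could look roughly like this. It is a Python sketch for illustration only (the site runs on PHP); the stop-word list is a tiny made-up sample and `extract_keywords` is a hypothetical helper name.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "for", "of", "on", "in", "to", "with"}

def extract_keywords(text):
    """Return the distinct non-stop words of text, lowercased, in first-seen order."""
    seen = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in STOP_WORDS and word not in seen:
            seen.append(word)
    return seen

keywords = extract_keywords("The causes of the war and the effects of the war on Iraq")
```

Everything that is not on the stop-word list survives, so the quality of the keywords depends entirely on how that list is chosen.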

Suzanne's picture

She has: 5,507 posts

Joined: Feb 2000

1. If you're going to use a script, you will get very poor results for keywords. You really need to run it through a human, and pick out the main concepts, even if the words aren't in the essay. Additionally, you need to include spelling errors in the keyword list.

2. I can't imagine a time when using my own keywords wouldn't be ideal -- how are you going to know what I'm searching for? That said, you need to have both the ability to search (as Mark said, first keywords, then all text), and some way to narrow the search -- categories. So, ideally, you would be able to search within a particular category. Again, human sorting would be necessary.

3. Do you mean requiring the visitor SEEKING an essay to enter their own keywords, or requiring the person SUBMITTING an essay to enter their own keywords? If the latter, then yes, yes!

They have: 105 posts

Joined: Jan 2002

The person entering the essay would enter the keywords. But I'm still thinking that the search would be sped up slightly (if not more) if I created a script that stripped out the common words and stored what's left in the keyword field in the database. I'm just appreciative that folks are submitting essays to the database; I don't want to hassle them any more. :)

As for spell checking, hopefully we'll be getting around 50 essays/day in the near future, so I don't think that's feasible. But perhaps I can find a poor college student who wouldn't mind. :)

Suzanne's picture

She has: 5,507 posts

Joined: Feb 2000

The problem with the idea of pulling out keywords with a script is this: what is a common word?

If a person references their work on the Inca population with internet searches, for instance, then you'll get a lot of false hits for "internet". If you say, well, then "internet" is a common word, what do you do with essays on internet technologies?

I realize it's more work to put together a solid search feature but if you want the people who are looking for the essays to find them, you HAVE to put in that kind of effort. No one will use a service that doesn't provide them with reasonably easy to attain results.

One of the largest complaints from users is not being able to accurately find information on websites. They just give up and go elsewhere.

Mark Hensler's picture

He has: 4,048 posts

Joined: Aug 2000

I really feel you need to have an efficient index table. I'm not very familiar with the vBulletin code, but I have a basic understanding of their search mechanism.

Several tables are used to compose the index and perform searches. There is a `word` table which has a `word_id` and `word`. There is a `searchindex` table with `word_id`, `post_id`, and `intitle`. Then there is the `search` table with `search_id`, `query`, `post_ids`, and `dateline`.

When a new post (or essay) is submitted, you will add any new words in the essay to the `word` table. Then add the post_id (or essay_id) to the `searchindex` table for any words that are in the post (or essay).
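The indexing step could be sketched like this. It uses Python with SQLite for illustration only; the column types, the primary key on `searchindex`, the tokenizing regex, and the `index_essay` helper are my assumptions, not vBulletin's actual code.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE searchindex (
    word_id INTEGER, essay_id INTEGER, intitle INTEGER,
    PRIMARY KEY (word_id, essay_id)
);
"""

def index_essay(conn, essay_id, title, body):
    """Add an essay's words to `word` and link them to the essay in `searchindex`."""
    title_words = set(re.findall(r"[a-z]+", title.lower()))
    all_words = title_words | set(re.findall(r"[a-z]+", body.lower()))
    for w in sorted(all_words):
        # Only words not seen before create a new row in `word`.
        conn.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (w,))
        (word_id,) = conn.execute(
            "SELECT word_id FROM word WHERE word = ?", (w,)
        ).fetchone()
        conn.execute(
            "INSERT OR REPLACE INTO searchindex (word_id, essay_id, intitle) VALUES (?, ?, ?)",
            (word_id, essay_id, int(w in title_words)),
        )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
index_essay(conn, 1, "War in Iraq", "The war began in 2003.")
index_essay(conn, 2, "Inca roads", "No war here.")
word_count = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
```

Each distinct word is stored once in `word`, no matter how many essays use it; only `searchindex` grows with every essay.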

When a person searches for "war iraq", you will query the `word` table for both "war" and "iraq" and get the `word_id` for each. Then you will query the `searchindex` table for any posts that contain the word_id for "war" OR "iraq".
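A rough sketch of that lookup, again in Python with SQLite for illustration. It folds the two queries into a single join, and ranking essays by how many of the search words they contain is my own addition, not something I know vBulletin does. The sample rows are made up.

```python
import sqlite3

def search(conn, query):
    """Return essay_ids containing any query word, essays matching more words first."""
    words = query.lower().split()
    if not words:
        return []
    marks = ",".join("?" * len(words))  # one placeholder per search word
    rows = conn.execute(
        f"""SELECT si.essay_id, COUNT(*) AS hits
            FROM searchindex AS si JOIN word AS w ON w.word_id = si.word_id
            WHERE w.word IN ({marks})
            GROUP BY si.essay_id
            ORDER BY hits DESC, si.essay_id""",
        words,
    ).fetchall()
    return [essay_id for essay_id, _ in rows]

# Hypothetical, hand-built index: essay 10 has both words, essay 11 only "war".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE searchindex (word_id INTEGER, essay_id INTEGER, intitle INTEGER);
INSERT INTO word VALUES (1, 'war'), (2, 'iraq'), (3, 'inca');
INSERT INTO searchindex VALUES (1, 10, 1), (2, 10, 1), (1, 11, 0), (3, 12, 1);
""")
results = search(conn, "war iraq")
```

The essay text itself is never scanned; the query only touches the two index tables.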

Whenever a search is performed, the results of the search are temporarily stored in the `search` table. This helps keep the server load down. Let's say that you show 10 results per page, but there are 50 posts (essays) that match "war" OR "iraq". Well, rather than performing the search every time, we'll just pull the list of posts out of the `search` table and show the next 10. Also, if two people search for the same thing, the system will see that the search was performed only 5 minutes ago, and use the cached results in the `search` table. The system may be configurable to use cached results for a variable length of time (setting it to daily would mean it would perform any given query only once per day).
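The caching idea might look something like this. It is a Python/SQLite sketch under my own assumptions: `post_ids` is stored as a comma-separated string, `dateline` is a Unix timestamp, and `CACHE_SECONDS`, `cached_search`, `page`, and `fake_search` are hypothetical names.

```python
import sqlite3
import time

CACHE_SECONDS = 300  # assumed freshness window (five minutes)

def cached_search(conn, query, run_search, now=None):
    """Return ids for query, reusing a stored result if it is fresh enough."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT post_ids FROM search WHERE query = ? AND dateline >= ? "
        "ORDER BY dateline DESC LIMIT 1",
        (query, now - CACHE_SECONDS),
    ).fetchone()
    if row is not None:
        return [int(x) for x in row[0].split(",") if x]
    ids = run_search(query)  # cache miss: do the real search and store it
    conn.execute(
        "INSERT INTO search (query, post_ids, dateline) VALUES (?, ?, ?)",
        (query, ",".join(map(str, ids)), now),
    )
    return ids

def page(ids, page_number, per_page=10):
    """Slice one page (1-based) out of a cached id list."""
    start = (page_number - 1) * per_page
    return ids[start:start + per_page]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search (search_id INTEGER PRIMARY KEY, query TEXT, post_ids TEXT, dateline REAL)"
)
calls = []

def fake_search(query):
    """Stand-in for the real index lookup; records each call."""
    calls.append(query)
    return list(range(1, 51))

first = cached_search(conn, "war iraq", fake_search, now=1000.0)
second = cached_search(conn, "war iraq", fake_search, now=1200.0)  # fresh: served from cache
third = cached_search(conn, "war iraq", fake_search, now=2000.0)   # stale: searched again
```

Paging through the 50 results then only slices the stored list, e.g. `page(first, 2)` for results 11 to 20, without repeating the search.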

To enhance this and keep the database smaller, you could eliminate 'noise' words from the `word` table. vBulletin restricts its table by requiring a minimum number of characters per word (configurable).

Mark Hensler
If there is no answer on Google, then there is no question.
