Best way to search
The next phase of building my essay website (www.essayswap.com) is implementing a faster search engine on the site. Does anyone have suggestions on the easiest way to search through 50,000 pages of text?
It doesn't seem that complicated, but for some reason the site hangs on occasion.
I'm using PHP and MySQL.
CPRhosting.com - CPR Hosting...Giving life to the web
Custom-made Hosting Plans starting at $2
kb posted this at 16:24 — 3rd April 2003.
He has: 1,380 posts
Joined: Feb 2002
I tried it... doesn't seem that slow to me... and I'm using a computer on a school LAN right now... lol
You are using a good combo. The only thing I could suggest is to make the server faster, or maybe figure out a way to reorganize the DB so it can be searched faster. I don't know how to do this, as I'm nowhere near being a MySQL practitioner, but I believe configuration has something to do with it. When in doubt, upgrade the server.
Mark Hensler posted this at 17:40 — 3rd April 2003.
He has: 4,048 posts
Joined: Aug 2000
I'm guessing you're not searching HTML files, as 50,000 files would take an hour to search. So, my question is: how is your data stored/structured?
shanda posted this at 09:41 — 4th April 2003.
They have: 105 posts
Joined: Jan 2002
No, it's only text that's stored in a database. Right now the 50,000 essays aren't categorized (on my 'to do' list for the next 20 years) so the search has to go through each of the records. I've looked into things like htdig, but I really don't understand it.
The essays can be lengthy at times. Uh...not the ones I write, but the ones site visitors send in, and so I'm trying to minimize the time it takes to search the records.
Suzanne posted this at 15:10 — 4th April 2003.
She has: 5,507 posts
Joined: Feb 2000
I would set up a keyword field, with a list of all the relevant keywords, and search only that field during the search. It will take you a bit more time when entering the essay, but save you a whole lot of time in the long run.
mairving posted this at 15:29 — 4th April 2003.
They have: 2,256 posts
Joined: Feb 2001
A keyword field would be the way to go. What you could do is search on keywords first and, if no results are found, then search the text. That way your most relevant results would come first.
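A rough sketch of that keywords-first, text-second idea. It is written in Python with SQLite only so it runs standalone; the same two queries would work from PHP against MySQL. The `essays` table and its `keywords` and `body` columns are assumed names, not the site's real schema.

```python
import sqlite3

def search_essays(conn, term):
    """Search the keyword column first; scan the essay body only if nothing matches."""
    pattern = f"%{term.lower()}%"
    rows = conn.execute(
        "SELECT essay_id FROM essays WHERE lower(keywords) LIKE ? ORDER BY essay_id",
        (pattern,),
    ).fetchall()
    if not rows:
        # Fallback: the slower scan over the full essay text.
        rows = conn.execute(
            "SELECT essay_id FROM essays WHERE lower(body) LIKE ? ORDER BY essay_id",
            (pattern,),
        ).fetchall()
    return [r[0] for r in rows]

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE essays (essay_id INTEGER PRIMARY KEY, keywords TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO essays (essay_id, keywords, body) VALUES (?, ?, ?)",
    [
        (1, "inca, peru, empire", "The Inca population declined after contact with Europe."),
        (2, "internet, networking", "Packet switching underlies the modern internet."),
    ],
)
```

Searching for "inca" is answered from the keyword column alone, while "packet" only matches after falling back to the body.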
Mark Irving
I have a mind like a steel trap; it is rusty and illegal in 47 states
shanda posted this at 15:37 — 4th April 2003.
They have: 105 posts
Joined: Jan 2002
Suggestion needed: Should I create a script that extracts common words (a, the, and, or, for, etc) from the essays and then store the remaining words as keywords, or require visitors to enter keywords?
If you were a site visitor, would entering your own keywords be a bother?
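For reference, the script option could look roughly like this. It is a Python sketch rather than PHP, and the stop list is a made-up, deliberately short one.

```python
import re

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "for", "of", "to", "in", "is"}

def extract_keywords(text):
    """Return the distinct non-stop words in text, lowercased, in first-seen order."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # dict.fromkeys drops duplicates while preserving order.
    return list(dict.fromkeys(w for w in words if w not in STOP_WORDS))
```

The result could be joined with spaces or commas and stored in the keyword field when an essay is submitted.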
Suzanne posted this at 17:25 — 4th April 2003.
She has: 5,507 posts
Joined: Feb 2000
1. If you're going to use a script, you will get very poor results for keywords. You really need to run it through a human, and pick out the main concepts, even if the words aren't in the essay. Additionally, you need to include spelling errors in the keyword list.
2. I can't imagine a time when using my own keywords wouldn't be ideal -- how are you going to know what I'm searching for? That said, you need to have both the ability to search (as Mark said, first keywords, then all text), and some way to narrow the search -- categories. So, ideally, you would be able to search within a particular category. Again, human sorting would be necessary.
3. Do you mean require the visitor SEEKING an essay to enter their own keywords, or require the person SUBMITTING an essay to enter their own keywords? If the latter, then yes, yes!
shanda posted this at 11:38 — 6th April 2003.
They have: 105 posts
Joined: Jan 2002
The person entering the essay would enter the keywords. But I'm still thinking that the search would be sped up slightly (if not more) if I created a script that stripped out common words and stored what's left in the keyword field in the database. I'll just be appreciative that folks are submitting essays to the database. I don't want to hassle them any more.
As for spell checking, hopefully we'll be getting around 50 essays/day in the near future, so I don't think that's feasible. But perhaps I can find a poor college student who wouldn't mind.
Suzanne posted this at 16:17 — 6th April 2003.
She has: 5,507 posts
Joined: Feb 2000
The problem with pulling out keywords with a script is this: what is a common word?
If a person references their work on the Inca population with internet searches, for instance, then you'll get a lot of false hits for "internet". If you say, well, then "internet" is a common word, what do you do with essays on internet technologies?
I realize it's more work to put together a solid search feature but if you want the people who are looking for the essays to find them, you HAVE to put in that kind of effort. No one will use a service that doesn't provide them with reasonably easy to attain results.
One of the largest complaints from users is not being able to accurately find information on websites. They just give up and go elsewhere.
Mark Hensler posted this at 21:20 — 6th April 2003.
He has: 4,048 posts
Joined: Aug 2000
I really feel you need to have an efficient index table. I'm not very familiar with the vBulletin code, but I have a basic understanding of their search mechanism.
Several tables are used to compose the index and perform searches. There is a `word` table which has a `word_id` and `word`. There is a `searchindex` table with `word_id`, `post_id`, and `intitle`. Then there is the `search` table with `search_id`, `query`, `post_ids`, and `dateline`.
When a new post (or essay) is submitted, you will add any new words in the essay to the `word` table. Then add the post_id (or essay_id) to the `searchindex` table for any words that are in the post (or essay).
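Here is a minimal sketch of that indexing step, in Python with SQLite so it runs on its own. The table and column names follow the description above, but the column types, the tokenizer regex, and the `index_post` helper are my assumptions, not vBulletin's actual code.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
CREATE TABLE searchindex (
    word_id INTEGER NOT NULL,
    post_id INTEGER NOT NULL,
    intitle INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (word_id, post_id)
);
"""

def index_post(conn, post_id, title, body):
    """Add any new words to `word`, then link every word of the post in `searchindex`."""
    title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
    body_words = set(re.findall(r"[a-z0-9]+", body.lower()))
    for w in title_words | body_words:
        # Only inserts when the word is not already in the table.
        conn.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (w,))
        (word_id,) = conn.execute(
            "SELECT word_id FROM word WHERE word = ?", (w,)
        ).fetchone()
        conn.execute(
            "INSERT OR REPLACE INTO searchindex (word_id, post_id, intitle) VALUES (?, ?, ?)",
            (word_id, post_id, int(w in title_words)),
        )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
index_post(conn, 1, "War in Iraq", "An essay on the war.")
index_post(conn, 2, "Peace", "After the war ends.")
```

Each distinct word is stored once, no matter how many essays use it, so the `searchindex` rows stay small integer pairs.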
When a person searches for "war iraq", you will query the `word` table for "war" AND "iraq" and get the `word_id` for each. Then you will query the `searchindex` table for any posts that contain the word_ids for "war" OR "iraq".
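The lookup could be sketched like this (Python with SQLite again, as a stand-in for PHP/MySQL). It folds the two steps into a single JOIN. Ordering by how many of the search words each post contains is my own addition, not something I know vBulletin does, and the sample rows are invented.

```python
import sqlite3

def search(conn, query):
    """Return post_ids containing ANY of the query words, best matches first."""
    terms = sorted(set(query.lower().split()))
    if not terms:
        return []
    marks = ",".join("?" * len(terms))  # one placeholder per search word
    rows = conn.execute(
        f"""
        SELECT si.post_id
        FROM searchindex AS si
        JOIN word AS w ON w.word_id = si.word_id
        WHERE w.word IN ({marks})
        GROUP BY si.post_id
        ORDER BY COUNT(*) DESC, si.post_id
        """,
        terms,
    ).fetchall()
    return [r[0] for r in rows]

# Hypothetical index contents for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE NOT NULL);
CREATE TABLE searchindex (word_id INTEGER, post_id INTEGER, intitle INTEGER DEFAULT 0);
INSERT INTO word (word_id, word) VALUES (1, 'war'), (2, 'iraq'), (3, 'peace');
INSERT INTO searchindex (word_id, post_id) VALUES (1, 10), (2, 10), (1, 11), (3, 12);
""")
```

With this data, "war iraq" returns post 10 (both words) ahead of post 11 (only "war"). No essay text is scanned at all.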
Whenever a search is performed, the results of the search are temporarily stored in the `search` table. This helps keep the server load down. Let's say that you show 10 results per page, but there are 50 posts (essays) that match "war" OR "iraq". Well, rather than performing the search every time, we'll just pull the list of posts out of the `search` table and show the next 10. Also, if two people search for the same thing, the system will see that the search was performed only 5 minutes ago, and use the cached results in the `search` table. The system may be configurable to use cached results for a variable length of time (setting it to daily would mean it would perform any given query only once per day).
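A sketch of the caching idea, in Python with SQLite. The `search` table columns follow the description above. The comma-separated `post_ids` format, the 300-second lifetime, and the `cached_search` and `page` helpers are assumptions on my part.

```python
import sqlite3
import time

CACHE_SECONDS = 300  # assumed lifetime; the description says this may be configurable

def cached_search(conn, query, run_search, now=None):
    """Reuse a recent result for the same query, else run the search and store it."""
    now = int(time.time()) if now is None else now
    key = " ".join(sorted(set(query.lower().split())))  # normalize word order and case
    row = conn.execute(
        "SELECT post_ids FROM search WHERE query = ? AND dateline > ? "
        "ORDER BY dateline DESC LIMIT 1",
        (key, now - CACHE_SECONDS),
    ).fetchone()
    if row is not None:
        return [int(p) for p in row[0].split(",") if p]
    post_ids = run_search(key)
    conn.execute(
        "INSERT INTO search (query, post_ids, dateline) VALUES (?, ?, ?)",
        (key, ",".join(str(p) for p in post_ids), now),
    )
    return post_ids

def page(post_ids, page_number, per_page=10):
    """Slice one page of results out of the cached list (pages start at 1)."""
    start = (page_number - 1) * per_page
    return post_ids[start:start + per_page]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search (search_id INTEGER PRIMARY KEY, query TEXT, post_ids TEXT, dateline INTEGER)"
)

calls = []  # records how often the expensive search really runs

def fake_search(q):
    """Stand-in for the real index lookup: pretend 50 posts match."""
    calls.append(q)
    return list(range(1, 51))

first = cached_search(conn, "war iraq", fake_search, now=1000)
second = cached_search(conn, "Iraq war", fake_search, now=1100)  # served from the cache
```

Paging through the 50 results then only slices the stored list, so the index is not queried again until the cached row expires.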
To enhance this and keep the database smaller, you could eliminate 'noise' words from the `word` table. vBulletin restricts its table by requiring a minimum number of characters per word (configurable).
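The length restriction could be as simple as the following Python sketch. The default of 4 is an arbitrary example, not vBulletin's actual setting.

```python
import re

def indexable_words(text, min_length=4):
    """Return the distinct words of at least min_length characters, sorted."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({w for w in words if len(w) >= min_length})
```

Note the trade-off: with a minimum of 4, a search for "war" could never match, because that word is never indexed. Dropping the minimum to 3 lets "war" in, but also "the".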
Mark Hensler
If there is no answer on Google, then there is no question.