Google API
Is there an API that will allow me to use Google's indexes/directory, or any other directory, to search through in order to create my own search engine?
Or does anyone know how to create a web spider that can go and collect website URLs and store them in a database? I suppose it could be a meta spider, only collecting the meta information and the URL it came from?
JeevesBond posted this at 04:38 — 20th February 2007.
He has: 3,956 posts
Joined: Jun 2002
There's no out-of-the-box solution (as far as I'm aware) that will do this. I would start out with a simple shell script that runs wget (which can automatically go through a web site, following links and downloading pages). Then go through those pages, adding the required information to either MySQL or, easier for scripting purposes, SQLite.
To get you started, here's a call to wget that will make it work through webmaster-forums.net and download every page:
wget -r -w2 http://www.webmaster-forums.net
Important note: the [incode]-w2[/incode] parameter is very important: it makes wget wait two seconds between downloading pages. If you don't include this wait it is very likely webmasters will ban you from their sites. Unless you're offering something useful in return, you need to be very careful about using people's bandwidth.
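Once wget has finished, a small PHP script can walk through the downloaded files, pull out the title and meta tags, and store them in SQLite. Here's a rough sketch of the idea (the database name, table layout and URL rebuilding are just examples, so adjust them to your setup):
<?php
// Rough sketch only: walk the directory wget created and index each page.
// The database name, table layout and path handling are just examples.
$db = new PDO('sqlite:spider.db');
$db->exec('CREATE TABLE IF NOT EXISTS pages (
               url TEXT PRIMARY KEY,
               title TEXT,
               description TEXT,
               keywords TEXT)');

$base = 'www.webmaster-forums.net'; // the directory wget -r creates
$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($base));
$insert = $db->prepare('INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)');

foreach ($files as $file) {
    if (!$file->isFile()) {
        continue;
    }
    $path = $file->getPathname();
    $html = file_get_contents($path);

    // Page title
    $title = '';
    if (preg_match('/<title>(.*?)<\/title>/is', $html, $match)) {
        $title = trim($match[1]);
    }

    // Meta description and keywords
    $meta = get_meta_tags($path);
    $description = isset($meta['description']) ? $meta['description'] : '';
    $keywords = isset($meta['keywords']) ? $meta['keywords'] : '';

    // Rebuild an approximate URL from the local path wget used
    $url = 'http://' . str_replace(DIRECTORY_SEPARATOR, '/', $path);

    $insert->execute(array($url, $title, $description, $keywords));
}
?>
A plain text column and SQLite's LIKE are enough to query it to start with; you can always move to MySQL and proper full-text searching later.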
Also, there is no reason why you couldn't include this as a system call from Ruby, PHP, etc. I don't think building your own search index is the easiest path; using Google's results will be easier (but less interesting). While I've never used Google's search API, there are plenty of sites around that use Google as their 'in-site' search engine, e.g. the W3C. I'm not sure how they managed to do it, and you'd have to include Google's branding.
I did a search and found this: http://www.dankarran.com/googleapi-phpsitesearch/ which uses their API (though it relies on globals, *yuck*). It could be a good starting point.
EDIT: and please don't destroy TWF with that script.
benf posted this at 12:11 — 20th February 2007.
They have: 426 posts
Joined: Feb 2005
OK, thanks for the detailed reply. I just need to check a few things before I go messing around with something I have no idea about.
The wget shell command: yes, I have used this on Linux over SSH to download .tar files for installs. Now, do I simply use it in the same way, through PuTTY or any other shell client, or can I call it through PHP?
There must be a way to set a limit on the amount of stuff I download?
JeevesBond posted this at 23:48 — 20th February 2007.
He has: 3,956 posts
Joined: Jun 2002
Here's the output of [incode]wget --help[/incode]; the items that should be of most interest to you are [incode]-Q[/incode] (quota), [incode]-l[/incode] (recursion depth), [incode]-k[/incode] (convert links) and [incode]-H[/incode] (span hosts):
GNU Wget 1.10.2, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...
Mandatory arguments to long options are mandatory for short options too.
Startup:
-V, --version display the version of Wget and exit.
-h, --help print this help.
-b, --background go to background after startup.
-e, --execute=COMMAND execute a `.wgetrc'-style command.
Logging and input file:
-o, --output-file=FILE log messages to FILE.
-a, --append-output=FILE append messages to FILE.
-d, --debug print lots of debugging information.
-q, --quiet quiet (no output).
-v, --verbose be verbose (this is the default).
-nv, --no-verbose turn off verboseness, without being quiet.
-i, --input-file=FILE download URLs found in FILE.
-F, --force-html treat input file as HTML.
-B, --base=URL prefixes URL to relative links in -F -i file.
Download:
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits).
--retry-connrefused retry even if connection is refused.
-O, --output-document=FILE write documents to FILE.
-nc, --no-clobber skip downloads that would download to
existing files.
-c, --continue resume getting a partially-downloaded file.
--progress=TYPE select progress gauge type.
-N, --timestamping don't re-retrieve files unless newer than
local.
-S, --server-response print server response.
--spider don't download anything.
-T, --timeout=SECONDS set all timeout values to SECONDS.
--dns-timeout=SECS set the DNS lookup timeout to SECS.
--connect-timeout=SECS set the connect timeout to SECS.
--read-timeout=SECS set the read timeout to SECS.
-w, --wait=SECONDS wait SECONDS between retrievals.
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval.
--random-wait wait from 0...2*WAIT secs between retrievals.
-Y, --proxy explicitly turn on proxy.
--no-proxy explicitly turn off proxy.
-Q, --quota=NUMBER set retrieval quota to NUMBER.
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host.
--limit-rate=RATE limit download rate to RATE.
--no-dns-cache disable caching DNS lookups.
--restrict-file-names=OS restrict chars in file names to ones OS allows.
-4, --inet4-only connect only to IPv4 addresses.
-6, --inet6-only connect only to IPv6 addresses.
--prefer-family=FAMILY connect first to addresses of specified family,
one of IPv6, IPv4, or none.
--user=USER set both ftp and http user to USER.
--password=PASS set both ftp and http password to PASS.
Directories:
-nd, --no-directories don't create directories.
-x, --force-directories force creation of directories.
-nH, --no-host-directories don't create host directories.
--protocol-directories use protocol name in directories.
-P, --directory-prefix=PREFIX save files to PREFIX/...
--cut-dirs=NUMBER ignore NUMBER remote directory components.
HTTP options:
--http-user=USER set http user to USER.
--http-password=PASS set http password to PASS.
--no-cache disallow server-cached data.
-E, --html-extension save HTML documents with `.html' extension.
--ignore-length ignore `Content-Length' header field.
--header=STRING insert STRING among the headers.
--proxy-user=USER set USER as proxy username.
--proxy-password=PASS set PASS as proxy password.
--referer=URL include `Referer: URL' header in HTTP request.
--save-headers save the HTTP headers to file.
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION.
--no-http-keep-alive disable HTTP keep-alive (persistent connections).
--no-cookies don't use cookies.
--load-cookies=FILE load cookies from FILE before session.
--save-cookies=FILE save cookies to FILE after session.
--keep-session-cookies load and save session (non-permanent) cookies.
--post-data=STRING use the POST method; send STRING as the data.
--post-file=FILE use the POST method; send contents of FILE.
HTTPS (SSL/TLS) options:
--secure-protocol=PR choose secure protocol, one of auto, SSLv2,
SSLv3, and TLSv1.
--no-check-certificate don't validate the server's certificate.
--certificate=FILE client certificate file.
--certificate-type=TYPE client certificate type, PEM or DER.
--private-key=FILE private key file.
--private-key-type=TYPE private key type, PEM or DER.
--ca-certificate=FILE file with the bundle of CA's.
--ca-directory=DIR directory where hash list of CA's is stored.
--random-file=FILE file with random data for seeding the SSL PRNG.
--egd-file=FILE file naming the EGD socket with random data.
FTP options:
--ftp-user=USER set ftp user to USER.
--ftp-password=PASS set ftp password to PASS.
--no-remove-listing don't remove `.listing' files.
--no-glob turn off FTP file name globbing.
--no-passive-ftp disable the "passive" transfer mode.
--retr-symlinks when recursing, get linked-to files (not dir).
--preserve-permissions preserve remote file permissions.
Recursive download:
-r, --recursive specify recursive download.
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite).
--delete-after delete files locally after downloading them.
-k, --convert-links make links in downloaded HTML point to local files.
-K, --backup-converted before converting file X, back up as X.orig.
-m, --mirror shortcut for -N -r -l inf --no-remove-listing.
-p, --page-requisites get all images, etc. needed to display HTML page.
--strict-comments turn on strict (SGML) handling of HTML comments.
Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions.
-R, --reject=LIST comma-separated list of rejected extensions.
-D, --domains=LIST comma-separated list of accepted domains.
--exclude-domains=LIST comma-separated list of rejected domains.
--follow-ftp follow FTP links from HTML documents.
--follow-tags=LIST comma-separated list of followed HTML tags.
--ignore-tags=LIST comma-separated list of ignored HTML tags.
-H, --span-hosts go to foreign hosts when recursive.
-L, --relative follow relative links only.
-I, --include-directories=LIST list of allowed directories.
-X, --exclude-directories=LIST list of excluded directories.
-np, --no-parent don't ascend to the parent directory.
Mail bug reports and suggestions to <bug-wget@gnu.org>.
Note: I've never used the 'quota' option, but it should stop the crawl once a certain amount of data has been downloaded.
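For example, something like this should keep a test crawl small: [incode]-l3[/incode] limits the recursion depth to three levels, and [incode]-Q10m[/incode] caps the total download at roughly ten megabytes (the quota counts bytes, so the k and m suffixes work):
wget -r -l3 -w2 -Q10m http://www.example.com
Swap in whichever site you're testing against, and keep the [incode]-w2[/incode] wait in there.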
You can do this through a *nix command line or with PHP (or even Perl, Python, Ruby, etc.). There's a section on the PHP website covering the different methods for invoking commands. Also, if you're going to run commands based on any user input, there are security issues to think about.
A simple example of how to invoke wget using PHP is:
exec('wget -rk -w2 http://www.webmaster-forums.net');
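If any part of that command comes from user input (say, a URL submitted through a form), run it through [incode]escapeshellarg()[/incode] before handing it to [incode]exec()[/incode]. A quick sketch, with a made-up [incode]$_GET['url'][/incode] parameter:
<?php
// Sketch only: never pass raw user input straight to exec().
$url = isset($_GET['url']) ? $_GET['url'] : '';

// Crude sanity check before shelling out (example only)
if (!preg_match('#^https?://#i', $url)) {
    die('Not a valid URL');
}

// escapeshellarg() quotes the value so it can't break out of the command
exec('wget -rk -w2 ' . escapeshellarg($url), $output, $status);

if ($status !== 0) {
    echo 'wget exited with status ' . $status;
}
?>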
There, that should get you started!