Spidering through directories - me again :)
You can easily search through your own website with a script: all you have to do is open the directories with opendir() and friends. But I don't know of any way to do this on another website.
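For the local case, a minimal sketch of that opendir approach (the ./htdocs document root and the whatever.zip file name are only placeholders):
----------
#!/usr/bin/perl
# Sketch only: recursively scan a local document root for a given file name.
use strict;
use warnings;

sub scan_dir
{
    my ($dir, $target) = @_;
    opendir(my $dh, $dir) or return;                       # skip unreadable dirs
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    for my $entry (@entries) {
        my $path = "$dir/$entry";
        if (-d $path) {
            scan_dir($path, $target);                      # recurse into subdirectories
        }
        elsif ($entry eq $target) {
            print "found: $path\n";
        }
    }
}

scan_dir("./htdocs", "whatever.zip");
----------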
cu
patrick
Dass posted this at 01:12 — 31st May 1999.
They have: 109 posts
Joined: Apr 1999
In PERL,
is there a way to search through all of a website.com's directories, subdomains, etc.?
I.e., say I wanted to look for whatever.zip at
download.com
is there a way to search through every directory and subdomain (/win95/something else/another thing, or something.domain.com/somethingelse/)
and look for that query "whatever.zip"?
If you can shed some light on this, that would be great.
anti posted this at 20:46 — 31st May 1999.
They have: 453 posts
Joined: Jan 1999
Hi,
using the modules:
LWP::UserAgent
HTTP::Request
HTTP::Response
you can easily retrieve a document from a remote website.
Parsing this document for <a> tags in Perl is very easy.
The recursion can get a little tricky, but this way you can at least get all documents that are linked from a starting point.
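As a rough sketch of that idea, fetching a single page and listing its href targets (the URL is only a placeholder, and a real spider would be better off with a proper HTML parser than with a regex):
----------
#!/usr/bin/perl
# Sketch only: fetch one remote page and print the href targets found in it.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new(timeout => 10);
my $request  = HTTP::Request->new(GET => 'http://www.example.com/');
my $response = $ua->request($request);

if ($response->is_success) {
    my $html = $response->content;
    # crude <a href="..."> extraction; good enough for a sketch, fragile in general
    while ($html =~ /<a\s[^>]*href\s*=\s*"([^"]*)"/gi) {
        print "$1\n";
    }
}
else {
    print "GET failed: ", $response->status_line, "\n";
}
----------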
Maybe this will help you:
----------
#!/bin/perl
require LWP::UserAgent;
require HTTP::Request;
require HTTP::Response;
use URI::URL ();
#use strict;
$ua = new LWP::UserAgent;
$ua->agent("Mozilla/4.0" ) ;
$ua->timeout(10);
$ua->proxy('http','http://proxy:80/');# remove or adjust this line if you don't go through a proxy
$exclude_files = '.+\.(exe|zip|tgz|gz|pdf|tar|arj)';# all urls matching this regex will be ignored
get_all("http://62.144.158.186/" ) ;
#get_all("" ) ;
sub get_all
{
my($starturl)=@_;
$urls_next=1;# next one to do
$urls_last=0;# last used
if ($starturl eq "" )# we want to continue
{
load_links("links.dat" );
# FIXME: how do we know where we stopped ???
#$urls_next = ???
}
else# we want to start a new run
{
$referer="";
open (LINKS,">links.dat" );
print LINKS "$starturl;;\n";
close (LINKS);
}
# FIXME:load_links should return if we are done.
#while (eof<LINKSL> )
while (load_links("links.dat" ))
{
#load_links("links.dat" );
do
{
$url = $urls[$urls_next];
$referer = $refs[$urls_next];
$urls_next ++;
print "---> $url\n";
save_url($url,$referer,"page.dat" );
save_links("page.dat",$url,"links.dat" );
sleep 2;
}
while ($urls_last >= $urls_next);
}
}
sub load_links
{
my($file)=@_;
my($old_last);
$old_last = $urls_last;
open (LINKSL,"<$file") || die("hmm\n");
while (<LINKSL> )
{
chomp;
@parms = split /;/,$_;
$isnew=1;
foreach $url (@urls)
{
if ($url eq $parms[0])
{
$isnew=0;
}
}
if ($isnew)
{
$urls_last ++;
$urls[$urls_last]=$parms[0];
$refs[$urls_last]=$parms[1];
}
}
close (LINKSL);
if ($old_last == $urls_last){ return(0) }
else{ return(1) }
}
sub save_links
{
my($page,$base,$file)=@_;
$base =~ /(.*\/\/.*?)\/.*/;
$baseserver = $1;
$base =~ /(.*\/).*/;
$basedir = $1;
# let's make one long string containing all the tags ... just for fun
$tagstring="";
open(PAGE,"<$page" );
while (<PAGE> )
{
while($_ =~ /(.*?)<(.*?)>(.*)/)
{
$_ = $3;
$tagstring =$tagstring.$2."\n";
}
}
close(PAGE);
# let's make another string containing only the a and img tags ... sure ... we could have done that in the last step.
$linkstring="";
@tags=split /\n/,$tagstring;
foreach $tag (@tags)# images first (to fool banner-programs)
{
if ($tag =~ /.*img.*src.*/i)# do case insensitive matching
{
$linkstring = $linkstring.$tag."\n";
}
}
foreach $tag (@tags)
{
if ($tag =~ /.*a.*href.*/i)
{
$linkstring = $linkstring.$tag."\n";
}
}
# let's extract all urls ... and ... make absolute urls from them ... we could have made another loop, but ...
$urlstring="";
@links=split /\n/,$linkstring;
foreach $link (@links)
{
my $url = "";# avoid re-using a stale $1 when neither pattern matches
if ($link =~ /.*src\s?=\s?"(.*?)".*/i)
{
$url = $1;
}
elsif ($link =~ /.*href\s?=\s?"(.*?)".*/i)
{
$url = $1;
}
if ($url =~ /.*mailto:.*/){ $url =""; }
elsif ($url =~ /$exclude_files/){ $url =""; }
elsif ($url =~ /.*http:.*/){ $url = $url; }
elsif ($url =~ /^\/(.*)/)
{
$url = $baseserver.$url;
}
elsif ($url =~ /(.*)/)
{
$url = $basedir.$url;
}
else{ $url =""; }
if ($url eq "" ){}
else
{
$urlstring = $urlstring.$url.";$base;\n";
}
}
open (LINKS,">>$file" );
print LINKS $urlstring;
close (LINKS);
}
sub save_url
{
my($url,$referer,$file)=@_;
my $request = new HTTP::Request 'GET',$url;
$request->referer($referer);
my $response = $ua->request($request,$file);
if ($response->is_success)
{
print now()." ".$response->code()." GET $url ($referer)\n";
}
else
{
print now()." ".$response->code()." GET $url ($referer)\n";
}
}
sub now
{
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) =
gmtime(time);
$year += 1900;
$mon += 1;# gmtime months are 0-based
$now=sprintf("%4u-%02u-%02u %02u:%02u:%02u",$year,$mon,$mday,$hour,$min,$sec);
return $now;
}
--------
The code is very crappy and unfinished, but it works and did the job it was used for very well.
ATTENTION: The "crawler" is not site-bound and it doesn't check robots.txt (BAD STYLE).
DON'T start it Friday evening and go away till Monday; you may suck down the whole Internet.
(I actually pulled about 8 GB the first weekend I ran it ...)
hope it helps.
AGAIN: DON'T USE THIS SCRIPT AS IS !!!
(sorry for shouting, but it's important.)
ciao
Anti
----------
ps:watch my work in progress at
http://webhome.nu/
Anonymous posted this at 23:11 — 31st May 1999.
They have: 5,633 posts
Joined: Jan 1970
did i ever mention, i hate these smileys
Dass posted this at 00:01 — 1st June 1999.
They have: 109 posts
Joined: Apr 1999
So when completed, this will go through all the directories of the site looking for all the .exe, .zip, etc. files??
Will it also go through links to other sites, and do the same there?
Dass posted this at 19:06 — 1st June 1999.
They have: 109 posts
Joined: Apr 1999
Thanks a lot for the help!
I am wondering if anyone has any links or can introduce me to
LWP::UserAgent
HTTP::Request
HTTP::Response
anti posted this at 23:45 — 1st June 1999.
They have: 453 posts
Joined: Jan 1999
Hi,
1. sorry for the smiley-mangled code, I should have double-checked ... : )
2. this script will in fact spider all that is linked (except for the $exclude_files matches)
between
save_url($url,$referer,"page.dat");
and save_links("page.dat",$url,"links.dat");
you could add a routine that checks page.dat for your search string.
After save_links you can add some code that copies page.dat somewhere else to keep the file (if it was your target file).
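A rough sketch of such a check routine (the sub name page_matches and the search string are made up, not part of the script above):
----------
# Sketch only: return true if the downloaded page contains the search string.
sub page_matches
{
    my ($file, $search) = @_;
    open(my $fh, '<', $file) or return 0;
    while (my $line = <$fh>) {
        if (index($line, $search) >= 0) {
            close($fh);
            return 1;
        }
    }
    close($fh);
    return 0;
}

# inside get_all, between save_url(...) and save_links(...):
#   if (page_matches("page.dat", "whatever.zip")) { print "HIT: $url\n"; }
#   (you could also copy page.dat to a safe name here)
----------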
I would like to finish the script as you need it, but I don't have much time right now.
Feel free to ask.
ciao
Anti
----------
ps:watch my work in progress at
http://webhome.nu/
anti posted this at 23:16 — 2nd June 1999.
They have: 453 posts
Joined: Jan 1999
Hi,
the easiest way to get help for a perl-module usually is to type:
perldoc <module>
perldoc LWP::UserAgent
shows you the functions/methods and some examples. What more do you need?
For Perl info in general, try cpan.org.
ciao
Anti
kasper posted this at 04:53 — 6th June 1999.
They have: 2 posts
Joined: May 1999
I'm having trouble configuring this script...
What variables have to be set up? Where should links.dat and the other files be found?
Any help would be really appreciated!
Greg.
----------
Check out my site...
http://mp34real.cjb.net
anti posted this at 18:34 — 7th June 1999.
They have: 453 posts
Joined: Jan 1999
hi,
I must admit the script is kind of ... poorly coded, but it was only intended to show some basics; it was never meant to be used out of the box.
But I'll try to help you:
1.
if you call get_all with an empty string it reads links.dat (see load_links). links.dat contains URL;REFERER pairs.
(some sites/scripts don't give you the file if the referer is wrong.)
2.
a.) Yes, it's the starting URL; if you specify it, it's used. If not, links.dat is read. Right now links.dat is redone from the start; maybe we could add another file that remembers how many lines of links.dat were already done (a rough sketch of that idea follows below).
b.) save_links parses the loaded page (page.dat) for "a href" and "img src"-tags and builds absolute URIs (it sometimes fails, but don't ask me why ?!).
3.
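A possible sketch of that "remember where we stopped" idea from 2a (the file name progress.dat and the sub names are made up, not part of the script above):
----------
# Sketch only: persist the index of the next URL to fetch between runs.
sub save_progress
{
    my ($next) = @_;
    open(my $fh, '>', "progress.dat") or die "can't write progress.dat: $!\n";
    print $fh "$next\n";
    close($fh);
}

sub load_progress
{
    open(my $fh, '<', "progress.dat") or return 1;   # no file yet: start at 1
    my $next = <$fh>;
    close($fh);
    return 1 unless defined $next;
    chomp $next;
    return $next || 1;
}

# in get_all: call save_progress($urls_next) after every page,
# and set $urls_next = load_progress() when continuing an old run.
----------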
Dass posted this at 21:33 — 9th June 1999.
They have: 109 posts
Joined: Apr 1999
Will this spider also work for FTP? Because I am using some of the script to develop a program that spiders through the links and looks for ... mainly all the links that are excluded in the spider you wrote, heh.
If I was to search through both HTTP and FTP, could I do something like
$ua->proxy('http','ftp','http://proxy:80/', 'ftp://proxy:21');
?
anti posted this at 00:09 — 11th June 1999.
They have: 453 posts
Joined: Jan 1999
yes and no.
You would have to change save_url to recognize ftp-URLs and download files via FTP.
But I don't recommend this approach, since FTP is totally different from HTTP. (What Netscape does is only an emulation; it really uses FTP underneath.)
You could write the ftp-links to an external file and process them with another script.
At least that's what I would do.
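A rough sketch of that split (the file name ftp-links.dat and the wrapper sub handle_url are made up; save_url is the sub from the script above):
----------
# Sketch only: divert ftp:// URLs to a separate file instead of fetching them.
sub handle_url
{
    my ($url, $referer) = @_;
    if ($url =~ m{^ftp://}i) {
        open(my $fh, '>>', "ftp-links.dat") or die "can't open ftp-links.dat: $!\n";
        print $fh "$url;$referer;\n";
        close($fh);
    }
    else {
        save_url($url, $referer, "page.dat");   # the existing HTTP path
    }
}
----------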
ciao
Anti
ps:
how about "perldoc Net::FTP" ??