Spidering through directories - me again :)

They have: 5,633 posts

Joined: Jan 1970

You can easily search through your own website with a script; all you have to do is open the directories with opendir()... but I don't know of any way to do this on another website... :(
cu
patrick

They have: 109 posts

Joined: Apr 1999

In Perl,
is there a way to search through all of a website.com's directories, subdomains, etc.?
I.e., say I wanted to look for whatever.zip at
download.com

Is there a way to search through every directory and subdomain (/win95/something else/another thing, or something.domain.com/somethingelse/)
and look for that query "whatever.zip"?

If you can shed some light, that would be great :)

They have: 453 posts

Joined: Jan 1999

Hi,

using the modules:
LWP::UserAgent
HTTP::Request
HTTP::Response

you can easily retrieve a document from a remote website.
Parsing this document for A-tags in Perl is very easy.
The recursion can get a little tricky, but this way you can at least get all documents that are linked from a starting point.
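Just to show the basic idea first, here is a minimal sketch (untested, and the URL is only a placeholder) that fetches a single page and prints the href targets it finds:
----------
#!/usr/bin/perl
# minimal sketch: fetch one page and print the links it contains

use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);

my $url = "http://www.example.com/";# placeholder, put your start page here
my $response = $ua->request(HTTP::Request->new(GET => $url));
die "GET $url failed: ".$response->code."\n" unless $response->is_success;

# very naive A-tag parsing, the same way the big script below does it
foreach my $href ($response->content =~ /<a\s[^>]*href\s*=\s*"([^"]*)"/gi)
{
print "$href\n";
}
----------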

Maybe this will help you:
----------
#!/usr/bin/perl

require LWP::UserAgent;
require HTTP::Request;
require HTTP::Response;

use URI::URL ();
#use strict;

$ua = new LWP::UserAgent;
$ua->agent("Mozilla/4.0" ) ;
$ua->timeout(10);
$ua->proxy('http','http://proxy:80/');

$exclude_files = ".+\.(exe|zip|tgz|gz|pdf|tar|arj)";# all urls matching this regex will be ignored

get_all("http://62.144.158.186/" ) ;
#get_all("" ) ;

# crawl everything reachable from $starturl; with an empty string it continues from links.dat
sub get_all
{
my($starturl)=@_;

$urls_next=1;# next one to do
$urls_last=0;# last used

if ($starturl eq "" )# we want to continue
{
load_links("links.dat" );
# FIXME: how do we know where we stopped ???
#$urls_next = ???
}
else# we want to start a new run
{
$referer="";

open (LINKS,">links.dat" );
print LINKS "$starturl;;\n";
close (LINKS);
}

# FIXME:load_links should return if we are done.
#while (eof<LINKSL> )
while (load_links("links.dat" ))
{
#load_links("links.dat" );
do
{
$url = $urls[$urls_next];
$referer = $refs[$urls_next];
$urls_next ++;
print "---> $url\n";
save_url($url,$referer,"page.dat" );
save_links("page.dat",$url,"links.dat" );
sleep 2;
}
while ($urls_last >= $urls_next);
}
}

# read URL;REFERER pairs from $file into @urls/@refs (skipping known urls); returns 0 if nothing new was found
sub load_links
{
my($file)=@_;

my($old_last);

$old_last = $urls_last;

open (LINKSL,"<$file" ) || die("hmm\n" );
while (<LINKSL> )
{
chomp;
@parms = split /;/,$_;
$isnew=1;
foreach $url (@urls)
{
if ($url eq $parms[0])
{
$isnew=0;
}
}
if ($isnew)
{
$urls_last ++;
$urls[$urls_last]=$parms[0];
$refs[$urls_last]=$parms[1];
}
}
close (LINKSL);
if ($old_last == $urls_last){ return(0) }
else{ return(1) }
}

# parse the downloaded $page for img/a links, make them absolute and append them as URL;REFERER; lines to $file
sub save_links
{
my($page,$base,$file)=@_;

$base =~ /(.*\/\/.*?)\/.*/;
$baseserver = $1;

$base =~ /(.*\/).*/;
$basedir = $1;

# let's make one long string containing all the tags ... just for fun ;)
$tagstring="";
open(PAGE,"<$page" );
while (<PAGE> )
{
while($_ =~ /(.*?)<(.*?)>(.*)/)
{
$_ = $3;
$tagstring =$tagstring.$2."\n";
}
}
close(PAGE);
# let's make another string containing only the a and img tags ... sure ... we could have done that in the last step.
$linkstring="";
@tags=split /\n/,$tagstring;
foreach $tag (@tags)# images first (to fool banner-programs)
{
if ($tag =~ /.*img.*src.*/i)# do case insensitive matching
{
$linkstring = $linkstring.$tag."\n";
}
}
foreach $tag (@tags)
{
if ($tag =~ /.*a.*href.*/i)
{
$linkstring = $linkstring.$tag."\n";
}
}
# let's extract all urls ... and ... make absolute urls from them ... we could have made another loop, but ...
$urlstring="";
@links=split /\n/,$linkstring;
foreach $link (@links)
{
my $url = "";# take the quoted src/href value (don't rely on a stale $1 from the previous link)
if ($link =~ /.*src\s?=\s?"(.*?)".*/i)
{
$url = $1;
}
elsif ($link =~ /.*href\s?=\s?"(.*?)".*/i)
{
$url = $1;
}
if ($url eq "" ){}# nothing usable found
elsif ($url =~ /.*mailto:.*/){ $url =""; }
elsif ($url =~ /$exclude_files/){ $url =""; }
elsif ($url =~ /.*http:.*/){ $url = $url; }
elsif ($url =~ /^\/(.*)/)
{
$url = $baseserver.$url;
}
elsif ($url =~ /(.*)/)
{
$url = $basedir.$url;
}
else{ $url =""; }
if ($url eq "" ){}
else
{
$urlstring = $urlstring.$url.";$base;\n";
}
}
open (LINKS,">>$file" );
print LINKS $urlstring;
close (LINKS);
}

# download $url (sending $referer) into $file and log the result
sub save_url
{
my($url,$referer,$file)=@_;
my $request = new HTTP::Request 'GET',$url;
$request->referer($referer);
my $response = $ua->request($request,$file);

if ($response->is_success)
{
print now()." ".$response->code()." GET $url ($referer)\n";
}
else
{
print now()." ".$response->code()." GET $url ($referer)\n";
}
}

# return the current GMT time as "YYYY-MM-DD hh:mm:ss"
sub now
{
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) =
gmtime(time);
$year += 1900;
$mon += 1;# gmtime returns the month as 0..11

$now=sprintf("%4u-%02u-%02u %02u:%02u:%02u",$year,$mon,$mday,$hour,$min,$sec);

return $now;
}
--------

The code is very crappy and unfinished, but it works and did the job it was used for very well.

ATTENTION: The "crawler" is not site-bound and it doesn't check robots.txt (BAD STYLE).
DON'T start it Friday evening and go away till Monday; you may suck in the whole internet ;)
(actually I got about 8 GB the first weekend I started this ...)

hope it helps.

AGAIN: DON'T USE THIS SCRIPT AS IS !!!
(sorry for shouting, but it's important.)

ciao
Anti

----------
ps:watch my work in progress at
http://webhome.nu/

They have: 5,633 posts

Joined: Jan 1970

did I ever mention I hate these smileys? :)

They have: 109 posts

Joined: Apr 1999

So when completed, this will go through all the directories of the site looking for all the .exe, .zip, etc.??

Will it also go through links to other sites, and do the same there?

They have: 109 posts

Joined: Apr 1999

Thanks a lot for the help!

I am wondering if anyone has any links or can introduce me to
LWP::UserAgent
HTTP::Request
HTTP::Response

They have: 453 posts

Joined: Jan 1999

Hi,

1. sorry for the ;) :) :( I should have double-checked ... : )
2. this script will in fact spider everything that is linked (except for the $exclude_files matches).
Between
save_url($url,$referer,"page.dat" ) ;
and save_links("page.dat",$url,"links.dat" ) ;

you could add a routine that checks page.dat for your search string.

After save_links you can add some code that copies page.dat somewhere else to save the file (if it was your target file).
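For example, a completely untested sketch of such a check (the routine name and the search string are just examples):
----------
# sketch: look for $searchstring in the downloaded page
sub check_page
{
my($page,$url,$searchstring)=@_;
my $found = 0;
open (PAGE,"<$page" ) || die("can't open $page\n" );
while (<PAGE> )
{
if (/\Q$searchstring\E/i){ $found = 1; last; }
}
close (PAGE);
if ($found)
{
print "FOUND '$searchstring' at $url\n";
# after save_links you could then copy page.dat away, e.g. to "found_page.dat"
}
return($found);
}
----------
You would call it like check_page("page.dat",$url,"whatever.zip" ) right after save_url in get_all.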

I would like to finish the script the way you need it, but I don't have much time right now.

Feel free to ask.

ciao
Anti

----------
ps:watch my work in progress at
http://webhome.nu/

They have: 453 posts

Joined: Jan 1999

Hi,

the easiest way to get help for a Perl module is usually to type:
perldoc <module>

perldoc LWP::UserAgent
shows you the functions/methods and some examples. What more do you need?

For Perl info in general, try cpan.org

ciao
Anti

They have: 2 posts

Joined: May 1999

I'm having trouble configuring this script...

What variables have to be set up? Where should links.dat and the other files be found?

Any help would be really appreciated!

Greg.

----------
Check out my site...
http://mp34real.cjb.net

They have: 453 posts

Joined: Jan 1999

hi,

I must admit the script is kind of ... poorly coded, but it was only intended to show some basics. It was never meant to be used out of the box.

But I'll try to help you:

1.
If you call get_all with an empty string, it reads links.dat (see load_links). links.dat contains URL;REFERER pairs.
(Some sites/scripts won't give you the file if the referer is wrong.)
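For illustration, links.dat simply has one URL;REFERER; line per link, e.g. (the second line is a made-up example):
----------
http://62.144.158.186/;;
http://62.144.158.186/win95/index.html;http://62.144.158.186/;
----------
The starting URL has an empty referer; everything found later carries the page it was found on.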

2.

Quote:
Hi, it's me XXX (didn't know if he would like to be quoted) from the webmaster-forums :)

Thanks so much for the help so far! If you have time to answer these
questions, that would be great!

get_all("http://62.144.158.186/" ) ;
#get_all("" ) ;

sub get_all
{
my($starturl)=@_;

What is that used for?? Is it the starting URL to search for links?

And finally, how does the spider actually find these links, and keep track
of which it has been to, which it hasn't, etc.?

a.) yes, it's the starting URL; if you specify it, it's used ;). If not, links.dat is read. Right now links.dat is redone from the start :( . Maybe we could add just another file which remembers how many lines of links.dat were already done.
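For example, something like this could do it (untested; "progress.dat" is just a made-up name):
----------
# sketch: remember how many entries of links.dat were already crawled
sub save_progress
{
my($next)=@_;
open (PROG,">progress.dat" ) || die("can't write progress.dat\n" );
print PROG "$next\n";
close (PROG);
}

sub load_progress
{
open (PROG,"<progress.dat" ) || return(1);# no file yet -> start at the beginning
my $next = <PROG>;
close (PROG);
chomp($next);
return($next);
}
----------
Call save_progress($urls_next) after every page in get_all and set $urls_next = load_progress() instead of 1 when continuing.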

b.) save_links parses the loaded page (page.dat) for "a href" and "img src" tags and builds absolute URIs (it sometimes fails, but don't ask me why ?!).
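btw: the script already loads URI::URL but then builds the absolute URLs by hand with regexes; the module could do that step more reliably, roughly like this (untested, the link and base here are just examples):
----------
use URI::URL;# without the () so that url() is exported

# turn a (possibly relative) link into an absolute one, given the page it was found on
my $base = "http://62.144.158.186/win95/index.html";# example base page
my $abs  = url("../download/whatever.zip", $base)->abs;
print "$abs\n";# http://62.144.158.186/download/whatever.zip
----------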

3.

Quote:
Hi, I am trying to get your "spider" script working on my server. I am quite
experienced in configuring and debugging scripts, but I am not much of a
"writer" :)

I am having trouble getting your script to function... How is links.dat
set up? And what modifications should I make to get the script working on
my server...

I want to be able to have a set of admin-defined URLs spidered, but I'm not
sure where the links should be given to the script....

Any help would be greatly appreciated!

a.)
if you call get_all with a URL, the links.dat file is created and that URL is parsed for links (file: URLs don't work yet).
b.)
you should at least change the $ua->proxy setting and maybe the timeout.
Depending on what you want ignored, you should change the $exclude_files regex (btw: make sure there are pipes "|" between exe, zip, ...).

c.)
you can simply create links.dat with your favourite editor or via
echo xxxx >links.dat
echo xxxx >>links.dat
and call get_all with an empty string.
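For example (the URLs here are just examples; note the trailing ";;", i.e. an empty referer):
echo "http://62.144.158.186/;;" >links.dat
echo "http://62.144.158.186/win95/;;" >>links.dat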

so that's it for now.

three things left:
1) please ask in the forum, the answer will be faster.
2) this script is a "HACK" and nothing more. It's in no way my usual style and I don't give any guarantees.
3) this small input box sucks.

ciao
Anti

----------
ps:watch my work in progress at
http://webhome.nu/

They have: 109 posts

Joined: Apr 1999

Will this spider also work for FTP? Because I am using some of the script to develop a program that spiders through the links and looks for ... mainly all the links that are excluded in the spider you wrote, heh.

If I were to search through both HTTP and FTP, could I do something like
$ua->proxy('http','ftp','http://proxy:80/', 'ftp://proxy:21');
?

They have: 453 posts

Joined: Jan 1999

yes and no.
You would have to change save_url to recognize ftp URLs and download the files via FTP.

But I don't recommend this approach, since FTP is totally different from HTTP. (What Netscape does there is only an emulation; underneath it really uses FTP.)

You could write the ftp links to an external file and parse it with another script.
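Something like this in save_links, for example (untested; "ftplinks.dat" is just a made-up name):
----------
# in save_links, right after the href/src value has been extracted:
# put ftp links aside instead of feeding them to the http crawler
if ($url =~ /^ftp:/i)
{
open (FTPL,">>ftplinks.dat" ) || die("can't open ftplinks.dat\n" );
print FTPL "$url;$base;\n";
close (FTPL);
$url = "";# don't crawl it here
}
----------
A second script could then read ftplinks.dat and fetch the files via Net::FTP.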

at least that's what I would do.

ciao
Anti
ps:
how about "perldoc Net::FTP" ??
