.htaccess ban list

Busy posted this at 03:27 — 1st October 2002.

Joined: May 2001

Anyone use a .htaccess ban list to stop crawlers?

Here is a good list of them: http://www.psychedelix.com/agents.html
Here is a site to test your robots.txt file online: http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

And here is what I've pulled off the net. Anyone know of any others to include? These are mostly download managers and email and image harvesting robots.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackWeb [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bandit [OR]
RewriteCond %{HTTP_USER_AGENT} ^BatchFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Buddy [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Copier [OR]
RewriteCond %{HTTP_USER_AGENT} ^DA [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo\ Pump [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Wonder [OR]
RewriteCond %{HTTP_USER_AGENT} ^Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Drip [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FileHound [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetSmart [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^gotit [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^Iria [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC [OR]
RewriteCond %{HTTP_USER_AGENT} ^JustView [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^lftp [OR]
RewriteCond %{HTTP_USER_AGENT} ^likse [OR]
RewriteCond %{HTTP_USER_AGENT} ^Magnet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mag-Net [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Memo [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mirror [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^MS\ FrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZip [OR]
RewriteCond %{HTTP_USER_AGENT} ^Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^Pockey [OR]
RewriteCond %{HTTP_USER_AGENT} ^Pump [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^Reaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Recorder [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^Snake [OR]
RewriteCond %{HTTP_USER_AGENT} ^SpaceBison [OR]
RewriteCond %{HTTP_USER_AGENT} ^Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Vacuum [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Whacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon
RewriteRule /*$ http://www.site-you-are-sending-the-bot-to.com [L,R]

Note this list is untested by me, so use it at your own risk
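
One thing worth checking before relying on the list: every pattern starts with ^, so it only matches when the name is at the very start of the user-agent. Here is a rough Python sketch of that matching behaviour. The two-pattern subset and the second user-agent string are made up for illustration, and Python's re module stands in for Apache's regex engine, which should behave the same for patterns this simple.

```python
import re

# Two patterns copied from the ban list above (a small subset for illustration).
PATTERNS = [r"^Wget", r"^WebZIP"]

def is_banned(user_agent, patterns=PATTERNS):
    """Return True if any anchored pattern matches the user-agent."""
    return any(re.search(pattern, user_agent) for pattern in patterns)

# The name is at the start of the string, so the ^ anchor matches.
print(is_banned("Wget/1.8.2"))

# Hypothetical agent that mentions WebZIP later in the string:
# the ^ anchor means this one is not caught.
print(is_banned("Mozilla/4.0 (compatible; WebZIP)"))
```

If a tool hides its name inside a Mozilla-style string, dropping the ^ from that pattern would catch it, at the cost of matching more loosely.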

ROB posted this at 07:55 — 1st October 2002.

They have: 447 posts

Joined: Oct 1999

Heh, how strange. I spent the last 4 hours putting together a 'robot agent keyword' list for some stat tracking software I'm working on. I seriously just finished (or gave up) and came here to find a thread on the subject. Very weird (are you reading my mind?)

Anyway, here's my list. Note I converted everything to lowercase (because I store user-agents in lowercase), and the existence of any of these strings (case-insensitive) in a user-agent should, in theory, indicate a crawler. This is all untested too, as I just finished the list.

aitcsrobot
ao/a-t.idrg
arachnoidea
architextspider
atomz
auresys
awapclient
axis
backrub
bayspider
big brother
bjaaland
black widow
blackwidow
borg-bot
bspider
cactvs chemistry spider
calif
cern-linemode
checkbot
christcrawler
cienciaficcion
combine
computingsite
conceptbot
coolbot
cosmos
crawler
crawlpaper
cusco
customcrawl
cyberpilot
cyberspyder
deweb
diagem
die blinde kuh
dienstspider
digger
digimarc
digimarc cgireader
diibot
dittospyder
dlw3robot
dnabot
dragonbot
duppies
ebiness
eit-link-verifier
elfinbot
emc spider
esirover
esismartspider
esther
evliya celebi
explorersearch
fast-webcrawler
fastcrawler
fdse
felix
fido
fish-search
flipper
folio
foobar
fouineur
freecrawl
funnelweb
g.r.a.b.
gammaspider
gazz
gcreep
gestalticonoclast
getterroboplus
geturl.rexx
glimpse
golem
googlebot
grabber
griffon
gromit
grub-client
gulliver
gulper
gulper web bot
harvest
havindex
hazel's ferret web hopper
hku www robot
hometown spider pro
htdig
htmlgobble
hämähäkki
i robot
ia_archiver
iagent
iajabot
ibm_planetwide
image.kapsi.net
imagelock
imagescape
incywincy
industry canada bot
informant
infoseek
infospiders
ingrid
inspectorwww
internet cruiser
iron33
ispi
israelisearch
javabee
jbot
jcrawler
jeeves
jobo
jobot
joebot
jubiirobot
jumpstation
katipo
kdd-explorer
kit-fireball
kit_fireball
ko_yappo
labelgrab
larbin
legs
libertech-rover
linecker
linkidator
linklint
linkscan
linkwalker
lmtaspider
lmtasspider
lockon
logo.gif crawler
lycos
lycos_spider
magpie
mediafox
mercator
merzscope
mindcrawler
moget
momspider
monster
motor
mouse.house
muscatferret
mwdsearch
nec-meshexplorer
nederland.zoek
netcarta
netcarta_webmapper
netmechanic
netscape-catalog-robot
netscoop
newscan-online
nhsewalker
nomad
northstar
occam
open text
openbot
openfind
opilio
orbsearch
packrat
pageboy
parasite
patric
pbwf
pegasus
peregrinator
perlcrawler
pgp-ka
phpdig
piltdownman
pimptrain
pioneer
plumtreewebaccessor
poppi
portalbspider
portaljuice
psbot
pybot
raven
resume robot
rhcs
road runner
robbie
robocrawl
robodude
robofox
robot
robot du crim
robozilla
roverbot
rules
sabic
safetynet
scooter
searchprocess
senrigan
sg-scout
shagseeker
shai'hulud
sharp-info-agent
sidewinder
simbot
site valet
sitetech
sitetech-rover
slcrawler
sleek
slurp
snooper
solbot
spanner
speedy
spider
spiderbot
spiderline
spiderman
spiderview
spyder
squirrel
ssearcher
suke
suntek
superewe
t-rex
tarantula
tarspider
techbot
templeton
teoma_agent1
titan
titin
tlspider
ucsd-crawler
udmsearch
url spider pro
urlck
valkyrie
verticrawl
victoria
vision-search
voyager
vwbot_k
w3crobot
w3index
w3m2
w3mir
w@pspider
wallpaper
web21
webbandit
webcatcher
webcopy
webcrawler
webfetcher
weblayers
weblinker
webmoose
webquest
webreader
webreaper
webs
webvac
webwalk
webwalker
webwatch
webzone
whatuseek_winona
wired-digital-newsbot
wisewire
wlm-
wolp
wwwc
wwwwanderer
xget
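
The check described above (lowercase the user-agent, then look for any keyword anywhere in it) could be sketched in Python roughly like this. The function name and the short keyword subset are just for illustration; the real thing would load the whole list.

```python
# Small subset of the keyword list above, for illustration only.
ROBOT_KEYWORDS = ["googlebot", "ia_archiver", "slurp", "spider", "crawler", "webreaper"]

def looks_like_robot(user_agent, keywords=ROBOT_KEYWORDS):
    """Return True if any keyword occurs anywhere in the lowercased user-agent."""
    ua = user_agent.lower()
    return any(keyword in ua for keyword in keywords)

print(looks_like_robot("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
print(looks_like_robot("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

Since this is a plain substring match, the shorter keywords in the list (for example "webs" or "axis") can also hit user-agents that are not crawlers, so they may need tightening.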

ROB posted this at 08:02 — 1st October 2002.

They have: 447 posts

Joined: Oct 1999

I should combine mine with yours, Busy. Looks like you've got some I didn't, and vice versa.

I can't use the robots.txt directives in my application, but you certainly can with your webserver. Create a file called robots.txt in your web document root and include the lines (not positive on this, from memory):

User-Agent: *
Disallow: /

And voila, any crawlers (that respect the robot exclusion directives) will skip your site.
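
To sanity-check those two lines without waiting for a crawler, a minimal sketch with the robots.txt parser in Python's standard library might look like this. The agent name and path are only examples.

```python
from urllib.robotparser import RobotFileParser

# The two directives from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler would ask this question before fetching a page.
print(parser.can_fetch("Googlebot", "/index.html"))
```

This only shows what a crawler that honours the file would conclude; it does nothing against one that never reads robots.txt.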

Busy posted this at 08:57 — 1st October 2002.

He has: 6,151 posts

Joined: May 2001

Rob, you've got a lot of search engines in your list. I'm just trying to ban the bad ones that leech email addresses, images and site files.

But certain ones like ia_archiver I'll add. It only sucks files, and from what I have found on other sites it has been in legal battles over aggressive bandwidth theft.

The problem with the robots file is that most of these won't read or even look at it, let alone obey it. I've also set up a harvester trap to catch the extra ones I've missed, so I can then ban them by IP.
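
A rough sketch of the trap-then-ban idea in Python, under some loud assumptions: the trap lives at a made-up path /trap/, the access log is in the usual common log format, and the IPs below are documentation addresses, not real offenders. It just turns trap hits into "Deny from" lines you could paste into .htaccess.

```python
TRAP_PATH = "/trap/"  # hypothetical trap URL, linked only where harvesters will find it

def trapped_ips(log_lines, trap_path=TRAP_PATH):
    """Collect the client IPs of every request that hit the trap path."""
    ips = set()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request field, skip the line
        request = parts[1].split()  # e.g. ['GET', '/trap/', 'HTTP/1.0']
        if len(request) >= 2 and request[1].startswith(trap_path):
            ips.add(line.split()[0])  # first field of the log line is the client IP
    return sorted(ips)

def deny_lines(ips):
    """Format the IPs as Apache 'Deny from' directives."""
    return ["Deny from " + ip for ip in ips]

# Made-up sample log entries for illustration.
sample_log = [
    '192.0.2.10 - - [01/Oct/2002:03:27:00 +0000] "GET /trap/ HTTP/1.0" 200 512',
    '192.0.2.20 - - [01/Oct/2002:03:28:00 +0000] "GET /index.html HTTP/1.0" 200 2048',
    '192.0.2.10 - - [01/Oct/2002:03:29:00 +0000] "GET /trap/page2.html HTTP/1.0" 200 512',
]
print("\n".join(deny_lines(trapped_ips(sample_log))))
```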

ROB posted this at 09:27 — 1st October 2002.

They have: 447 posts

Joined: Oct 1999

Ah, didn't realize you were only trying to block the naughty ones. One thing though: the naughty ones probably change their agent name regularly too. Hell, I wouldn't doubt it if some of them created a UA name dynamically. Another method to stop all bots is to only allow known legitimate user-agent strings. Unfortunately, crawlers can send any user-agent string they want, but most identify themselves in the agent.
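
The allow-only-known-agents idea could look something like this in Python. The prefixes are a made-up starting point, not a vetted list of legitimate browsers.

```python
# Hypothetical allowlist: prefixes of user-agents we choose to trust.
ALLOWED_PREFIXES = ("Mozilla/", "Opera/")

def is_allowed(user_agent, allowed=ALLOWED_PREFIXES):
    """Return True only if the user-agent starts with a trusted prefix."""
    return user_agent.startswith(allowed)

print(is_allowed("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
print(is_allowed("Wget/1.8.2"))
```

As noted above, a crawler that sends a browser-style string gets straight through, so this only stops the ones that identify themselves honestly.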