Robot Behaving Badly

Before I can run Urchin to track web traffic, the Apache script split-logfile has to create a separate log file for each virtual domain on my server.

A bad request from a webcrawling robot breaks split-logfile:

www.drudge.com//retort.shtml 65.19.150.252 - - [13/Jun/2005:03:50:15 -0400] "GET //retort.shtml HTTP/1.1" 400 323 "http://www.drudge.com//" "OmniExplorer_Bot/1.07 (+http://www.omni-explorer.com) Internet Categorizer"

The error's in the first field: OmniExplorer_Bot should be requesting the page from the host www.drudge.com, not www.drudge.com//retort.shtml.
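One way to keep a line like this away from split-logfile is to screen the log first. Here is a minimal Python sketch, assuming the virtual host is the first whitespace-delimited field of every line, as in the entry above. The function names and the hostname pattern are my own illustration, not anything that ships with Apache.

```python
import re

# Assumed hostname pattern: dot-separated labels of letters, digits and hyphens.
HOSTNAME = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$")

def has_valid_vhost(line):
    """Return True if the first field of a log line looks like a bare hostname."""
    fields = line.split(None, 1)
    return bool(fields) and HOSTNAME.match(fields[0]) is not None

def filter_log(lines):
    """Yield only the lines whose first field is a plausible virtual host."""
    for line in lines:
        if has_valid_vhost(line):
            yield line
```

Feeding the access log through filter_log and handing only the surviving lines to split-logfile would drop the OmniExplorer_Bot entry, since its first field contains slashes.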

As it turns out, this bot has bigger problems: It never requests robots.txt, violating the Robots Exclusion Protocol, and will consume hundreds of megabytes crawling a site in one session.
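Both complaints can be checked against the access log itself. The Python sketch below is a rough tally, assuming every line follows the layout of the entry above (virtual host, client IP, two identity fields, timestamp, request, status, bytes, referer, user agent). The function name and the regular expression are hypothetical and would need adjusting for a different LogFormat.

```python
import re
from collections import defaultdict

# Assumed layout: vhost, client IP, two identity fields, timestamp,
# request line, status, bytes, referer, user agent (as in the entry above).
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

def summarize_agents(lines):
    """Map each user agent to bytes sent and whether it asked for robots.txt."""
    stats = defaultdict(lambda: {"bytes": 0, "robots_txt": False})
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        entry = stats[match.group("agent")]
        if match.group("bytes") != "-":
            entry["bytes"] += int(match.group("bytes"))
        parts = match.group("request").split()
        # The request line is "METHOD path PROTOCOL"; ignore any query string.
        if len(parts) >= 2 and parts[1].split("?")[0].endswith("/robots.txt"):
            entry["robots_txt"] = True
    return dict(stats)
```

An agent with a large byte count whose robots_txt flag is still False after a full day of logs is a reasonable candidate for blocking.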

Comments

I had the same problem a while back, and since they don't seem to offer much in the way of contact info, I decided to simply ban them outright using .htaccess:

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^(OmniExplorer_Bot|rssImagesBot)
RewriteRule .* - [L,F]

(rssImagesBot also behaves badly)

That code causes an HTTP 403 Forbidden error for any request they make. Pretty drastic measure, I know, but I can't see any other choice [unless there's a ban-user-agents thing in Apache's config?]

One of my sites contains 21MB of indexable data: Omni-Explorer fetched no less than 195MB of data from that site in one day.

In addition to that, I just watched it attack my regular weblog: 8 simultaneous requests, never pausing for breath.

My method of blocking this is a bit simpler than Mark's, btw:

SetEnvIfNoCase User-Agent "OmniExplorer_Bot" NotWanted
Deny from env=NotWanted

However, I am considering redirecting it to omni-explorer.com with 301 Moved Permanently, so it can gobble up some of its owner's bandwidth.

Something About Shakespeare

I think it can be fairly said, as Gore Vidal did, that almost any schoolboy in England has a better command of the English language than George W. Bush.

Regards,

The Spirit of Alexander the Great

Lives on in the union of men of Good will everywhere.

Let us unite and defeat the masters of Iniquity,

And put to rest their vainglorious attempt to usurp the Throne of All Mighty God.

Glory Hallelujah!

Needless to say, the Omni crawler is generating more and more visits but no visitors.

I just can't imagine how these guys (the bot creators) skipped parsing robots.txt! I think they should rewrite the bot, or it should be totally banned from all servers.
