Robot Behaving Badly

Before I can run Urchin to track web traffic, the Apache script split-logfile creates log files for each virtual domain on my server.

A bad request from a webcrawling robot breaks split-logfile: - - [13/Jun/2005:03:50:15 -0400] "GET //retort.shtml HTTP/1.1" 400 323 "" "OmniExplorer_Bot/1.07 (+ Internet Categorizer"

The error's in the first field: OmniExplorer_Bot should be requesting the page from the host, not

As it turns out, this bot has bigger problems: It never requests robots.txt, violating the Robots Exclusion Protocol, and will consume hundreds of megabytes crawling a site in one session.


I had the same problem a while back, and since they don't seem to offer much in the way of contact info, I decided to simply ban them outright using .htaccess:

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^(OmniExplorer_Bot|rssImagesBot)
RewriteRule .* - [L,F]

(rssImagesBot also behaves badly)

That code returns an HTTP 403 Forbidden error for any request they make. A pretty drastic measure, I know, but I can't see any other choice [unless there's a ban-user-agents thing in Apache's config?]

One of my sites contains 21MB of indexable data: Omni-Explorer fetched no less than 195MB of data from that site in one day.

In addition to that, I just watched it attack my regular weblog: 8 simultaneous requests, without pausing for breath at all.

My method of blocking this is a bit simpler than Mark's, btw:

SetEnvIfNoCase User-Agent "OmniExplorer_Bot" NotWanted
Deny from env=NotWanted
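
For anyone on a newer server: Deny is the older mod_access syntax. A rough equivalent for Apache 2.4, assuming mod_setenvif and mod_authz_core are loaded and that the block sits in a <Directory> section or an .htaccess file where authorization directives are allowed, might look like this. Treat it as a sketch, not a tested drop-in:

```apache
# Flag any request whose User-Agent contains the bot's name (case-insensitive)
SetEnvIfNoCase User-Agent "OmniExplorer_Bot" NotWanted

# Allow everyone except requests carrying the NotWanted flag
<RequireAll>
    Require all granted
    Require not env NotWanted
</RequireAll>
```

Flagged requests should get a 403 Forbidden, the same outcome as the Deny version.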

However, I am considering redirecting it to with 301 Moved Permanently, so it can gobble some of its owner's bandwidth.
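
A redirect like that could be sketched with mod_rewrite along these lines. The target URL here is a made-up placeholder (example.invalid), since the comment doesn't say where the bot would be sent, and the whole thing is untested:

```apache
RewriteEngine On
# Match the bot's name anywhere in the User-Agent header, ignoring case
RewriteCond %{HTTP_USER_AGENT} OmniExplorer_Bot [NC]
# Placeholder target (hypothetical): swap in whatever URL you actually intend
RewriteRule .* http://example.invalid/ [R=301,L]
```

Whether the bot follows redirects at all is an open question, so the 403 approach is probably the safer bet.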

Needless to say, the Omni crawler is bringing more and more visits but no visitors.

I just can't imagine how these guys (the bot's creators) skipped robots.txt parsing! I think they should rework the bot, or it should be totally banned from all servers.

