A Java application I wrote that reads several dozen RSS feeds started running into trouble with the W3C. Feeds failed with HTTP 503 "Service Unavailable" errors like this one:
Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
At first I thought this was a temporary error. HTTP 503 errors are defined to indicate that a server is temporarily overloaded or undergoing maintenance.
However, the W3C Systems Team announced in February 2008 that they were dealing with so much traffic for their XML DTD files that they were using 503 errors to deal with bandwidth-hogging XML clients that request the files too often:
... we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years. ...
A while ago we put a system in place to monitor our servers for abusive request patterns and send 503 Service Unavailable responses with custom text depending on the nature of the abuse. Our hope was that the authors of misbehaving software and the administrators of sites who deployed it would notice these errors and make the necessary fixes to the software responsible.
But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days.
Although the problem went away for reasons I don't yet understand, I'm looking for a way to read local copies of the XML DTDs with the XOM Java XML library. XOM doesn't yet support XML Catalogs, an XML standard for handling this kind of issue.
When you create a XOM Builder, pass in a org.xml.sax.XMLReader with a custom EntityResolver - the javadocs have examples.