Following Web Page Redirects with Java

CNET moved a bunch of its blogs to a different domain this weekend, including Beyond Binary, Coop's Corner, Geek Gestalt, One More Thing, Outside the Lines and The Social. I mention this because the change hosed Meme13, which treated all six as if they were newly discovered sites.

One of my ground rules for developing Meme13 is that I won't hand-edit the site to make it smarter. I need the application to recognize when existing sites in its database have moved.

Meme13 monitors sites using a Java application I wrote that downloads web pages with the Apache HttpClient 3.0 class library. Web servers indicate that a page has moved by sending an HTTP redirect response of either "301 Moved Permanently," which indicates a permanent move, or "302 Found," which is intended for temporary changes. I wrote a Java method that can find the current location of a web page, even if it has been redirected one or more times:

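import java.io.IOException;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.HeadMethod;

public String checkFeedUrl(String feedUrl) {
    String response = feedUrl;
    HttpClient client = new HttpClient();
    HttpMethod method = new HeadMethod(feedUrl);
    // handle redirects here instead of letting the library follow them
    method.setFollowRedirects(false);
    try {
        // request the feed's headers
        int statusCode = client.executeMethod(method);
        if ((statusCode == 301) || (statusCode == 302)) {
            // feed has moved
            Header location = method.getResponseHeader("Location");
            // a redirect without a Location header leaves the URL unchanged
            if ((location != null) && !location.getValue().equals("")) {
                // recursively check URL until it's not redirected any more
                response = checkFeedUrl(location.getValue());
            }
        }
    } catch (IOException ioe) {
        // on a network error, fall back to the URL we were given
        response = feedUrl;
    } finally {
        // always hand the connection back to the client
        method.releaseConnection();
    }
    return response;
}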

The HeadMethod class requests a web page's headers instead of requesting the entire page, consuming far less bandwidth as it checks for redirects. My Java method looks for both kinds of redirects, because web publishers have a bad habit of using "302 Found" when they've moved a page permanently.
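The method above trusts servers to behave. Two things can trip it up: a pair of pages that redirect to each other, which would make it recurse forever, and a Location header that holds a relative path instead of a full URL. Here is a rough sketch of one way to guard against both. It uses the JDK's own HttpURLConnection rather than HttpClient. The class name RedirectResolver, the methods headLocation and resolve, and the hop limit are all made up for illustration and are not part of Meme13 or either library.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.function.Function;

// Hypothetical helper, not part of Meme13 or Apache HttpClient.
class RedirectResolver {

    // Sends a HEAD request and returns the Location header if the
    // response is a 301 or 302; returns null in every other case.
    static String headLocation(String url) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (status == 301 || status == 302) {
                return conn.getHeaderField("Location");
            }
            return null;
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            // bad URL, non-HTTP URL or network failure: treat as "not moved"
            return null;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    // Follows redirects one hop at a time. The lookup function maps a URL
    // to its redirect target, or null when the URL is not redirected.
    static String resolve(String startUrl, int maxHops, Function<String, String> lookup) {
        String current = startUrl;
        for (int hop = 0; hop < maxHops; hop++) {
            String location = lookup.apply(current);
            if (location == null || location.isEmpty()) {
                return current; // no further redirect: this is the final URL
            }
            try {
                // a relative Location is resolved against the current URL
                current = URI.create(current).resolve(location).toString();
            } catch (IllegalArgumentException e) {
                return current; // unparseable Location: stop where we are
            }
        }
        // too many hops, probably a redirect loop: keep the original URL
        return startUrl;
    }
}
```

A call such as RedirectResolver.resolve(feedUrl, 5, RedirectResolver::headLocation) would give up after five hops and hand back the original URL. Passing the lookup in as a function is only a convenience that lets the hop-following logic be tried out without touching the network.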
