Put Aggregators on a Diet

As a host containing thousands of weblogs, Weblogs.Com has to deal with one of the big scaling issues with syndication feeds: Once an aggregator subscribes to a feed, it could be checking the file multiple times a day, even when the site hasn't changed in years.

For example, Java.Weblogs.Com hasn't been updated since 2001. A single user who subscribes to its RSS feed could be requesting that 13K file a dozen or more times a day. If the site has 20 subscribers, they could potentially be using 144 megabytes of traffic a month requesting that file.

There are a number of ways to address this problem, and feed providers probably need to adopt all of them.

Charles Miller has written a tutorial on supporting conditional GET requests in HTTP. These requests enable a client to avoid downloading a file it has already received, using the ETag and Last-Modified headers. A client can learn from a response of around 200 bytes that the feed has not changed.
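Here's a minimal sketch of the client side of that bargain, using Python's standard library; the feed URL is hypothetical, and the ETag and Last-Modified values would be the ones saved from the previous successful fetch:

    # Minimal conditional GET sketch: send back the validators from the
    # last fetch and treat a 304 as "nothing changed, skip the download."
    import urllib.error
    import urllib.request

    FEED_URL = "http://example.com/rss.xml"  # hypothetical feed address

    def fetch_feed(etag=None, last_modified=None):
        """Return (body, etag, last_modified); body is None on a 304."""
        request = urllib.request.Request(FEED_URL)
        if etag:
            request.add_header("If-None-Match", etag)
        if last_modified:
            request.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(request) as response:
                return (response.read(),
                        response.headers.get("ETag"),
                        response.headers.get("Last-Modified"))
        except urllib.error.HTTPError as err:
            if err.code == 304:  # Not Modified: keep the cached copy
                return None, etag, last_modified
            raise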

Looking at my logs, it appears that a bunch of popular aggregators support conditional requests, including Bloglines, NetNewsWire, Newz Crawler, NewsGator, RssBandit, SharpReader, and Wildgrape NewsDesk. Radio UserLand does not.

When I request a feed from Weblogs.Com with an HTTP viewer, it looks like Frontier, the server hosting the site, does not send a Last-Modified header. If it did, the traffic burden that moldy RSS feeds place on the site would be reduced considerably.
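To show what's being left on the table, here is a sketch of the server side. This isn't Frontier's actual behavior, just an illustration in Python serving a hypothetical static feed file: it sends a Last-Modified header and answers If-Modified-Since with a 304, so polls of an unchanged feed cost a few hundred bytes instead of the whole file.

    # Sketch of a feed server that honors If-Modified-Since; the file
    # name and port are hypothetical.
    import os
    from email.utils import formatdate, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    FEED_PATH = "rss.xml"  # hypothetical static feed file

    class FeedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            mtime = os.path.getmtime(FEED_PATH)
            since = self.headers.get("If-Modified-Since")
            if since:
                try:
                    if parsedate_to_datetime(since).timestamp() >= int(mtime):
                        self.send_response(304)  # Not Modified, no body
                        self.end_headers()
                        return
                except (TypeError, ValueError):
                    pass  # unparsable date: fall through to a full response
            with open(FEED_PATH, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Last-Modified", formatdate(mtime, usegmt=True))
            self.send_header("Content-Type", "application/rss+xml")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), FeedHandler).serve_forever()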

One proposal recommends an XML-based redirect that indicates a feed will never change again by redirecting it to nowhere.

If you're going to point a dead site's feed to nowhere, why not simply delete it? Any decent aggregator will eventually dump a feed that results in "file not found" errors.

There's also a way to slow these requests to a trickle in RSS 2.0: When a feed should be considered inactive, set the skipHours and skipDays elements to block everything but one hour a week.
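For a dormant feed, the channel could carry something like this sketch, which skips every day but Sunday and every hour but midnight (hour values are in GMT, per the RSS 2.0 spec):

    <skipDays>
      <day>Monday</day>
      <day>Tuesday</day>
      <day>Wednesday</day>
      <day>Thursday</day>
      <day>Friday</day>
      <day>Saturday</day>
    </skipDays>
    <skipHours>
      <hour>1</hour> <hour>2</hour> <hour>3</hour> <hour>4</hour>
      <hour>5</hour> <hour>6</hour> <hour>7</hour> <hour>8</hour>
      <hour>9</hour> <hour>10</hour> <hour>11</hour> <hour>12</hour>
      <hour>13</hour> <hour>14</hour> <hour>15</hour> <hour>16</hour>
      <hour>17</hour> <hour>18</hour> <hour>19</hour> <hour>20</hour>
      <hour>21</hour> <hour>22</hour> <hour>23</hour>
    </skipHours>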

Though some aggregators may not support these elements yet, they are likely to adopt this established part of the RSS format as bandwidth consumption becomes more of an issue. This approach also leaves a way to check an inactive feed occasionally, in case the publisher resumes activity in the future.

Comments

While I agree that aggregators ought to have some way of dealing with 404s, I'm not so sure they do. I suspect (based purely on memory, not testing) that for the most part they will refuse to subscribe to a 404 (even though I may know that the URL will exist in the future, and want to subscribe now to be ready for when it does), but will continue checking it, with or without some indication that there's a problem, for all time. Certainly invisibly dropping it would be a horrible solution: people who shut their site down for a while, or have it shut down due to hosting problems, may be 404'd for any length of time, but keeping an eye out for their return is exactly the sort of task my aggregator should handle for me. The only things I can see working are aggregators adaptively backing off on the polling interval, so that a couple of days of 404s shifts them to 2n, a week shifts them to 12n, and so on, and producers using unambiguous, clearly-not-accidental HTTP status codes like 410 when they really mean "never ask me for that again, fool!"
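The back-off this comment describes might look something like the following sketch; the function name is hypothetical, and the thresholds and multipliers are just the commenter's examples rendered in Python:

    def next_poll_interval(base_hours, consecutive_404_days, status_code):
        """Return hours until the next poll, or None to drop the feed."""
        if status_code == 410:      # Gone: "never ask me for that again"
            return None
        if status_code != 404:
            return base_hours       # healthy feed: normal polling rate
        if consecutive_404_days >= 7:
            return base_hours * 12  # a week of 404s: back way off (12n)
        if consecutive_404_days >= 2:
            return base_hours * 2   # a couple of days of 404s: slow to 2n
        return base_hours           # a transient 404: don't overreact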

OT: your [comments] link in your feed is a touch over-escaped: unless I missed the memo, the content-model for [comments] is still text, not text/html, so double-escaping ampersands is once too many, and the end result is a URL that doesn't work.

You can also set the TTL to once a month or something like that.
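In RSS 2.0, ttl is expressed in minutes, so a month-long hint would look something like this:

    <ttl>43200</ttl>  <!-- 30 days x 24 hours x 60 minutes -->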

I wondered if the ttl element could be used to help alleviate bandwidth issues, but the language of the RSS 2.0 spec would appear to prevent it.

The spec says the element indicates "how long a channel can be cached before refreshing from the source," not how long it must be cached.

I'll check that comments link. I need to move my comments in house anyway and serve them with PHP/MySQL.

From my own experimentation, the TTL is observed by some RSS clients. It would be interesting if someone provided a list of those that used the element to slow polling.

Or, better yet, a chart of the aggregators/method-of-dealing-with-this.

That there are available solutions, which should only get better...

...wouldn't you think that'd shut up those arguing that polling is a no-go?

Btw, 144 megs a month... How big are the 500,000,000 movies Disney sent LAST year using RSS2?

Another variation: smarter aggregators that notice that nothing new has happened and either throttle down or prompt the user to unsubscribe.

Sounds like a winner ta-me, Andrew!

Best I've seen so far.

As The Scoble pointed out, more-or-less, if you change your polling rate from once every 15 minutes to once a day, you just bulked up your polling capability a hundred-fold.

(That'd be a couple-orders-of-magnitude, for the benefit of The Scoble and others of Generation-Calculator...;-)

Mebbe laterz, elsewherezzzz...

Andrew hits nail on head. This certainly wouldn't be any harder than supporting another XML language like 'redirect'.

This is probably a little harder to implement, but anyway... Why not make aggregators aware of a feed's average update frequency and tune the polling frequency to match? In general, this would save bandwidth. Capping the interval at one month would also allow the rebirth of almost-dead feeds. (More details on my weblog.)

Ramping up and down on polling frequency may be needed, but I think a lot of users would hate it.

Many webloggers publish sporadically. What if one of your favorites stops posting regularly for a couple of months and then posts a huge announcement that you've been waiting for upon his return? Do you really want to be the last one to get the news because your aggregator is smart? The immediacy of syndication, compared to the difficulty of checking sites manually, is one of its selling points.

Rogers, I agree with you, but as I say in my original post, I'd make this feature selectable on a per-feed basis, and I'd also exclude exceptional delays for just the reason you mention: if a feed averages, say, one update a day, I'd keep polling at that frequency; when an update finally occurs, I'd recompute the average from that moment on, and if the old pace is restored I'd simply ignore the statistical exception. I wouldn't be aggressive in ramping down the polling frequency: that way it would affect only long-sleeping weblogs, and only over the long term.
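As a sketch, the frequency-tracking scheme these comments describe could start as simply as this; the names and the one-hour default are hypothetical, and the one-month cap is the commenter's suggestion:

    MAX_INTERVAL_HOURS = 24 * 30  # one-month cap, so dead feeds can revive

    def poll_interval(update_times):
        """Average gap between a feed's updates, in hours, clamped."""
        if len(update_times) < 2:
            return 1.0  # no history yet: hourly, an arbitrary default
        gaps = [b - a for a, b in zip(update_times, update_times[1:])]
        average_hours = sum(gaps) / len(gaps) / 3600  # Unix timestamps
        return min(max(average_hours, 1.0), MAX_INTERVAL_HOURS)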
