WordPress and Dave Winer are working together to bring real-time, Twitter-style updates to RSS feeds using the cloud element and the accompanying RSSCloud Interface. Yesterday, WordPress added RSS cloud support to "all 7.5 million blogs on WordPress.com." Winer's documenting the ongoing work at RSSCloud.org.
Although some tech sites are reporting this as a new initiative, cloud has been around since RSS 0.92 in December 2000. I was getting real-time RSS updates as a Radio UserLand blogger back then, and it was a great feature.
However, there's a reason that UserLand turned off cloud support in its products several years ago and shut down all of its cloud notification servers. The approach has massive scaling and firewall issues.
To explain why, it's worth looking at an example. I publish the Drudge Retort, which has around 16,000 subscribers, including 1,000 who get the feeds using desktop software on their home computers. If I add cloud support and all of my subscribers have cloud-enabled readers, each time I update the Retort, my cloud update server will be sending around 1,050 notifications to computers running RSS readers -- 1,000 to individuals and 50 to web-based readers.
That's just for one update. The Retort updates around 20 times a day, so that requires 21,000 notifications sent using XML-RPC, SOAP or REST.
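As a quick check of that arithmetic, here is a minimal sketch. The function name is made up for illustration; the figures are the ones quoted above.

```python
def daily_notifications(desktop_subscribers, web_readers, updates_per_day):
    """Return (notifications per update, notifications per day)."""
    per_update = desktop_subscribers + web_readers
    return per_update, per_update * updates_per_day


per_update, per_day = daily_notifications(1_000, 50, 20)
print(per_update, per_day)  # 1050 21000
```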
On Internet servers it's extremely expensive to request data from clients, in terms of CPU time and networking resources. You have to make a connection to the computer, wait for a response and deal with timeouts from servers that are unavailable or blocked by a firewall.
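To make that cost concrete, here is a rough sketch of what a cloud server has to do for every single subscriber on every update. It assumes the REST flavor of notification with one form field named `url`; the function name and the five-second timeout are hypothetical choices, not anything taken from UserLand's or WordPress's code.

```python
import urllib.parse
import urllib.request


def send_notification(subscriber_url, feed_url, timeout=5.0):
    """POST a 'feed changed' ping; return True only if the subscriber answered 2xx."""
    body = urllib.parse.urlencode({"url": feed_url}).encode("ascii")
    request = urllib.request.Request(subscriber_url, data=body, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, refused connections and socket timeouts
        # all derive from OSError, so every such failure lands here.
        return False
```

Each unreachable or firewalled subscriber ties up a connection for as long as the timeout lasts, which is where the expense comes from when the list runs into the thousands.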
RSSCloud also requires that all desktop software receiving cloud notifications functions as a web server. So if an RSS reader like BottomFeeder or FeedDemon adds cloud support, it must show its users how to open firewall ports to accept these incoming requests and possibly open them on their router as well. UserLand's attempt to put web servers on user desktops failed because it was too cumbersome to support. Back when I was writing the book Radio UserLand Kick Start and working closely with UserLand developers, their biggest customer service issue was helping users open up their firewalls so that Radio UserLand could act as a web server.
I don't mean to be a dark cloud, because this functionality could be a nice improvement for web-based RSS readers, letting services like Google Reader and Bloglines receive much quicker updates than they get from hourly polling.
But if the effort to make RSS real time extends to desktop software and mobile clients, cloud won't work. I think that RSS update notification would require peer-to-peer technology and something like XMPP, the protocol that powers Jabber instant messaging.
An interesting thing Winer mentions is how "real time RSS support" would be way beyond the capabilities of Twitter, but the same thing could be done via RSSCloud on a distributed basis. But it did make me wonder "if the total burden is so massive, how much of it can the individual webmaster really handle"?
So for people like me, whose blog is a well kept secret (hosted on Dreamhost), would using that RSSCloud plug-in be a bad move? Or would it become a problem only if I finally started getting a following?
More lack of internet memory --- am I the only one who remembers the hue and cry because aggregators were shipping with unfriendly refresh schedules, running up everyone's bandwidth?
Am I reading the spec correctly that the subscriber ping just tells the aggregator to go refetch the feed? Not the updated item, but everything, compounding the resources used.
Wendell: I think all of the scaling problems are on the server sending out the cloud notifications. A blog that tries the RSSCloud plug-in shouldn't suffer any performance issues.
We've implemented RSS Cloud in a way that I'm pretty confident will scale for all our users. We already send out millions of emails, pingbacks, trackbacks, and more every day, it's just another outgoing request.
If XMPP is your flavor, I'd highly encourage you to check out im.wordpress.com which has information for how to subscribe to any WP blog or comment stream via Jabber and get instant notifications and full content.
Everything old is new again.
I'll set up my blog to ping RSSCloud right after I finish checking out my share.opml.org stats.
Rogers, you have qualms about the "massive scaling and firewall issues." I am inclined to think that Automattic knows about scaling. WordPress.com has millions of blogs, with trackback and all manner of other things.
The term RSSCloud annoys me (see
Now I'm going to wonder all day. What am I supposed to see?
If Dave wants his solution to effectively compete with Twitter I'd expect that he'll need the support of all those Flash-based twitter clients that power users have come to rely on. The last time I looked you could not create an http server using a Flash plug-in or an Adobe Air application.
The difference now versus 5-10 years ago is that
a) There are millions of people using web based RSS readers that can be notified with a few pings.
b) Hosting an RSS aggregator is dirt cheap on any cloud VPS as opposed to thousands of dollars a month back then.
I could be wrong, but I believe at this point there is 1 RSS Reader that supports RSSCloud, and that is River2, which I suspect a fraction of a fraction of a percent currently use. Most desktop RSS readers will never support RSS Cloud simply because it requires a non-firewalled web accessible connection, something very few internet users currently have or will go through the hassle of setting up.
The beauty of RSS Cloud is that it isn't a requirement to read the feed, but rather icing on the cake. Any sysadmin worth a darn would recognize the burden thousands of pings puts on their server every time a post is made and will look at who really needs the real-time updates. To start, you would disable all notifications except for ones for the big web based RSS Readers (i.e. Google Reader) as that will take care of thousands of users with 1 ping, and from there figure out what your server can sustain. Average Joe sitting in his living room who has set up River2 and punched a hole in his firewall to allow his MacBook to receive RSS Cloud updates will just have to sit tight and wait for his RSS Reader to update manually.
The reason RSS Cloud failed to catch on until recently was that people couldn't have cared less about, or even understood, the need for realtime RSS/ATOM communication. Real-time becomes important when the amount of chatter greatly outstrips people's ability to consume even a tiny portion of it.
Rogers, take a look at PubSubHubBub code.google.com . The protocol has similar goals to rssCloud, but addresses issues related to scaling and firewalls. Google Reader, Blogger and other Google properties support the protocol.
Solution? Get yourself a VPS at the Rackspace Cloud (formerly Mosso) for $10 / month and set that up to send your notifications. It's more than enough to handle a few hundred thousand pings / day.
I thought the point of RSSCloud is that the blogs ping the cloud server with updates, while the mobile, flash, and desktop AIR clients initiate a single outbound connection to the cloud server waiting for updates to all of their subscriptions. In other words, clients need only a single outbound connection (with no firewall issues to worry about) and blogs only need to ping a few cloud servers. The cloud server is accepting inbound connections in both cases. The cloud server still needs to deal with 1000s of inbound connections, but that is (hopefully?) a straightforward scaling issue.
Here's my understanding of the potential for a scalability problem.
EXAMPLE: twitter.com has 24,650 twitter followers.
If Dave gets 24,650 followers on an RSSCloud architecture then this is what happens when he posts.
1. Dave creates a 140 char post. His blogging software sends a notice to the cloud server that he has an updated RSS feed.
2. The cloud server sends update notices to the 24,650 subscribed "listeners" to Dave's "RSSCloud-twit-stream".
NOTE: It does not send Dave's new post text, just the alert event.
3. The 24,650 listeners then do an RSS GET from Dave's blogging server. This could create a "cattle stampede" (i.e. slashdot effect) and many users may not get service when Dave's server is overrun. The server would likely be swamped with this massive interest in Dave's blog in a few seconds from these real-time subscribers.
At small levels of users the architecture is effective and elegant. At very large numbers it's missing an essential optimization. Only the "new blog" text should need to be sent... maybe with the RSSCloud event, for example.
An RSS GET will pull the whole string of recent blogs posts for all 24,650 users. A *lot* of excess text that most users already have from being real-time listeners anyway.
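To put rough numbers on the steps above, here is a back-of-the-envelope sketch. The 15,000-byte feed size and the 140-byte delta are assumptions chosen for illustration, and the function name is made up; only the follower count comes from the example.

```python
def stampede_cost(listeners, feed_bytes, delta_bytes):
    """Compare bytes served if every listener refetches the whole feed
    versus receiving only the new post text."""
    full_feed_total = listeners * feed_bytes
    delta_total = listeners * delta_bytes
    return full_feed_total, delta_total


full_total, delta_total = stampede_cost(24_650, feed_bytes=15_000, delta_bytes=140)
print(full_total, delta_total)  # 369750000 3451000
```

Under those assumptions the full-feed GETs move roughly a hundred times more text than the deltas would.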
The RSSCloud Blogger's software needs to see a difference between an RSS GET for the recent blog text and an RSSCloud GET for the latest update text ONLY. That would reduce the amount of text being sent out, but it requires a change to the protocols as described, I think.
Of course, I could be *way off base* but I'm really trying to understand the overall architecture and the "realtime" problem this is intended to resolve for us all.
NOTE: If you federate the RSSCloud servers you just make the "GET" problem even worse. More demand on the blogger's RSS feed in a few seconds. It's like a user driven "slashdot effect". Post a 140 char message and notify the cloud and *boom*... your server falls over.
I'll await corrections to my understanding.
PubSubHubBub has an entirely different approach to the real-time optimization for bloggers. The Hub Server gets the blogger's new post text and the Hub Server forwards this delta to subscribed listeners. The Blogger's server never sees any excess traffic in or out. Of course, the PubSubHubBub service could require the resources of a Google, Amazon or Yahoo. A centralized service that could potentially have a "fail whale". Dave's RSS Cloud has a million "fail fishies".
Life as always is rife with tradeoffs. Go figure. YMMV.
That's how PubSubHubbub works; RSSCloud has the TCP connection going in the wrong direction. That means not only do you have to bust firewalls, but the server is continually timing out clients that have disappeared and (hopefully) re-subscribed from other IPs. It's just not manageable in the tangle that is the actual consumer web.
Plus, like MCDTracy says, RSSCloud doesn't even include the actual content in the outbound pings. The potential for things to go wrong boggles the mind.
The beauty of RSS Cloud is that it isn't a requirement to read the feed, but rather icing on the cake. Any sysadmin worth a darn would recognize the burden thousands of pings puts on their server every time a post is made and will look at who really need the real-time updates.
If one of the key reasons for RSSCloud is to enable a decentralized Twitter-like network to spring up, notification isn't optional for that usage.
I thought the point of RSSCloud is that the blogs ping the cloud server with updates, while the mobile, flash, and desktop AIR clients initiate a single outbound connection to the cloud server waiting for updates to all of their subscriptions.
Nope. If I subscribe to 100 feeds with a cloud-enabled aggregator, my aggregator is waiting for update notifications from (potentially) 100 different cloud servers. It isn't keeping a connection open to every cloud server. It waits to get an XML-RPC, SOAP or REST request that signifies an update.
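For anyone wondering what "waits to get a request" means in practice, here is a minimal sketch of the listening side, using only Python's standard library. It assumes the REST flavor of notification carrying a single form field named `url`; the handler, the `pending_feeds` set and `make_server` are hypothetical names, not taken from any real aggregator.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

pending_feeds = set()  # feed URLs the reader should refetch


class NotifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the form-encoded body and pull out the changed feed's URL.
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode("utf-8"))
        feed_urls = form.get("url", [])
        pending_feeds.update(feed_urls)
        self.send_response(200 if feed_urls else 400)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Bind a listener on localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), NotifyHandler)

# Usage: make_server(5337).serve_forever() would block and wait for pings.
```

The catch is that this listener has to be reachable from every cloud server that might call it, which is exactly the firewall problem for desktop software.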
An RSS GET will pull the whole string of recent blogs posts for all 24,650 users. A *lot* of excess text that most users already have from being real-time listeners anyway.
Yup. If I update a Twitter-like blog 10 times in an hour and have 2,000 cloud-enabled subscribers, they make 20,000 requests of the full feed just to get the most recent update. If my blog feed is 15K in size, I've served up 286 megabytes of bandwidth in that hour.
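Here is that calculation as a small sketch, assuming "15K" means 15,000 bytes and a megabyte is 1,048,576 bytes. The function name is made up for illustration.

```python
def hourly_bandwidth(updates, subscribers, feed_bytes):
    """Return (full-feed requests, megabytes served) for one hour."""
    requests = updates * subscribers
    megabytes = requests * feed_bytes / (1024 * 1024)
    return requests, megabytes


requests, megabytes = hourly_bandwidth(10, 2_000, 15_000)
print(requests, round(megabytes))  # 20000 286
```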
I'm going to do some digging into PSHB to see how it addresses these concerns.
PubSubHubBub gets the update from the Blogger and saves the text delta of the feed.
When the subscriber is notified the text delta is passed along in the data payload.
The amount of text moved and the focus on building a Hub Server is a benefit to the blogger and the subscriber, and a burden for the Hub Service... an obligation to deliver an effective service that never fails.
Companies that are already processing "pings" from millions of blogs welcome the change to the polling paradigm. They have already tried to make blog deltas a real-time service and found it just couldn't scale. PSHB was their take on fixing the problem.
NOTE: PSHB works with RSS-based blogs. There is *no* mandate to change the blogger to use Atom. A "Hat Tip" to the reality of web inertia.
Roger (not Rogers), as our self-anointed historian, could you please recount the casualties of the first great RSS aggregator invasion that you think fell in vain?
I remember the fear. I don't remember much though in the way of casualties. People adjusted the default retry intervals their aggregators shipped with and implemented the client side of ETags and proper HTTP HEAD requests; server-side software followed suit. The rough edges weren't fixed overnight, but that was fine, since RSS didn't take off overnight, and even with Automattic's support, RSSCloud isn't going to either.
As for those worried about all the poor webservers just getting hammered every time an update notification goes out to the cloud, is that really an issue? I mean, event driven webservers like nginx or lighttpd can retire something like 10K requests a second on relatively modest hardware and support thousands of concurrent connections at a time out of a few MB of memory. Yes, that throughput is for static files, but just how often is that RSS feed changing? Even if your RSS feed is served dynamically, you can put nginx in front of apache as a reverse proxy or whatever and set up a rule to cache requests to whatever your RSS feed URL is for 1s.
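The one-second caching idea can be sketched without any particular web server. This is a toy in-process version, not an nginx configuration; `MicroCache` and its parameters are hypothetical names, and `render` stands in for whatever builds the feed dynamically.

```python
import time


class MicroCache:
    """Serve a cached feed body and rebuild it at most once per TTL."""

    def __init__(self, render, ttl=1.0, clock=time.monotonic):
        self._render = render  # callable that builds the feed body
        self._ttl = ttl        # seconds a cached body stays fresh
        self._clock = clock    # injectable so the sketch can be tested
        self._body = None
        self._expires = 0.0

    def get(self):
        now = self._clock()
        if self._body is None or now >= self._expires:
            self._body = self._render()
            self._expires = now + self._ttl
        return self._body
```

With a cache like this in front, a burst of thousands of GETs in the same second costs one render of the feed; the rest are served from memory.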
As for the strain caused by delivering the notifications themselves, the same techniques that have made it possible to serve thousands of requests a second from a modest server are applicable to sending notifications, too, at this point. Someone just has to write a cloud server that uses them. They can probably start with the software script-kiddies use to send spam :)
The critiques of the firewall issues, etc., are the only ones that make sense to me. It seems like that needs to be turned around to use one of the "comet" techniques.
I haven't dug deeply into any of the proposed solutions, but upon reading about RSSCloud and PSHB, I kept thinking that FeedSync had already established a good foundation for a solution to this problem. It may not be perfect, but it's certainly more fully baked than either RSSCloud or PSHB.
"the 24,650 listeners then do an RSS GET from Dave blogging server. This could create "cattle stampede" (i.e. slashdot effect) and many users may not get service when Dave's server is overrun. The server would likely be swamped with this massive interest in Dave's blog in a few seconds from these real-time subscribers."
You are missing the fact that the publisher controls when the client receives the ping. First, the publishing server cannot make 5,000 simultaneous calls, there is bound to be some lag there. Second, that lag can be determined by the pinging server and can be spaced out to an acceptable period.
Instead of pinging 5,000 clients in 1 second, you would be better to ping 500 clients / minute spaced out over 10 minutes. "Oh... but that's not realtime!" Then get a better server that can handle 10k requests a minute, or... hint hint... let Google take care of it for you as an aggregator.
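A minimal sketch of that kind of staggering, assuming the pinging server keeps a plain list of subscriber callback URLs. The function name and the example URLs are made up for illustration.

```python
def schedule_pings(subscribers, per_minute):
    """Return (minute_offset, batch) pairs covering every subscriber once."""
    return [
        (index // per_minute, subscribers[index:index + per_minute])
        for index in range(0, len(subscribers), per_minute)
    ]


subscribers = [f"http://reader{n}.example/notify" for n in range(5_000)]
plan = schedule_pings(subscribers, per_minute=500)
print(len(plan), plan[-1][0])  # 10 9
```

The last batch goes out at minute 9, so the final subscriber hears about the post roughly nine minutes after the first one does.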
"An RSS GET will pull the whole string of recent blogs posts for all 24,650 users. A *lot* of excess text that most users already have from being real-time listeners anyway."
Who cares? That's no different than how it currently works. If the feed cannot be dynamically generated, then serve a static file or limit your RSS feed to the most recent oh... 5 posts instead of 25.
"If one of the key reasons for RSSCloud is to enable a decentralized Twitter-like network to spring up, notification isn't optional for that usage."
It's not like Twitter is 100% realtime for the heavy users. Most people get updates within 5 minutes of being posted depending on their Twitter client settings.
The key I think a lot of the naysayers are missing is the "Cloud" component. Scaling of this system isn't designed for 1 server to talk directly to all 5 million subscribers, it is designed to talk to 500 "Clouds", which then serve 10,000 users each. Current "Clouds" could be considered Google Reader or FeedBurner (if they would build it into the product). See Dave's diagram at http://images.scripting.com/archiveScriptingCom/2009/07/22/schema.gif
In the end, the system isn't perfect, but nothing of this nature can be. You can't magically syndicate millions of tweets across the internet instantly over a distributed network without lag or load issues.
If you care to read more about my perspective, I wrote a response to this post @ The Reason RSS Cloud -Will- Catch On.
Thanks for your input. We're not trying to hurt RSS Cloud. We're just taking a look at how it's designed.
"a lot of the naysayers are missing is the "Cloud" component. Scaling of this system isn't designed for 1 server to talk directly to all 5 million subscribers, it is designed to talk to 500 "Clouds", which then serve 10,000 users each."
Look at Dave's docs carefully. Find this statement in the process.
"The aggregator then reads the feed, finds the new stuff and informs the Reader."
That critical step is not centralized; it's just an RSS GET as always.
If a blogger talks to 500 Clouds then every subscriber to that blogger across all those clouds "reads the feed" in real-time.
The Cloud(s) *only address(es)* a distributed "ping" event system.
It does not take the stress off the bloggers site as (s)he adds more and more followers. It can't scale without a change to the specs.
Our blogs are time-shared instances in yet another type of Cloud like Wordpress.com.
"We're not trying to hurt RSS Cloud."
Some is of course constructive criticism, others have just been stomping on the idea (here and elsewhere).
"If a blogger talks to 500 Clouds then every subscriber to that blogger across all those clouds "reads the feed" in real-time."
I may be mis-interpreting your response, so let me know if I am. The way it works is the end-user reads from the aggregator's (the Cloud's) cached version of the feed, not the publisher's original feed. Content is relayed to end-users, ping requests are not. If I send out 500 pings, I should only get (roughly) 500 GETs in return. Those 500 Clouds could then relay content to 5 million people.
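Here is a toy model of that relay, to show where the GETs land. `CachingAggregator` and `fetch_from_publisher` are hypothetical names, and ten readers per cloud is an arbitrary stand-in for the real audience.

```python
class CachingAggregator:
    """A 'Cloud' that refetches on ping and serves readers from its own copy."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._copy = None

    def on_ping(self, feed_url):
        # One GET against the publisher per ping received.
        self._copy = self._fetch(feed_url)

    def serve_reader(self):
        # Readers never touch the publisher; they get the cached copy.
        return self._copy


origin_fetches = []


def fetch_from_publisher(feed_url):
    origin_fetches.append(feed_url)
    return "<rss>...</rss>"


clouds = [CachingAggregator(fetch_from_publisher) for _ in range(500)]
for cloud in clouds:
    cloud.on_ping("http://example.com/rss.xml")
    for _ in range(10):
        cloud.serve_reader()

print(len(origin_fetches))  # 500
```

The publisher sees 500 GETs no matter how many readers sit behind each cloud, but only if the readers really do read from the cloud's copy.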
Simply put, all this system does is tell Google Reader (and other Clouds) when to pull for fresh content instead of relying on Google pulling stale copies over and over, occasionally finding something fresh.
What you describe would work well up to a point. In fact it's very close to what RSS has always had. Our blogs "ping" a ping service and the large aggregators subscribe to the "ping" service (like ping-o-matic or weblogs.com).
In practice the ping services and industry hardened aggregators didn't reflect changes to blogs in seconds... they slipped over time to minutes as millions of blogs came online and the volume of traffic grew and grew.
RSS Cloud doesn't solve the bottlenecks in this scheme; it just creates a coral reef of new "ping servers" IMHO.
It will scale perfectly well for departmental uses. It's just not a twitter killer in its present form. To create a distributed system that could federate messaging in real time you need to cache the messages (as you indicate) so the real-time transfer of that message scales.
If we see a federated network of RSS Cloud servers *AND* industrial strength Aggregators that cache the messages in multiple locations then Dave's scheme might have some benefits but it's really a simple tweak on the ping protocol and not a real-time twitter killer.
PubSubHubBub was created by coders that run very large aggregators to solve their problems: polling millions of blogs for that changed text they know is waiting for an update in their service... polling millions for changes.
Imagine if the million blogs "pushed" the 140 character deltas to aggregators. Now that's an idea. One push to serve millions.
PSHB's solution to the ping backlog is for the Blogger software to send the *new text* along with the event so the aggregator can simply cache the update. That's an idea that can scale.
To optimize a system: reduce waste, duplication and excess steps.
Dave's solution doesn't optimize the system... it just distributes one piece of it: the ping.
Distributing the "pings" is only a solution to dependency upon a few ping servers. That's a good thing but including the 140 character data payload *with* the ping notice would have been a killer. It's not too late for Dave to tweak the spec. Let's hope he does.
He has started deleting text in the conversation and will likely start targeting his "critics". We're not criticizing Dave Winer but highlighting the design flaws in the protocol design. Users will get confused and developers will simply slip away from the meltdown in civility that can follow if Dave cries "They're attacking me... again."
People that fail to learn from history are doomed to repeat it.
Great software design is tested in a laboratory of ideas and implementations. Let the testing begin. Maybe the internet will change to fit the problem and cache the deltas effectively, and I'm full of shit. I would just like the problem solved, as I'm sure you do.
I really think you are overthinking it. The only high level difference between RSS Cloud and PSHB is that:
- PSHB pushes 1 item to you
- RSS Cloud pushes a request to you to pull that 1 item (or many items) from you.
That's it! It's just that one teeny tiny POST. And as a bonus, instead of putting the load on the POST, it puts the load on the GET (which can be cached).
You keep mentioning the distribution of pings. Do you think those get scattered around the internet, served by multiple Clouds, and individual readers pull the update? It doesn't work that way. Here's how it works as I understand it: Blog A updates, Cloud sees the update, Cloud notifies 50 aggregators, aggregators GET the url the Cloud told them to, then aggregators serve their local copy of Blog A's feed to millions of people.
Think about this... You have 50 messages per minute that your subscribers' "hubs" need to be notified of. With PSHB you have to POST all 50 messages to the hubs, you don't have a choice for them to get all the content. With RSS Cloud you have the option of determining when to POST to your subscribers' "Aggregators" and can schedule that at an interval of each message or even just once a minute. Now which one doesn't scale?
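A small sketch of that scheduling option, assuming the publisher sends at most one ping per interval that contains a new message. The function name is made up, and the post times are an invented example of 50 messages in under a minute.

```python
def notifications_needed(post_times, interval_seconds):
    """Count pings when posts falling in the same interval share one ping."""
    windows = {int(t // interval_seconds) for t in post_times}
    return len(windows)


post_times = list(range(50))  # one message per second for 50 seconds
print(notifications_needed(post_times, 60))  # 1
print(notifications_needed(post_times, 1))   # 50
```

Coalescing to once a minute trades 50 pings for one, at the price of up to a minute of delay before subscribers hear about a message.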
I'm all for PSHB as well as RSS Cloud. I just entered this discussion to correct some of the misconceptions with why RSS Cloud can't work.
Another difference between RSSCloud and PSHB is that RSSCloud requires that the client requesting notifications be the one that receives notifications. PSHB allows clients to designate a hub that will receive notifications for them.
That's a very important difference! It allows you to subscribe someone else's endpoint on your behalf. An example would be a Pubsubhubbub client that converts real-time update messages to XMPP, like this one, to enable PSHB on the desktop: pubsubhubbub-xmpp.appspot.com
Practically everyone here (I did skip to the end due to piling on) seems to be missing the point by confusing an end-user client with an XML-RPC client.
The end users need clouds. This is a server to server protocol.
The Hubs at PSHB won't be run by bloggers and readers. Neither will the Clouds. If Adobe Air clients want to accept real-time feeds they will need to implement their own cloud or use a third party cloud.
It looks like intentional FUD here from people that sound more knowledgeable than they are letting on.
"The end users need clouds. This is a server to server protocol."
Yay, someone else gets it.
"It looks like intentional FUD here..."
I'm trying to understand how a real-time infrastructure can be built that replaces what Twitter provides.
I get the "XML-RPC client" issue: these are server-to-server protocols. In this Use Case a Cloud/Hub to aggregator service.
It would be helpful if someone in the conversation (here or elsewhere) explained how real-time clients are going to get updated 140 character blog (or whatever publishing tool emerges) posts and display them in a "River of Tweets" format.
It's being proposed that this is just around the corner but I just don't see how it should be built... maybe I should just wait and smarter people than I will start to build it. Later I can seek an explanation of how it functions and measure it for "real-time" behavior.
There are users and developers but there are also people that just love to know how things are made. I am one of those people... how will it work.
OK... I *AM* totally full of shit.
Dave Winer's RSS Cloud Implementer's Guide does clarify a lot of my issues:
The Cloud implementer can GET and store the delta text.
We just need to see a large scale Cloud start-up and some Client Apps that can get through Firewalls or are coded to work over port 80 (http) or 443 (https) which work for our browsers, iTunes (podcatchers), etc.
If I can poke a hole in my home firewall (5337) I'll keep digging to understand how this will all work.
(Of course, perl, python or ruby implementations should be popping up soon).
Doesn't Wordpress already have XMPP feed support they are using for some people? Why do this RSS foo when you have a platform that is designed to send low latency messages in realtime already hooked up?
"Why do this RSS foo[?]"
The intent of the RSSCloud spec is to bootstrap a centralized service that conveys realtime changes to RSS-based blogs (i.e., in seconds).
There end up being 3-4 systems required to enable this capability:
1. RSSCloud enabled blogging tools (Wordpress is there).
2. An RSSCloud service (rsscloud.org is up for developers to test with)
3. (Ideally) an aggregation service to take the request load off the blogger's service, caching communities of realtime users' "tweets"/posts.
4. Blog reading software for the users to see "Rivers of Tweets" like Twitter offers.
It's a call to start-ups to take the specs and develop services.
Let the games begin.
I never thought I'd write this... good information at TechCrunch:
The Technical, historical and social dimensions are covered (in comments) behind PuSH architecture.
The implementation details of the Hub/Cloud to make realtime work and scale are really hard. Ideally the Hub/Cloud is an aggregator that stores content to make it scale and reduce latency, traffic and load on the publisher.