Atom May Not Be My Type

One of the distinct differences between RSS 2.0 and Atom Syndication Format 0.3 is the ability to declare the kind of information an Atom element holds.

In an Atom feed, elements such as weblog entries can have a type attribute that identifies the MIME media type of the content:

<p>Kalina, an 18-year-old killer whale at SeaWorld Orlando, gave birth to her fourth calf.</p>

Although I originally regarded this as a plus for Atom, as an expert in the format for two going on three days, I'm beginning to wonder whether this precision is worth the effort in practice.

There are a lot of weblog editing tools and browser-based services in which users author content without indicating whether it's plain text, HTML, or XHTML.

I turn off the WYSIWYG editor in Radio and use a simple text box to write Workbench because I have never used an HTML editing control that creates valid, cleanly composed markup.

For dorks like me, there's no reliable way for software to programmatically determine the media type of user-drafted text. If I entered "Lou Montulli will pay for inventing the <blink> tag," have I just authored "text/plain" or "text/html"?

The default media type for Atom content elements is "text/plain". I suspect that many weblog authors would balk at being required to compose entries as well-formed XML such as XHTML.

For these reasons, if I were writing an Atom-enabled weblog editor, I would follow Six Apart's lead and simply declare that my users are creating "text/html" content. That's the expectation of software that reads RSS 2.0, even without a declared media type:

<p>Kalina, an 18-year-old killer whale at SeaWorld Orlando, gave birth to her fourth calf.</p>

For authoring software aimed at non-technical people who create Web content, the type attribute seems like an elegant solution in search of a problem.


In Netscape's RSS 0.91, revision 3, there was a DTD, and it supported a number of entities. An example to make this clear:

<title>&iquest;Que Pasa?</title>

In general, RSS 2.0 feeds have more elements defined than were defined in the RSS 0.91 specification, so they can't make use of this DTD. However, experimentally, it appears that the Userland Radio aggregator supports entity-encoded HTML in titles. An example would be:

<title>&amp;iquest;Que Pasa?</title>
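A minimal sketch of the difference between those two titles, assuming Python's standard library rather than any aggregator's real code. The one-element documents below are stand-ins for a full feed, and `parse_title` is a hypothetical helper. Without a DTD that declares `iquest`, an XML parser rejects the first form; the second form parses, but only a client that treats the title as HTML decodes the remaining entity.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical one-element documents standing in for an RSS <title>.
BARE = "<title>&iquest;Que Pasa?</title>"
DOUBLE = "<title>&amp;iquest;Que Pasa?</title>"

def parse_title(xml_text):
    """Return the title's character data, or None if the XML is rejected."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None

# No DTD declares iquest here, so the bare entity is undefined.
bare_result = parse_title(BARE)

# The entity-encoded form parses; the parser hands back "&iquest;Que Pasa?".
double_result = parse_title(DOUBLE)

# A client that treats titles as HTML decodes one more layer.
as_html = html.unescape(double_result)

print(repr(bare_result), repr(double_result), repr(as_html))
```

A client that treats the title as plain text would stop at `double_result` and show the ampersand form literally.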

Although I'm usually known to debate in favor of RSS and against Atom, I have to disagree on this point. I like the type safety that Atom's content/type attribute provides. My complaint would be, hey, why not just put the content element in as an extension of RSS so we don't have AFOR?

"I don't need this for my particular situation in this app" doesn't strike me as a very strong argument.

Having a content model, any content model, is more important than having three, to be sure. If either format just said "the content model for this element is text/html" then you would know that under no circumstances could you render it without either an HTML agent, or an HTML parser that can safely strip out HTML tags and convert entities to characters. But RSS doesn't have a content model, it just has the phrase "(entity-encoded HTML is allowed)." Does that mean that the absence of angle-brackets coming out of the XML parser means it's text, or HTML without any tags? That makes a huge difference. Do you treat every RSS flavor's description as being HTML, or only 0.92/3/4 and 2.0? What is the content model for other 2.0 elements, especially item/title and channel/title and channel/description? Try treating them as text, since they don't say "(entity-encoded HTML is allowed)," and toss in a once-encoded angle bracket and look at how various aggregators treat it. Then try treating them as HTML, and toss in a twice-encoded angle bracket.
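The experiment with once-encoded and twice-encoded angle brackets can be sketched roughly as follows. This assumes Python's standard library; the item snippets, `title_text`, `as_text`, and `as_html` are all hypothetical illustrations, not how any particular aggregator works. The "HTML" reading here simply drops tags and converts entities to characters.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects character data, dropping tags and decoding entities."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def title_text(item_xml):
    """What the XML parser hands the aggregator for item/title."""
    return ET.fromstring(item_xml).findtext("title")

def as_text(value):
    """Text reading: the parsed characters are shown literally."""
    return value

def as_html(value):
    """HTML reading: tags are stripped and entities become characters."""
    parser = TextOnly()
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)

once = title_text("<item><title>The &lt;br/&gt; tag</title></item>")
twice = title_text("<item><title>The &amp;lt;br/&amp;gt; tag</title></item>")

for label, value in (("once", once), ("twice", twice)):
    print(label, "| text:", as_text(value), "| html:", as_html(value))
```

The once-encoded title only survives under the text reading, and the twice-encoded title only looks right under the HTML reading, which is why an undeclared content model forces aggregators to guess.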

How can an application know how to deal with it? Presumably it knows how to go from what it has stored to (X)HTML: do I take what he entered, and escape HTML-sensitive characters before putting it in an HTML page, or do I not? Unless you have an app which enforces well-formed XHTML, or a user who obsessively ensures that's what he produces, then you probably don't have XHTML that can safely be put inline, so you can either bung it in as escaped tag-soup HTML, or if your user has said that's what he wants you can remove the HTML and convert any entities to characters and put it in as text, or you can run it through HTML Tidy (good!) or do your own odd XHTMLification like Blogger (um...), and put it in as inline XHTML. But if your app doesn't know whether a literal less-than character followed by a p followed by a literal greater-than character is intended as HTML or as a plain-text example of the tag, you've got bigger problems than just choosing an Atom type attribute.

I need someone to write "the shorter Ringnalda."

Impossible to tell what type you wrote from the HTML rendering. Did you type a literal less-than then blink then a literal greater-than, that your weblog software entity-escaped before putting it in the source? Then you wrote text/plain. Did you type an ampersand, lt, semicolon, ...? Then you wrote text/html. Do you have weblog software that runs your stuff through either an XML parser or an XHTML validator? If not, you almost certainly didn't write XHTML.

Is this an official statement from the RSS Advisory Board that it is politically acceptable to include Atom elements in an RSS feed? Even Atom elements that duplicate the functionality of core RSS elements?

If so, can you give us a timetable of when you will be updating the RSS Political FAQ?

When user-drafted character data is stored in an XML format for publication on the Web, isn't it a given that the text will be escaped with entity codes or a CDATA block? Users don't do well-formedness.

In the XML parsing library I use in Java, XOM, the parser ignores the manner in which character data is encoded as a child element. It's regarded as "syntactic sugar" -- you don't know and can't find out whether it uses a CDATA block or not. You simply deal with the result. To me, that sounds like the best approach when HTML is being ferried over XML. The point of character data in XML, as I understand it, is to carry something that can't be intelligently parsed.
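The same behavior can be sketched outside Java. This example assumes Python's `xml.etree.ElementTree` as a stand-in for XOM (it is not XOM's API), and the two `description` elements are made-up samples. Both spellings of the character data come out of the parser as the same string.

```python
import xml.etree.ElementTree as ET

# The same HTML fragment carried two ways: entity-escaped and in a CDATA block.
ESCAPED = "<description>&lt;p&gt;Kalina gave birth.&lt;/p&gt;</description>"
CDATA = "<description><![CDATA[<p>Kalina gave birth.</p>]]></description>"

escaped_text = ET.fromstring(ESCAPED).text
cdata_text = ET.fromstring(CDATA).text

# After parsing there is no trace of which form the feed used.
print(escaped_text == cdata_text, repr(cdata_text))
```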

Perhaps there are applications out there waiting for a chance to deliver XML in a syndication format -- after making this post, I've been trying to come up with some examples of apps that would require XML to be delivered in content elements, making use of Atom's type attribute.

I'm not trying to discount the possibility there's a use-case for these attributes I'm not seeing. But since Atom's offered as a two-way Web publishing solution, and users draft Web content, the assumptions inherent in RSS 2.0 seem like the content model that will overwhelmingly be adopted in Atom anyway: "type='text/html' mode='escaped'".

Mark: You know the answer to that question. This isn't the RSS Advisory Board Web site, so nothing here is official.

A user talking here...

I see things like the Atom elements you describe, Rogers, as core or foundation items for the future. They may not have a use now, but an app developer may find a way to use them to make his/her app the one everyone wants. I mean that from a "non-feed-reading" standpoint. RSS 2.0 will be the simplest method for an app developer to implement *right now* because the syntax is so simple. Hell, I hand wrote the RSS for my wife's website: It was easy and I couldn't figure out the accepted, non-flame-worthy Atom way.

Right now is RSS's time. Later, it might be Atom's. But I'm with you--some of the Atom stuff is a great solution in search of a problem no one's thought of. That doesn't make it bad overall, but it might make it a bad choice now for rapid deployment.

So are you saying that it is *unofficially* acceptable to include Atom elements in RSS feeds? Even Atom elements that duplicate the functionality of core RSS elements?

Can anyone give me a timetable of when Radio might support this?

Marc Barrot has been working on an Atom tool for Radio. When I finish my current book, I'm going to try helping him or write a new tool that reads Atom feeds.

For dorks like me, there's no reliable way for software to programmatically determine the media type of user-drafted text. If I entered "Lou Montulli will pay for inventing the <blink> tag," have I just authored "text/plain" or "text/html"?

The Shorter Ringnalda answer would be that you're giving the reason right here. There is no way: and that's precisely the problem when it comes to displaying it without having a declared media type.

As an example, Entity encoded html is allowed in RSS 2.0, but it can't be labelled as such - so there's no way to know the difference between <blink> and &lt;blink&gt; and &amp;lt;blink&amp;gt; - and then whether or not to actually have it blink. By declaring the media type, you get rid of all of this confusion, at least within the specification. If the authoring tool you're using has issues with this (understandably, in my opinion) that's not a reason to destroy the necessary clarity of a specification.
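A rough sketch of how a declared media type removes the guesswork on the consuming side, assuming Python's standard library. `to_page_html` is a hypothetical helper, and the "text/plain" default mirrors the Atom default mentioned in the post. A real client would also sanitize anything it accepts as HTML; that step is left out here.

```python
import html

def to_page_html(value, media_type="text/plain"):
    """Turn parsed character data into markup for an HTML page.

    value is the string the XML parser returned; media_type is the
    declared type of the element it came from.
    """
    if media_type == "text/plain":
        # Plain text: every character is literal, so escape it for HTML.
        return html.escape(value, quote=False)
    if media_type == "text/html":
        # HTML: the string is already markup (unsanitized in this sketch).
        return value
    raise ValueError(f"unsupported media type: {media_type}")

# The same parsed string means two different things depending on the label.
print(to_page_html("<blink>", "text/plain"))  # shows the characters <blink>
print(to_page_html("<blink>", "text/html"))   # is a blink tag
```

Without the label, the function would have to pick one branch by guessing, which is the ambiguity described above.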

As an example, Entity encoded html is allowed in RSS 2.0, but it can't be labelled as such - so there's no way to know the difference between <blink> and &lt;blink&gt; and &amp;lt;blink&amp;gt; - and then whether or not to actually have it blink.

Entity-encoded HTML is displayed by RSS readers all the time with few show-stopping problems. Developers of RSS 2.0 clients expect elements to contain escaped HTML, so they convert it.

The post we are responding to contained the text <blink>, but it was not interpreted as a blink tag in Radio (and presumably other clients as well). The reason is that I composed it as &lt;blink&gt, knowing that RSS clients would convert escaped HTML.

Aargh. I composed it as &amp;ltblink&amp;gt;. I wonder if I'm undercutting my position with the difficulty explaining what I did in a manner this comment script is willing to accept.

Hosed it again! What I wrote was this:


Sometimes I think the escaped text issue seems bigger than it is because the authors of syndication tools run into it all the time when they present examples of the problem.

Alright, I'll amend that. It can't be shown for any of the fields in the base specification: It can't be shown for title or description. Yes, RSS 2.0 does have namespace support, and modules do exist that include the media-type definition, but the base spec does not - and the base spec is the most used, by far.

There are two parts to the problem: creating the feeds, and consuming them. Consuming them, as I explained, is ambiguous: with title and description in the base 2.0 specification, you cannot know definitively what to do with something that looks like escaped HTML without having a media type to base your parsing decision on.

Now, as Rogers points out, this need to declare the media-type comes from the default being text/plain and not text/html, and that this decision makes syntax more complicated than it needs to be. However, I think the reason behind this, and I don't know this for sure, as I have nothing to do with Atom, is that it is easier to have *everything* default to text/plain, and then declare it if it is different - as the majority of fields will just be text/plain.

A good example of this is the title element in 2.0 - as Phil says above, it's unclear what the content type is, so some people do put HTML in there and expect it to be parsed as such. Others put it in there and do not expect it to be parsed, they just want to call a post "<br/> is a lovely tag" without it actually breaking the line. Atom seems to allow them to do this and have them be *specific* about how it is treated. RSS 2.0 does not.

That blogging tools, and specifically their interfaces, will find it challenging to get this distinction across to their users, is true. I know this. WYSIWYG editors ahoy, I say. It's certainly an issue to address in the next generation of publishing tools. Either way, having a syndication format that is unambiguous about its content is a plus, I would say.

Related discussion

:-) I propose a new subset of problems, "Cadenhead Issues", in which they only exist when you try to explain them. It might save a lot of time. It's a bit like the old joke:

Doctor Doctor, my finger really hurts when I do this.
Don't do it, then.

Seriously, though, if this is true:

Developers of RSS 2.0 clients expect elements to contain escaped HTML, so they convert it.

Would it not solve all problems to say that "Description should always be assumed to contain escaped HTML" and be done with it? If it's so common, let's take the doubt away by amending the spec.

Eh... for non-technical people who create web content, the doctype declaration seems like an elegant solution in search of a problem. What about people who want to know how their stuff will render?! This is a really disingenuous dismissal.

Software developers know (or should know) exactly what processing they do on their users' input to display it within their publishing system, and hence emit in their syndication feeds. If they treat it as HTML, then it's HTML, if they treat it as plain text, then it's plain text. If they kinda throw their hands up in the air and let the browser figure it out, that's their option too and should be clearly communicated to consumers. The difficulty Rogers had in entering a comment on his own system is an example of such a user interface in action.

Further detail in Undefined RSS content type: not the user's fault.

I knew that escape mangling would come back to bite me in the ass. From now on, all of my examples will be encoded in Base64.

Truth is, you're just a bunch of power mongers trying to control the format and screwing the end user in the process.

Real world example -- the Blosxom atomfeed plugin looks at each entry to be included in the feed, and tests it for XML well-formedness. If the entry is well-formed, it emits it as 'application/xhtml+xml', if not, it emits it as CDATA.
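That approach might look roughly like the sketch below. It is written in Python and is not the plugin's own code; `is_well_formed` and `atom_content` are hypothetical names, and the wrapper `div`, the `mode` values, and the "text/html" label on the fallback are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def is_well_formed(fragment):
    """True if the entry parses as XML once wrapped in a single root element."""
    try:
        ET.fromstring(f'<div xmlns="{XHTML_NS}">{fragment}</div>')
        return True
    except ET.ParseError:
        return False

def atom_content(fragment):
    """Emit inline XHTML when the entry is well-formed, CDATA otherwise."""
    if is_well_formed(fragment):
        return (f'<content type="application/xhtml+xml" mode="xml">'
                f'<div xmlns="{XHTML_NS}">{fragment}</div></content>')
    # A literal "]]>" would end the CDATA block early, so split it in two.
    safe = fragment.replace("]]>", "]]]]><![CDATA[>")
    return (f'<content type="text/html" mode="escaped">'
            f'<![CDATA[{safe}]]></content>')

good = atom_content("<p>Hello</p>")
bad = atom_content("<p>Hello<br>")
tricky = atom_content("<p>x ]]> y")
print(good)
print(bad)
```

Either way the output is well-formed XML, and the declared type tells the consumer which of the two it received.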

Now was that so hard?

Mark: "Can anyone give me a timetable of when Radio might support this?"

Sharpreader, Bottomfeeder, Newsmonster, and WinRSS already support Atom elements in RSS 2.0. Three others are on the fence.

CDATA is one option for how the RSS 1.0 content module can work. It's nicely thought out prior art - August 2000 - and worth a look.

Rogers: yeah, my Universal Feed Parser supports it too. But that's different than being supported in a reference implementation like Radio.

Example feed:

Bloglines supports it in channel/title and item/title

Bottomfeeder says the channel/title is &iquest;... and the item title is iquest;

FeedDemon 1.0 doesn't seem to care for channel/title at all, but supports it in item/title

Opera 7.5 doesn't support entities for either title.

BlogMatrix Jäger likes the channel/title, but (possibly unrelated) doesn't want to display the item at all.

FeedExpress doesn't support entities for either title.

Sharpreader supports entities in both titles, in two out of three panes, but not in the subscription list pane.

I need a bigger suite of aggregators.

Radio UserLand displays an upside-down question mark in both channel/title and item/title for Phil's feed.

RSSBandit displays &iquest; in channel/title, and an upside-down question mark in item/title.

NetNewsWire Pro displays:

The ¿Que Pasa? feed

as the feed title and the same for the item title.

I meant that it displays ¿Que Pasa? properly in both instances.

Today's feed from The Register:

Funny. Though what kind of markup is that? Where's the closing tag?

The source reads like this (I added spaces after the amps)

Introducing & lt;cite& gt;Register& lt;/cite& gt; online training

So, NNW strips the slash. I don't know why, but I'm sure there's a good reason.
