Elegant Solution to Non-Existent Problem

An article by Mark Pilgrim in XML.Com states that "Really Simple Syndication is really only simple if you're doing it incorrectly," using the guid element as an example.

That's a bogus claim to make about guid, a globally unique string that serves a simple purpose: Making sure that an RSS reader doesn't show the same item twice.

Pilgrim's article provides a nice tutorial on how to normalize URLs for use as guid values, but he neglects to mention a salient fact: This solves a problem that no one is having.

URLs can be expressed in multiple ways and still reach the same place: Note how python.org/%7eguido and python.org/~guido both load the home page of Guido van Rossum.

I'm not aware of any RSS-producing software that creates guid URLs in multiple formats -- in other words, using example.com/~rogers/111 in one item and example.com/%7erogers/112 in the next.

That's the only problem that might be solved by forcing an RSS producer to normalize a URL (text replacements to change things like "%7e" to "~"), and even then, most feed readers are unlikely to need the help.
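For anyone who hasn't seen it, the kind of text replacement at issue can be sketched in a few lines of Python. This is only an illustration, not Pilgrim's code: the name decode_unreserved is made up here, and it covers a single normalization step (decoding percent-escapes for characters that never needed escaping in the first place).

```python
import re

# Characters that RFC 3986 calls "unreserved"; escaping them is never required.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def decode_unreserved(url):
    """Hypothetical helper: decode %XX escapes only for unreserved characters."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Leave escapes such as %2F in place (only uppercase their hex digits).
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, url)

a = decode_unreserved("http://python.org/%7eguido")
b = decode_unreserved("http://python.org/~guido")
print(a == b)  # the two spellings compare equal once the escape is decoded
```

A full normalizer would have more steps than this one, which is part of why it counts as extra work for producers.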

In RSS, all you need for the guid element is a globally unique string, which could be a URL or another naming scheme such as TAG URIs. You can make up your own -- as long as feed-producing software is self-consistent in how it creates unique strings, any will work fine as a guid.

One of the best reasons for the simplicity of RSS is to avoid creating unnecessary work for implementors. Any guid will be treated by feed-reading software as a string, regardless of whether it's a URL or not, so there's no benefit to the knowledge that a URL has been normalized.
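A minimal sketch of what that looks like on the reader side, assuming items arrive as dictionaries with a "guid" key (the function name and data layout are invented for illustration, not taken from any real feed reader):

```python
def new_items(items, seen_guids):
    """Return items whose guid has not been seen, comparing guids as plain strings."""
    fresh = []
    for item in items:
        guid = item["guid"]          # never parsed, only matched
        if guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append(item)
    return fresh

seen = set()
feed = [
    {"guid": "tag:example.com,2004:111", "title": "First"},
    {"guid": "http://example.com/%7erogers/112", "title": "Second"},
]
first_pass = new_items(feed, seen)   # both items are new
second_pass = new_items(feed, seen)  # same feed again, nothing is new
print(len(first_pass), len(second_pass))
```

Nothing in the comparison depends on whether a guid happens to be a URL, a TAG URI, or something else.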

I enjoy watching Pilgrim dive into this stuff, because there's no one on Earth who loves Internet specs more than he does, but I wish he'd abandon the Swift Boat Veterans for Atom approach and spend less time blasting RSS.

The simplicity of RSS has been proven by the popularity of the format, which can be bent but rarely broken. There wouldn't be 119,000 RSS feeds and counting if it were as complicated as he makes it out to be.

Though I've been working with RSS for several years and my hair is prematurely gray, I believe the two are completely unrelated.

Comments

We've been here before.

http://www.intertwingly.net/blog/2003/06/25/There-is-no-FAQ#c1056577438

but what about when multiple sites point to the same source? or when one site points to the same source, posted by multiple authors ( e.g. http://del.icio.us/ )?

"If you trust a guid provider to normalize URLs, why can't you trust them with the simpler task of generating guids in a self-consistent manner?"

it's not just self-consistent that we need to worry about; it's not S-CUIDs we're talking about.

i never said i trust anyone to carry out normalization, but that doesn't mean it wouldn't solve the problem if they did. this isn't any different from any other spec requirement. you can't force people to follow specs, but that's no reason not to have them.

"On the larger issue of URLs and normalization, Mark's suggestions don't provide anything close to full assurance of correctness."

doesn't "correctness" suggest that there's a problem to be corrected? i never said mark was perfect. you said no one has the problem mark discussed. i said i have that problem. if there are better solutions, i'd like to hear them, but you're just describing another problem. that doesn't mean the first one isn't still there.

"Also, if URL normalization is the way to go..."

if URL normalization is not the way to go, what is?

i can't address issues with atom or the normalized URL RFC, because i have nothing to do with either. but i use (and produce) feeds and i can see that if they had normalized GUIDs, they would be more useful to me.

Whether or not it's a problem depends on how much hope you combine with your pessimism. I doubt that there will ever be the one true source of feed-search, so I expect to always subscribe to multiple feeds that include the same items, but someday I'd like to have an aggregator that sees that one item appears in the original source feed that I subscribe to, two "PlanetFoo" feeds I also subscribe to, and three different keyword search feeds, and collapse all that into one item, even though some of the republishers saw that the original guid was a permalink, and thus a URL, and loaded it into a URI object in their language, and reserialized it in a different way, but no longer as a permalink (since it isn't *their* permalink), keeping my aggregator from being able to re-canonicalize it. I've been led to believe that's too hopeful, but I'm not ready to give up yet.

hmmm... I have this problem.

In RSS, I'm unclear on why someone picking up a feed item for redistribution in another feed would normalize a URL for use as a guid. Taking someone else's globally unique identifier and changing it, for any reason, seems like bad practice, because you lose the publisher's ability to assure that it is unique.

Where are you having a problem solved by URL normalization, Robert?

Scott: I don't know del.icio.us, but if I understand the question, as long as an RSS feed provider puts globally unique GUIDs in feed items, they can be used verbatim in any other feed.

Rogers: I meant with RSS guids in general. Pilgrim's diagnosis is spot on. Furthermore, this is syndication, so one client can consume an entry, normalize the GUID, and republish. See why normalization is a problem? Most RSS guids are URIs, and have all the problems that mark mentions.

If you want to build a truly interoperable software system, you have to be specific. Besides, as the author of the Universal Feed Parser, don't you think Mark might have a clue as to whether feed readers need the help?

The article is also incorrect about the WG's conclusions (though it was correct at the time it was written).

Robert, you say the article is wrong about WG conclusions - got a pointer to the current situation? I've not had a chance to follow the list much, but the Wiki still gives the impression of the id issue being open.

Rogers, I'm not sure canonical URIs offer much benefit, but I am sure URIs, particularly HTTP scheme URIs are a better choice than arbitrary strings because of their tendency to be unique (thanks in part to their common use with HTTP). They're part of a larger infrastructure, rather than determined on the whim/hack of a publisher.

I also don't think it's safe to say that simply because no-one is having a problem right now that it won't ever be an issue. It wasn't very long ago that the ambiguity of escaping HTML in content was "not a problem", despite a lot of fingers having pointed out the flaw. Then a very visible case appeared (Reuters' dropped tags).

Of course when there is a high-profile guid clash, you can be sure it'll be declared a Ruby/Pilgrim conspiracy again.

who is ruby pilgrim?

Ruby Pilgrim is Bull Mancuso's ex-girlfriend. They had a very messy breakup and we're still dealing with the consequences.

Furthermore, this is syndication, so one client can consume an entry, normalize the GUID, and republish. See why normalization is a problem?

From my perspective in RSS, normalization is the problem, not the lack of it. A GUID has a single purpose: to uniquely identify an item no matter whose feed it appears in.

If you start normalizing them when a GUID is a URL, you're adding work that's completely unnecessary for the GUID to serve its purpose. If Radio RogerLand generates GUID URLs with %7e in place of tilde, and always does that, it creates no problems for feed readers.

rogers said "they can be used verbatim in any other feed" which is true, but they aren't used verbatim, which is a real and current problem. and it's a problem that can be solved by URL normalization. here's two scenarios, both of which happen in real feeds today:

1) you publish a feed in application A. someone else writes application B, which parses your feed and republishes it. application B parses the URL and interprets it as a URL object rather than a string. when it recreates the string version of the URL, it formats it differently. (this may be a bad thing for application B to do, but the RSS spec doesn't say it is.) the two feeds point to the same place, but the URLs aren't the same. problem.

2) del.icio.us has a system that allows anyone to point to websites with annotation. everyone uses slightly different syntax when pointing to a website, so two people end up pointing to the same page with URLs that are canonically the same, but have different string representations. (maybe the users shouldn't do this, but they aren't going to read the RSS spec even if it tells them this.) problem.
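Scenario 1 can be sketched in Python, as an illustration only. Application B here is hypothetical, and the guid with an uppercase scheme is a made-up input, chosen because the standard library's urlsplit lowercases the scheme on the way through:

```python
from urllib.parse import urlsplit

original_guid = "HTTP://www.Example.com/~rogers/111"  # made-up guid from application A

# Application B (hypothetical) loads the guid into a URL object, then re-emits it.
republished_guid = urlsplit(original_guid).geturl()

print(republished_guid)                   # scheme comes back lowercased
print(republished_guid == original_guid)  # a plain string match now fails
```

Both strings reach the same place, but an aggregator comparing guids as strings sees two different items.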

URL normalization seems like the obvious solution to both of these problems. what is your alternative solution?

Boo hoo I want to change your guid. Boo hoo if I can't do it I have nothing to complain about. Boo hoo I'm so sad. Whine whine whine.

Changing a unique guid should never be done in my opinion. It is the same as getting data from a database, changing the record-ID by adding a few zeros and prefix in front of it and then complain when you feed the data back and the database does not recognize it.

If I publish a guid I expect other applications keep it. It's unique, it's mine, don't change it.

I'm sorry, but this is a problem that basically *everyone* is having. At least as most of this planet's population does actually *not* live in the U.S.

Just as every character set-related problem, RSS is only simple if you use the typical american's ignorant view regarding other languages. XML was amazingly simple until they found out that no one in Japan would be able to use it... and suddenly the spec became the monster it is today.

What apparently few understand here is that 2 strings that appear to be the same here, might just not be the same on the other side of this rock. Such ambiguities are a disaster waiting to happen.

[joke]
But depending on your vote in November the problem with non-ASCII countries might just go away anyway
[/joke]

"Changing a unique guid should never be done in my opinion."

saying "don't change the URL" is just enforcing normalization at the human level, and normalizing to the original rather than a universal standard. that has two problems: 1) people are less reliable than code and 2) what if the original is inaccessible? e.g. what if i only have access to a copy, and don't know if it's been changed or not? or what if there's no feed for what i want to point to?

"It is the same as getting data from a database, changing the record-ID by adding a few zeros and prefix in front of it and then complain when you feed the data back and the database does not recognize it."

the problem is that the database (i.e. webserver) does recognize it, but aggregator clients don't (unless they normalize). publishers aren't complaining about this problem - end users are. but end users aren't creating the problem - publishers are. so it doesn't work to blame the complainer.

mb, that's two domains hosting the same content (could just as easily be www.example.com and www.completelydifferentexample.com) and it's not something that can be solved by normalization or anything else. you could write "no replication of content on multiple domains" into the RSS spec, but no one would follow it because it's default behavior on many servers.

What do character sets have to do with guids? The byte sequence ordinarily isn't even meant to be interpreted, just matched. Two binary strings that are identical here, are identical everywhere. Even if you encode them (and you shouldn't), the results of encoding two identical strings would be identical.

Besides that rather large and obvious point, I'd love to see what RSS/web software you have that doesn't support URLs.

RSS is only simple if you use the typical american's ignorant view regarding other languages.

That's nice. But would you mind keeping your own ignorant, uninformed, bigoted, and flat-out wrong views about a large population of innocent people to yourself? It reflects poorly on your own education.

Scott: Your del.icio.us example involves using URLs for their real purpose -- referring to Web sites. That has nothing to do with the need for a unique string identifier for an RSS item. I don't think a feed reader should ever change the text of GUID before passing it on.

One of the reasons I like using the TAG URI for GUID values is because it avoids this confusion over the element's purpose. A GUID's just a string identifier, regardless of whether the GUID's a TAG URI, URL, or some arbitrary format created by the feed producer.

Just as every character set-related problem, RSS is only simple if you use the typical american's ignorant view regarding other languages.

Character set issues have nothing to do with RSS. It's an XML dialect; characters should be expressed in either UTF-8 (the default for XML dialects) or the set declared by the encoding attribute.

did you actually read Mark Pilgrim's articles? RSS inherited all of XML's character set problems. Reuters found that out when their feeds didn't work in most newsreaders. Not to mention my personal problems with encoding German Umlaute so they show up in all newsreaders. If you give a GUID semantic meaning you have to account for that because binary comparability does *not* work in these cases.

Even if I'm completely wrong, my not understanding should then be a clear indication that this issue needs a lot more documentation.

That aside.. the question on how to compare an URL is raised in the article, so I don't repeat that here.

@dak: Thanks for commenting on my education. I'm sorry that I offended a few million innocent americans (please, don't invade!). I'm also sorry that you're so easily offended. Now, please go and read up on why the notion of "binary comparison" brought us all these problems in the first place. Making uneducated guesses and obviously uneducated posts, chiming in on an issue you have an opinion on, but no idea of how it works... to invoke these stereotypes again: I'd say you're a typical american. Just that you're so easily offended... that somehow makes you a canadian in my mind. (I'm joking here!!! But understanding the technical problems of URLs, character sets and XML is important and Rogers' post shows exactly why these problems are here to stay. Because everybody thinks they're unrelated. Unless you happen to try and use American SOAP-Web-services in Russia, Japan, China, France, or anywhere else where US-ASCII isn't the only subset of UTF-8 you're using).

Mark gives a Unicode example in his article for starters.

Thanks.

I don't know what nation you're from, Joh, but its population is dense. :-).

Putting character set problems on RSS is inaccurate because they're an XML issue and should be addressed at that level. Neither RSS nor Atom is in a position to do anything about them.

A newsreader's XML parser determines how well it supports character sets and handles characters like German umlauts. Neither RSS nor Atom has anything to do with it. You'd run into the same problem in any XML dialect.

Germany.

Of course, in a way you're right. The problem starts with XML (actually I'd say it starts with "char *", but the XML spec is bad enough). As I said (post 20) RSS inherited this mess, it didn't create it. Mark wrote extensively about all of that. XML "solved" these issues by getting quite complicated to implement right, and most parsers don't. RSS supposedly is simple, because it doesn't care about these problems.

RSS and Atom both have to overcome the same difficulties, though, namely encoding the document in the specified charset (difficult enough) and then making sure that no field specifies "meaning" that might suddenly change with the encoding (RSS does nothing about that).

German Umlaute are now allowed in domain names in the DNS system. Different encodings might mean that you can't use a GUID as a URL, because you can't resolve the domain name in UTF-8 decomposed form (unfortunately, at the moment I don't know how to query a DNS-Server for such a domain name).

So either you say that a GUID is a binary bitwise comparable unique entity, or you say that it can be an URL, it can't be both. Any way, you've got to solve these problems, or they will come and stab you in the back. Most XML libraries tell you that they can encode a string in UTF-8, but that doesn't mean that if this string represents an URL it happens to be the same URL afterward. That's not the problem of the XML parser though, because it shouldn't care. The *string* stays the same.
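A small Python sketch of the composed/decomposed point, using a made-up URL (example.de and the path are placeholders, not a real guid). The two strings display identically but are different code point and byte sequences:

```python
import unicodedata

composed = "http://example.de/m\u00fcller"              # u-umlaut as one code point
decomposed = unicodedata.normalize("NFD", composed)     # "u" plus combining diaeresis

print(composed == decomposed)                                   # string comparison
print(composed.encode("utf-8") == decomposed.encode("utf-8"))   # byte comparison
print(unicodedata.normalize("NFC", decomposed) == composed)     # after re-composing
```

Whether a guid should ever be run through this kind of normalization is exactly what the thread is arguing about; the sketch only shows that the two forms don't match unless something does.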

[opinion]
I've watched a multi-million euro project fail because of wrong character encoding support in a customer database that should have been outsourced to India. A good side-effect is that the connected jobs stay in Germany. As RSS grows internationally (wildly), someone will run into these problems. Unfortunately the spec is frozen, so we're not willing to do anything about it.
[/opinion]

hmm... strings which don't change sound like the right answer.
pointing to www.scripting.com and scripting.com, can't be normalized in code, though they're identical, which is right?
best to look at the feed for an item and use whatever is there...

Rogers: "A GUID has a single purpose: to uniquely identify an item no matter whose feed it appears in."

Now you're getting it! Unfortunately, that's not the situation we actually face, where a guid may be a permalink, or may not, and if it is may be the same permalink that's in the link element, or may not, and so based on a complicated heuristic you sometimes need to use the guid as a URI, and some programming languages will then do things behind your back with it before you re-emit it. While you certainly should, while normalizing items, stuff guid in a string guid, and if it's also a permalink stuff it in a totally separate uri linkydinky, that doesn't mean everyone doing things with RSS completely anticipated the recent discussion where that stumbling block came to light, or that everyone else will in the future.

rogers, if del.icio.us was doing normalization of GUIDs, i wouldn't get notification of multiple posts pointing to the exact same content in my RSS feed. from my perspective, this is a problem normalization would solve. when you say that normalization is a problem no one has, are you saying this is not a problem, or that normalization wouldn't solve it, or that i don't exist?

"I don't think a feed reader should ever change the text of GUID before passing it on."

as i said before, that's solving the problem by normalization to the original. i'm having trouble reconciling this with your original post.

using TAG URI's doesn't change the issue of normalization at all, and unicode is completely off-topic.

Phil R is an arrogant bastard. What's your angle? Don't tell me you just want to make the world a better place (heard that before), because that can be done much better without being such a prick.

Scott: I'm saying that normalization of guids for use as guids is neither necessary nor desirable.

Normalization of URLs for other purposes is a completely unrelated issue.

@scott reynen: Unicode is off-topic as long as you don't include character set normalization in "your kind of normalization". Unfortunately, as soon as URLs can contain characters that are not included in US-ASCII it becomes a problem.

Hell, in the article this discussion is about, there even is an example of that. Is this Slashdot? Please read the article.

Anyway, normalization *is* the issue. It unfortunately has to include character sets, but at least you see the general problem. That's 2 steps from where Dave Whiner is currently standing.

rogers, why does it matter if it's a URL or not? any two strings that are different won't resolve to the same content without normalization, and that's a problem whenever both strings are meant to refer to the same content. this is just as true of proper names as it is of URLs. if i call myself "scott reynen" in one place and "Scott Reynen" in another (which i and - more relevant to this topic - others do), a universal system will need to normalize my name at some point to identify the two as pointing to the same entity (me). those aren't URLs. that's my GUID. i still don't see how this isn't a problem.

If you trust a guid provider to normalize URLs, why can't you trust them with the simpler task of generating guids in a self-consistent manner? If all of my weblog's guids used "%7e" in place of "~", they would serve as guids successfully for any feed consumer that did the right thing and left the guids unmodified.

On the larger issue of URLs and normalization, Mark's suggestions don't provide anything close to full assurance of correctness. Different servers use different index filenames, and sometimes the same server uses different ones in different directories.

Here on cadenhead.org, I'm using index.php in some places, index.shtml in others, and index.html or index.htm in the rest. For this reason, there's no way to know whether the following two URLs are identical or not, simply by looking at them:

http://www.cadenhead.org/
http://www.cadenhead.org/index.php

That's a much more common problem for a service like del.icio.us than someone putting "%7e" in place of "~".

Also, if URL normalization is the way to go, will Atom require that all URLs in an Atom feed be normalized, including the ones contained in weblog entry text?

How would my feed become more useful to you if I did the text substitutions necessary to normalize the URL in a guid? My guids are already unique. They don't become more unique through normalization.

i don't subscribe to your feed, so it's not going to be useful to me.

but hypothetically (and this is only hypothetical in regards to your particular feed. this is a real problem i have with feeds i'm currently subscribed to), if you and anyone who replicated your feed were both required in a spec to normalize to the same GUID (whether it be the URL normalization standard or any other standard, such as originality), i could subscribe to two feeds pointing to the same content and not waste my time reading the pointers of both posts.

normalization doesn't make GUIDs more unique; it makes them more global.

An earlier, possibly related thread.

if you and anyone who replicated your feed were both required in a spec to normalize to the same GUID (whether it be the URL normalization standard or any other standard, such as originality), i could subscribe to two feeds pointing to the same content and not waste my time reading the pointers of both posts.

You get the same benefit if people leave my guid alone when redistributing my items. If people are changing my guid, I think they are breaking it.

Do you have any examples of sites that pick up guids rather than generating their own? Looking at a few places I thought might do it -- Topic Exchange and Feedster -- I haven't found one yet.

Rogers - see

http://www.intertwingly.net/blog/2004/08/25/Preserving-Identity

(if you haven't already)
