Adding Atom 1.0 Support to RSS Sites

I switched to Atom 1.0 on Workbench two months ago, a move that hasn't been as smooth as I'd like because of one popular aggregator that doesn't support the format.

This site is created using Wordzilla, a LAMP-based weblog publishing tool that I've developed over the last year. Writing code to generate Atom feeds in PHP was extremely simple, since most of the code used to generate RSS feeds could be applied to the task.

Atom uses a different format for date-time values than RSS, so I had to write new date-handling code:

// get the most recent entry's publication date (a MySQL datetime value)
$pubdate = $entry[0]['pubdate'];
// convert it to an Atom RFC 3339 date; gmdate() formats in UTC, which is what the literal Z claims
$updated = gmdate('Y-m-d\TH:i:s\Z', strtotime($pubdate));
// add it to the feed
$output .= "<updated>{$updated}</updated>\n";

This produces a properly formatted Atom date element:

<updated>2006-05-27T11:03:17Z</updated>

One thing I haven't been able to do with Really Simple Syndication is indicate an item's author, because RSS requires that an e-mail address be used for this purpose. Spammers snarf up e-mail addresses in syndicated feeds.

Atom supports author elements that can be a username instead:

<author>
  <name>rcade</name>
</author>

The most significant difference between RSS and Atom is the requirement that Atom text elements specify the type of content that they hold, which can be HTML, XHTML or text.

The content type must be identified with a type attribute:

<content type="html"><![CDATA[I own some Home Depot stock ...]]></content>

My Atom feed offers the text of weblog entries as HTML markup:

// get the entry's description (a MySQL text value)
$description = $e['description'];
// add it to the feed
$output .= "<content type=\"html\"><![CDATA[{$description}]]></content>\n";

Putting this text inside a CDATA block removes the need to convert the characters "<", ">", and "&" to XML entities.
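
One caveat with CDATA is the sequence "]]>", which ends the block wherever it appears. The sketch below shows one common way around that, splitting the sequence across two CDATA sections. The helper name cdata_wrap is hypothetical and is not part of Wordzilla.

```php
<?php
// Hypothetical helper, not part of Wordzilla: wrap text in a CDATA section,
// splitting any "]]>" so it can't terminate the section early.
function cdata_wrap($text) {
    // "]]>" becomes "]]" + end of section + start of new section + ">"
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

echo cdata_wrap('a ]]> b'), "\n";
```

An XML parser reads the two adjacent sections back as the original text, so feed readers should see the entry unchanged.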

When an Atom element omits the type attribute, it's assumed to be text.

The following PHP code creates XML-safe text for entry titles:

// get the entry's title
$title = $e['title'];
// convert the title to XML-safe text
$title = utf8_encode(htmlspecialchars($title));
// add it to the feed
$output .= "<title>$title</title>\n";

The last difference I had to deal with is Atom's requirement that each entry have a title. Because I haven't written titles for all entries on Workbench, I wrote a function that can create a title from the first 25 characters of an entry's description, trimmed back to the last full word:

function get_text_excerpt($text, $max_length = 25) {
  $text = strip_tags($text);
  if (strlen($text) <= $max_length) {
    return $text;
  }
  $subtext = substr($text, 0, $max_length);
  $last_space = strrpos($subtext, " ");
  if ($last_space === false) {
    // no space to break on, so fall back to a hard cut
    return $subtext;
  }
  return substr($subtext, 0, $last_space);
}

I switched to Atom whole hog, dropping the RSS feed and redirecting requests to the new Atom feed.
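
A redirect like that takes only a few lines of PHP. This is a minimal sketch under my own assumptions, not the code Workbench uses: the function name is hypothetical and the feed URL is a placeholder.

```php
<?php
// Hypothetical sketch: permanently redirect an old RSS URL to the Atom feed.
// The target URL below is a placeholder, not the real Workbench address.
function feed_redirect_header($atom_url) {
    // strip CR/LF so the URL can't smuggle in extra headers
    $atom_url = str_replace(array("\r", "\n"), '', $atom_url);
    return 'Location: ' . $atom_url;
}

if (PHP_SAPI !== 'cli') {
    // 301 tells aggregators the move is permanent
    header(feed_redirect_header('http://example.com/atom.xml'), true, 301);
    exit;
}
```

The PHP_SAPI check only keeps the script from trying to send headers when it's run from the command line.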

I quickly reinstated the RSS feed because I'm getting 4,000 requests a week from subscribers running Radio UserLand, which doesn't support Atom 1.0. Trying to subscribe in the current version, Radio 8.2.1, results in the error message "Can't evaluate the expression because the name 'version' hasn't been defined."

That's the only popular aggregator I've tested that doesn't support Atom 1.0, though I've read that the OPML Editor's River of News also can't handle these feeds.

I'm not going to support both formats on new programming projects just for Radio, because its users ought to nudge UserLand to upgrade Atom support to version 1.0. I'd like to redirect RSS requests to the Atom feed so that all subscribers are seeing the same thing and sites like Bloglines offer one subscription count. But dropping existing RSS support makes little sense.

Atom's content type requirement is a great improvement to syndication, allowing publishers to specify exactly what they're using a feed to carry. The RSS engine built in to Microsoft's next version of Windows produces RSS 2.0 feeds that have an extra type attribute in each description, even though it's not defined in the spec.

Settlement Reached with Dave Winer

I've reached an agreement with Dave Winer regarding the Share Your OPML web application. I destroyed his original code and user data along with everything that was built from it and gave up my claim to a one-third stake in feeds.scripting.com. He gave up the claim that he's owed $5,000.

I originally hoped one of us would buy the other out and launch the application, but we found a much stronger basis for agreement in a mutual desire to stop working together as quickly as possible.

If Share Your OPML was a Java project I would've been heartsick to destroy it, but I coded the application in PHP. I've never written anything in PHP I didn't want to completely rewrite six months later.

Some people think I'm an asshat for taking this public, and I won't argue with that, but I don't have the resources to fight an intellectual property lawsuit against a millionaire. Winer knows this -- he's been a guest in my home -- and it's clear his attorney was acting from the same assumption throughout the settlement negotiation.

I decided the best way to avoid court was to show Winer what it would be like to sue a blogger.

I figured the publicity would be a stronger motivator to resolve the matter than anything I could say through an attorney. He's one of the most galvanizing figures in the technology industry. If he ever sues someone, the publication of the case's motions and depositions will put a blog in the Technorati Top 100. Since publishing the letter from Winer's attorney, my traffic's through the roof, I'm getting fan mail and I received three programming job offers.

I'm extremely grateful for the public support and the offers to contribute to a legal defense fund on my behalf, which I was hoping might lead to a Free Kevin-style sticker-based political movement.

Some programmers have said that I was foolish to write the app on the basis of a verbal agreement, and I'll concede that wholeheartedly. I won't even do the laundry now without something in writing.

I'm not going to close the book on this debacle with any Panglossian happy talk about how it all worked out for the best. This was a completely unnecessary sphincter-fusing legal dispute that could have been settled amicably months ago without benefit of counsel.

But I'm glad to stop pursuing an application so closely associated with OPML, because I don't share Winer's enthusiasm for the format.

I used to feel differently, but now that I've worked with it extensively, OPML's an underspecified, one-size-fits-all kludge that doesn't serve a purpose beyond the exchange of simple data. There's little need for an XML dialect to represent outlines. Any XML format is a hierarchy of parent-child relationships that could be editable as an outline with a single addition: a collapsed attribute that's either true or false.

Developers who build on OPML will encounter a lot of odd data because the format has been extended in a non-standard way. An outline item's type attribute has a value that indicates the other attributes which might be present. No one knows how many different attributes are in use today, so if you tell users that your software "supports OPML," you're telling them you support arbitrary XML data that can't be checked against a document type definition.

OPML's also the only XML dialect I'm aware of that stuffs all character data inside attributes. Now that OPML's being turned into a weblog publishing format, outline items will have ginormous attribute values holding escaped HTML markup like this:

<outline text="&lt;img src="http://images.scripting.com/archiveScriptingCom/2006/03/16/chockfull.jpg" width="53" height="73" border="0" align="right" hspace="15" vspace="5" alt="A picture named chockfull.jpg"&gt;&lt;a href="http://scobleizer.wordpress.com/2006/03/16/the-new-a-list/"&gt;Scoble laments&lt;/a&gt; all the flamers in the thread on &lt;b style="color:black;background-color:#ffff66"&gt;Rogers Cadenhead's&lt;/b&gt; site, but isn't it obvious that the &lt;i&gt;purpose&lt;/i&gt; of his post was to get a flamewar going? What non-flamer is going to post in the middle of a festival like that one? I'm not as worried about it as Scoble is, because I've seen better flamewars and I know how they turn out. In a few days he's still going to have to try to resolve the matter with me, and the flamers will have gone on to some other trumped-up controversy. The days when you could fool any number of real people with a charade like this are long past. And people who use pseudonyms to call public figures schoolyard names are not really very serious or threatening. &lt;a href="http://allied.blogspot.com/2006/03/lynch-mob-security.html"&gt;Jeneane Sessum&lt;/a&gt; is right in saying it's extreme to call this a lynch mob. It's just a bunch of &lt;a href="http://www.cadenhead.org/workbench/news/2881/letter-dave-winers-attorney#46458"&gt;anonymous comments&lt;/a&gt; on a snarky blog post. Big deal. Not.&nbsp;&lt;a href="http://www.scripting.com/2006/03/16.html#When:11:21:10PM"&gt;" created="Tue, 16 March 2006 11:21:10 GMT"/>

I'd be amazed if XML parsers can handle attribute values of any length, but that's what's being done today with OPML.

Now that an agreement has been reached, Winer doesn't have to share Share Your OPML and I can flee in terror before any border skirmishes lead to another XML specification war.

Maybe this is the best of all possible worlds.

Update: Winer appears to have launched a new PHP-based implementation of Share Your OPML with Dan MacTough.

Tracking Click Pings with PHP/MySQL

Earlier this week, Mozilla Firefox developer Darin Fisher announced that test builds of the browser include support for click pings, an experimental new HTML feature that makes it easier for web sites to track clicks on outgoing links:

I'm sure this may raise some eye-brows among privacy conscious folks, but please know that this change is being considered with the utmost regard for user privacy. The point of this feature is to enable link tracking mechanisms commonly employed on the web to get out of the critical path and thereby reduce the time required for users to see the page they clicked on.

Click pings work in web page markup by specifying one or more URLs in a link's ping attribute (an unofficial addition to HTML):

<a href="http://cnn.com" ping="http://drudge.com/receive-click-ping.php?url=http://cnn.com">Visit CNN</a>

When you click such a link using a development build of Firefox, the browser requests the ping link in the background as it loads the linked page. These pings can produce click usage reports.

I've created a new PHP class library, Poplink, that can receive click pings and report on the most popular links. It's released under the GPL and requires MySQL.
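
To show the receiving side in miniature, here is a sketch that is not Poplink's code. It assumes the clicked link arrives in a url query parameter, as in the markup above, and it tallies clicks in a plain array rather than a database. The function name record_click is hypothetical.

```php
<?php
// Hypothetical sketch of a click-ping tally; this is not Poplink's code.
// A real receiver would read $_GET['url'] and store counts in a database.
function record_click(array $counts, $url) {
    // accept only well-formed http/https URLs
    if (!preg_match('#^https?://#i', $url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return $counts;
    }
    $counts[$url] = isset($counts[$url]) ? $counts[$url] + 1 : 1;
    return $counts;
}

$counts = array();
foreach (array('http://cnn.com', 'http://cnn.com', 'not a url') as $ping) {
    $counts = record_click($counts, $ping);
}
arsort($counts);   // most popular links first
print_r($counts);
```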

Mozilla's been hammered by privacy advocates since Fisher broke the news -- Chris Messina of Flock writes, "I feel like a piece of me is dying as a result of this."

Don't believe the gripe: Click pings are an improvement on the present situation. Any web publisher already can track clicks using HTTP redirects, and many do -- all ad brokers use the technique to track clickthroughs. This is a clumsy process that puts a click-counting script between the originating page and the destination, causing links to point to local scripts rather than their real destinations.

If click pings are adopted by browser developers, web users desiring more privacy could turn off these pings like they turn off pop-ups and referrer tracking, gaining a measure of control that's not available to them today. This also has the side effect of improving Google, which gets more real links and fewer redirect scripts fed to its almighty algorithm.

Spammer Messes with My Headers

A few weeks ago, I mistakenly believed that I had closed a PHP mail form vulnerability that let spammers use my web server to send mail.

Another batch of penis enlargement and phentermine pitches was sent through my server last night, which I discovered when "rejected bulk e-mail" bounces found their way to me. A spammer exploited a mail script I had written that coded the recipient address like this:

$recipient = "info@ekzemplo.com";

I thought the script was secure because users couldn't change the recipient. As it turns out, there's another giant programming blunder in these lines of code:

$name = stripslashes($_REQUEST['name']);
$email = $_REQUEST['email'];
$subject = $_REQUEST['subject'];
$comments = $_REQUEST['comments'];

mail($recipient, $subject, $comments, "From: ".$name." <".$email.">\r\n" . "Reply-To: ".$email."\r\n");

As I learned last night, plugging user-generated fields into PHP's mail function leaves you susceptible to header injection, a technique that sends multi-line input to any field on a web form in the hope that each line will be interpreted as a mail header.

A spammer in Seoul, Korea, sent the following as the name field when calling the script:

to
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: transcribe
bcc: charleselegb@aol.com

The script interpreted each of those lines as a real e-mail header, so charleselegb@aol.com received an e-mail with the text "2977b873a006112f1567c66ac468690a", which I'm guessing is encrypted text that identifies my server and script. The spammer's running software that crawls the web and hits forms, noting any that successfully send mail back to a test account. A Google search for that e-mail address shows Charles has been busy.

I've written a function that removes multi-line input -- identified by newline and return characters -- and prevents a spammer from defining multiple e-mail recipients:

function sanitize_request_input($input) {
  // wipe out any input containing a carriage return or newline
  if (preg_match('/[\r\n]/', $input)) {
    $input = "";
  }
  $input = str_replace(";", " ", $input);
  $input = str_replace(":", " ", $input);
  $input = str_replace(",", " ", $input);
  $fields = explode(" ", $input);
  return $fields[0];
}

The function's called on any single-line text field contained on a form -- using it on multi-line textarea input would wipe out the text. The fix seems to have deterred Charles, who thoughtfully tried the exploit a few more times after I uploaded the new script, so I am again declaring victory over net abuse.
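
Another way to approach the same problem is to validate each field before it goes anywhere near a mail header. The sketch below is an assumption about how that could look, not the script running on my server. The function name build_mail_headers is hypothetical, and filter_var() requires a newer PHP than the code above was written for.

```php
<?php
// Hypothetical sketch, not the author's script: build From/Reply-To headers
// only from values that pass validation.
function build_mail_headers($name, $email) {
    // any CR or LF in a header field is treated as an injection attempt
    if (preg_match('/[\r\n]/', $name . $email)) {
        return false;
    }
    // FILTER_VALIDATE_EMAIL rejects spaces, commas, and extra recipients
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }
    // drop characters that could break out of the display name
    $name = trim(str_replace(array('"', '<', '>'), '', $name));
    return "From: \"{$name}\" <{$email}>\r\n" . "Reply-To: {$email}\r\n";
}

var_dump(build_mail_headers("Rogers", "rogers@example.com"));
var_dump(build_mail_headers("to\r\nbcc: someone@example.com", "rogers@example.com"));
```

A caller would skip the mail() call entirely whenever the function returns false.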

This is part two in an ongoing series.

UserLand Frees Up Manila Servers

UserLand Software is discontinuing free Manila hosting, as I discovered last week when one of their users sought refuge on Buzzword.Com. Edit This Page shut down free service on Dec. 1 and ManilaSites will do the same Dec. 31.

I can offer free hosting on Buzzword, but webloggers who are committed to publishing with Manila should be advised that I'm migrating the server to new software by May 1, 2006. A better long-term option for those folks is to subscribe to Weblogger.Com or UserLand.

(As an aside, if you're a fan of a long-running blog on one of those servers, this would be the ideal time to donate a year's hosting. Moving a weblog in a hurry is a huge pain in the ass.)

I've found in 18 months of running Manila that I'm genuinely bad at it. Server uptime has been lousy, because you have to know enough to counter the enormous amount of abuse that comment and referral spammers dish out on a weblog server hosting 3,000 users. Every month or so, I get another "it's not you, it's me" letter from a Buzzword user who wants to break up but is afraid to sound ungracious. The most recent were David Golding and Julian on Software, and I'm pretty sure that Craig Jensen wants to start seeing other people.

I have a fighting chance against net abuse on a Linux box running Apache, MySQL and PHP, because I've been hacking away on one for more than five years. I knew I had reached a significant milestone in my quest for m4d sk1llz last spring when Workbench survived 500,000 hits in two days.

Next year, Buzzword will become an ad-supported free weblog host running entirely on Linux and other open source software. WordPress has a new multi-user version that's currently being beta tested. I suspect it will be the publishing tool that I choose.

Sites that are still on Buzzword at the time of the upgrade next year will be automatically migrated, so publishers can see whether they should stick around. I moved a Manila weblog to WordPress this weekend for a work project and it was easy -- the software supports RSS 2.0 as an import format.

In the meantime, Buzzword users may experience outages of unexplained origin for indeterminate length.

Closing a PHP Mail Form Vulnerability

I wrote a PHP script that accepts e-mail from web site visitors using a feedback form. The script works with different sites, routing mail to the right inbox with a hidden field on the form:

The who field doesn't specify an e-mail address, because that would be easy pickings for spammers. They crawl the web looking for e-mail scripts that can be configured to send e-mail to any recipient they specify.

Instead, my script was written to send mail only to accounts on my server:

$recipient = $_REQUEST['who'];

if ($recipient == "") {
  $recipient = "rogers@ekzemplo.com";
} else {
  $recipient = $recipient . "@ekzemplo.com";
}

Recently, a spammer found a way to make my script send e-mail to anyone on any server, generating hundreds of spams on my machine over a space of four days.

I'm curious to see if any programmers spot the giant honking vulnerability in the preceding code that I missed for months.

My Reign as the King of Pings

I've been running Weblogs.Com since June for Dave Winer, who wanted to see if service performance could be improved as he began to receive seven-digit inquiries about selling it.

Weblogs.Com ran on Frontier for six years from its founding in 1999, handling the load reasonably well until the number of pings topped one million per day within the last year.

In a frenzied weekend, I recoded the site as an Apache/MySQL/PHP web application running on a Linux server, writing all of the code from scratch except for XML-Simple, an XML parsing library I adapted from code by Jim Winstead. Hosting was provided by ServerMatrix, which charges around $80-$140/month for a dedicated server running Red Hat Enterprise Linux 3 with a 1,200-gigabyte monthly bandwidth limit.

On an average day, my application served 34.65 gigabytes of data, took 1.1 million pings and sent 11,000 downloads of changes.xml, a file larger than 1 megabyte. The LAMP platform is ideal for running a high-demand web application for as little money as possible.

When Dave rerouted Weblogs.Com to my new server and it instantly deluged the box with more than a dozen pings per second, I felt like Lucy Ricardo pulling chocolates off the conveyer belt.

The server ran well, crashing only a few times over four months because of a spammer sending thousands of junk pings per minute. Every few days, I used the iptables firewall to block requests from the IP addresses of the worst abusers.

Business reporter Tom Foremski and others have suggested that the Weblogs.Com sale might reveal a lack of faith in blogging as a business.

I think the sale was motivated by the realization that the demands of running Weblogs.Com had become much too large for Dave's one-man company. He could either hire people and start pursuing revenue opportunities or sell the service.

VeriSign got a good deal acquiring it for a reported $2 million. The company's now at the center of the blogosphere, a giant web application and information network with more than 15 million users, and ought to be able to leverage those pings into new services built on XML, XML-RPC and RSS.

One thing I'd like to see is a real-time search engine built only on the last several hours of pings, which could be a terrific current news service if compiled intelligently. While I was running Weblogs.Com, I wanted to use my brief moment as the king of pings to extend the API, which VeriSign appears to be considering, but Dave didn't want to mess with things while companies were loading a truck with money and asking for directions to his house.

I want to pursue these ideas, either independently or in concert with VeriSign and Yahoo Blo.gs. No knock intended, but big companies tend to sit on purchases like this rather than implementing new features. Blogger still lacks category support two years after being purchased by Google, an omission so basic you have to wonder whether it's serious about fending off competition from Six Apart, UserLand, and WordPress.

Displaying XML Data with PHP

I recently finished writing Sams Teach Yourself Programming with Java in 24 Hours, the fourth edition of an introductory book for Java programmers, which comes out in around two weeks.

I've been given wide editorial license with the book, so it contains unusual projects like Lottorobics, a lottery simulation applet that demonstrates why "Win the Lotto" is a terrible retirement plan.

The new edition adds chapters on XML and XML-RPC that use XOM and Apache XML-RPC, two great open source class libraries for Java. Programming projects in these chapters enable user lottery results to be tracked -- the applet sends the data to an XML-RPC server, which stores it in an XML file. In the years I've been offering the applet, users have won the lottery once in 4,877 simulated years of playing at a cost of $5.4 million.

I wanted to display this lottery data on the web using PHP, so I've adapted some code for this purpose. The first release of the code, which I'm offering under the open source GNU General Public License, shows how to read simple XML data with PHP.
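
For readers who just want the flavor of reading simple XML in PHP, here is a rough sketch using the SimpleXML extension built into PHP 5. It is not the code from my release, and the element names and numbers in the sample document are invented for illustration.

```php
<?php
// Hypothetical sketch using PHP's built-in SimpleXML extension, not the
// released code. The element names and values are made up for illustration.
$xml = <<<XML
<results>
  <player name="alice"><years>120</years><wins>0</wins></player>
  <player name="bob"><years>300</years><wins>1</wins></player>
</results>
XML;

$results = simplexml_load_string($xml);
$total_years = 0;
foreach ($results->player as $player) {
    // attributes are read with array syntax, child elements as properties
    $total_years += (int) $player->years;
    echo htmlspecialchars((string) $player['name']), ': ', (int) $player->wins, " win(s)\n";
}
echo "Total simulated years: {$total_years}\n";
```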

If you do anything interesting with the code, and you don't mind releasing your work under the GPL, I'd like to add a few more sample programs to this project.

Henri Bergius has incorporated code from my Weblog Pinger Library for PHP into the Midgard Content Management System.

This represents a Sally Field moment for me, the first time that any code I've written has made its way into another project thanks to an open source license. I'm going to celebrate my increased geek cred by buying something like this.

Changing Weblog Software is Drudge Work

I just finished moving the Drudge Retort from Movable Type to Wordzilla, my PHP/MySQL software that runs Workbench, giving all 14,400 weblog entries and 233,000 user comments a new home. The project took 10 days, around eight more than I expected.

The Retort is emulating Daily Kos by giving site visitors the tools to create their own blogs. I'm going to choose interesting user blog entries for the main page and home page to run alongside my own blog entries -- I've always wanted to give the kids a chance to drive the family car.

There are 270 users who've written blog entries. One of the most active is Niceville, a relentless contributor whose politics lie to the right of Alan Keyes, even though the Retort leans left. User blogs support visitor comments, RSS feeds, page caching, membership, and ping notification. Anyone who wants to try it out can join the site and begin blogging.

From now on, I'll be multiplying all project time estimates by five. My next book, an irreverent beginner's title on Java 2 version 1.4, will be completed in November 2007. Pre-order it today.

Let's Put Everything on the Table

Of all the insults I received for popesquatting, the ones that stung the most were about my web skills, such as this comment on MetaFilter:

Eh, his website needs work. The text overflows the white box and he must've used the nowrap attribute as there is a hideous amount of rightwards scrolling. pls fix ur website b4 u sho it to teh whirled, pls ok tks.

Ouch. F U 2.

I like three-column designs, so I lay out my sites with HTML tables, often putting ads in the rightmost column. This lends itself to a creative trick some cranks like to employ -- putting a really long word in a comment to hose my layout and push ads way off the page, depriving me of money I need to put food on my family.

I'm currently moving the 14,000 weblog entries and 232,000 comments on the Drudge Retort from Movable Type to my Wordzilla PHP/MySQL software, so I wanted to solve this long-word problem.

I can't use PHP's wordwrap() function, which breaks long words that exceed a set maximum, because my weblog comments include hyperlinks. Any URL longer than the maximum would be broken.
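
Here's a quick demonstration of the problem, using a placeholder URL and an arbitrary width of 20 characters:

```php
<?php
// With the fourth argument set to true, wordwrap() cuts any "word" longer
// than the width, and a URL is just a long word as far as it's concerned.
$comment = 'See http://example.com/a/very/long/path/to/some/page.html for details';
$wrapped = wordwrap($comment, 20, "\n", true);
echo $wrapped, "\n";

// the original URL no longer appears intact in the wrapped text
var_dump(strpos($wrapped, 'http://example.com/a/very/long/path/to/some/page.html'));
```

Leaving the fourth argument false keeps the URL whole, but then a long junk word sails through untouched, which is the exact thing I'm trying to stop.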

I found a nice open source PHP script by Brian Huisman, htmlwrap, that solves this problem:

htmlwrap() is a function which wraps HTML by breaking long words and preventing them from damaging your layout. This function will NOT insert br tags every "width" characters as in the PHP wordwrap() function. HTML wraps automatically, so this function only ensures wrapping at "width" characters is possible. Use in places where a page will accept user input in order to create HTML output like in forums or blog comments.

Weblog Comments Near and Far Out

I'm coding this weblog myself in PHP and MySQL, writing software that I will eventually release under the name Wordzilla. A new recent comments sidebar on Workbench makes it easier to follow active discussions on old weblog entries.

Running a weblog with open comments attracts some unusual discussions when people using a search engine find familiar names in an old entry. For two years, Workbench has hosted an ongoing soap opera between the current and former spouses of Atlanta Journal-Constitution reporter Ron Martz.

The sidebar also will show how much comment spam I have to weed out, even though I refuse comments with three or more links and actively ban senders. In the last six months, I have banned 1,263 IP addresses used by spammers. They haven't gotten the message -- an additional 21,043 attempts have been rejected from those addresses.
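
The two rules are simple enough to sketch in a few lines. This is an illustration under my own assumptions, not the Wordzilla code: the function names, the link-matching pattern, and the sample addresses are all hypothetical.

```php
<?php
// Hypothetical sketch of the two rules described above; the function names,
// the regex, and the sample ban list are assumptions, not Wordzilla code.
function count_links($comment) {
    // count anything that looks like the start of a URL
    return preg_match_all('#https?://#i', $comment, $matches);
}

function accept_comment($comment, $ip, array $banned_ips) {
    if (in_array($ip, $banned_ips, true)) {
        return false;                   // sender was banned earlier
    }
    return count_links($comment) < 3;   // refuse three or more links
}

$banned = array('192.0.2.10');          // documentation-range address
var_dump(accept_comment('Nice post, see http://example.com', '192.0.2.55', $banned));
var_dump(accept_comment('http://a.example http://b.example http://c.example', '192.0.2.55', $banned));
var_dump(accept_comment('Hello', '192.0.2.10', $banned));
```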

Extended Pings in Weblog Pinger

RSS support was added to Weblogs.Com this morning, making it possible to send an extended ping message to the service that includes the address of a site's RSS feed.

This will make it easier for services that are built atop Weblogs.Com, such as Technorati and GigaDial, to incorporate RSS feeds.

I have extended my pinger to support this new feature.

Weblog-Pinger, an open source class library for PHP, can send update notification pings over five XML-RPC services that monitor new weblog content.
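
For the curious, an extended ping is an ordinary XML-RPC call with the feed address added. The sketch below builds the request body by hand. It is not Weblog-Pinger's code, the function name is hypothetical, and the parameter order (site name, site URL, URL to check, feed URL) is the convention as I understand it, so check the service's documentation before relying on it.

```php
<?php
// Hypothetical sketch, not Weblog-Pinger's code: build the XML-RPC request
// body for a weblogUpdates.extendedPing call. The parameter order is an
// assumption to verify against the ping service's documentation.
function build_extended_ping($name, $url, $check_url, $feed_url) {
    $body = "<?xml version=\"1.0\"?>\n<methodCall>\n";
    $body .= "<methodName>weblogUpdates.extendedPing</methodName>\n<params>\n";
    foreach (array($name, $url, $check_url, $feed_url) as $value) {
        // escape &, < and > so the values are XML-safe
        $value = htmlspecialchars($value);
        $body .= "<param><value><string>{$value}</string></value></param>\n";
    }
    return $body . "</params>\n</methodCall>\n";
}

echo build_extended_ping('Workbench', 'http://example.com/', 'http://example.com/', 'http://example.com/rss.xml');
```

The body would then be sent as an HTTP POST with a Content-Type of text/xml to the service's XML-RPC endpoint.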

Server Attacked at Random

My server has been under attack for three days by a user in Colorado who requested the same URL 8.3 million times (and counting).

The user, making simultaneous connections from eight IP addresses in a block controlled by Time Warner Telecom, requested a URL on URouLette that redirects to a random web site -- as many as 30 requests a second to a PHP script that made a MySQL database connection. I'm guessing the motive was to acquire web addresses for e-mail harvesting or some other form of net abuse.

By yesterday morning, the requests were crashing everything on the server that could be crashed. It's a sign of how well Linux, Apache, MySQL, and PHP work that it took so long to bring down the box.

After sending an e-mail to the ISP's abuse address, I tried to solve the problem by adding a Deny from directive to the Apache configuration that blocked the user's access to the site:


Order allow,deny
Deny from 66.195.191
Allow from all

After restarting Apache, the abuser's requests were rejected with HTTP status code 403 Forbidden.

This worked briefly, preventing MySQL from running out of connections, but after a few hours Apache began freezing up and would serve no requests.

I wasn't able to fix this until I started the iptables service firewall on my server and told it to completely block the offending IPs with commands like this:

/sbin/iptables -I INPUT -s 192.0.34.166 -j DROP

This appears to have worked.

After 24 hours, I'm still waiting to hear from a human at Time Warner Telecom's abuse desk. My own hosting provider, ServerMatrix, has been fast to respond but doesn't seem inclined to contact the other company. I was hoping they could talk admino a admino.

Idiotically enough, the data that the user expended 100GB of my bandwidth trying to get is freely available on the web. URouLette makes use of Open Directory Project data, sending visitors to random sites that its editors have marked as "cool."