Displaying Twitter Updates on a Web Page

I recently began using Twitter, a microblogging service for posting short, chat-like blog entries and reading what other users of the service are doing. The site has severe reliability problems, but it's still an entertaining way to get real-time updates from bloggers I read along with others I know who've been sucked into Twitter's maw.

I wrote some code to display my most recent Twitter update on my weblog, Workbench, in a sidebar at upper right. This afternoon, I've released the Twitter-RSS-to-HTML PHP script under an open source license. The script requires MagpieRSS for PHP, an open source PHP library that can parse RSS and Atom feeds.

MagpieRSS caches feed data, so at times when Twitter is glacially slow or can't be accessed, this script won't hurt the performance of your server.

The first release of the script only works with a Twitter user's RSS feed, which can be found in the "RSS" link at the bottom of a user's Twitter page. The only tough part about writing the script was creating regular expressions to turn URLs into hyperlinks and "@" references into links to Twitter user pages:

// turn URLs into hyperlinks
$tweet = preg_replace("/(http:\/\/)(.*?)\/([\w\.\/\&\=\?\-\,\:\;\#\_\~\%\+]*)/", "<a href=\"\\0\">Link</a>", $tweet);
// link to users in replies
$tweet = preg_replace("(@([a-zA-Z0-9]+))", "<a href=\"http://www.twitter.com/\\1\">\\0</a>", $tweet);
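Applied to a sample update, the two substitutions produce markup like this (the sample text is mine, not from the script's documentation):

```php
// apply the script's two substitutions to a sample update
$tweet = "Reading @jayrosen at http://example.com/article now";

// turn URLs into hyperlinks
$tweet = preg_replace("/(http:\/\/)(.*?)\/([\w\.\/\&\=\?\-\,\:\;\#\_\~\%\+]*)/", "<a href=\"\\0\">Link</a>", $tweet);
// link to users in replies
$tweet = preg_replace("(@([a-zA-Z0-9]+))", "<a href=\"http://www.twitter.com/\\1\">\\0</a>", $tweet);

echo $tweet;
// Reading <a href="http://www.twitter.com/jayrosen">@jayrosen</a> at <a href="http://example.com/article">Link</a> now
```

Note that the URL pattern requires a slash after the host name, so a bare "http://example.com" with no trailing path is left unlinked.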

If you're reading this and wondering why anyone should bother with Twitter, I recommend reading the updates by Jay Rosen, a former university journalism chair who uses the service to share a running dialogue on the media. He punches above his weight in this 140-character-or-less medium.

How to Crash Your Apache Server with PHP

I returned from a trip out of town Monday to crashing web servers that ate my lunch all week long. For several days, I used the top command in Linux and watched helplessly as two servers ground to a halt with load averages higher than 100.

Top reports the processes that are taking up the most CPU, memory and time. On the server running Workbench, the culprit was always httpd, the Apache web server. This didn't make sense, because Apache serves web pages, images, and other files with incredible efficiency. You have to hose things pretty badly to make Apache suck.

If you know the process ID of a server hog, Apache can tell you what that process is doing in its server status report, a feature that requires the mod_status module. The status report for Apache's own web site shows what these reports look like.

Using this report, I found the culprit: A PHP script I wrote to receive trackback pings was loading the originating site before accepting the ping, which helps ensure it's legit:

// make sure the trackback ping's URL links back to us
$handle = fopen($url, "r");
$tb_page = '';
while (!feof($handle)) {
    $tb_page .= fread($handle, 8192);
}
fclose($handle);
$pos = strpos($tb_page, "http://www.cadenhead.org/workbench");
if ($pos === false) {
    $error_code = 1;
    send_response(1, "No link found to this site.");
    exit;
}

Most trackback pings are not legit -- I've received 600 from spammers in just the past three hours. Each ping required Apache to check the spammer's site, download a page if it existed, and look for a link to Workbench. A single process performing this task could occupy more than 50 percent of the CPU and run for a minute or more.
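The post doesn't show a fix, but one obvious mitigation -- a sketch under my own assumptions, not the author's code -- is to cap the time and bytes spent verifying each ping so a hostile source can't occupy a process for a minute:

```php
// hedged sketch, not the author's fix: bound the time and size of the
// verification fetch (the 3-second timeout and 100 KB cap are illustrative)
function fetch_with_limits($url, $timeout = 3, $max_bytes = 102400) {
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeout)
    ));
    $handle = @fopen($url, "r", false, $context);
    if ($handle === false) {
        return "";
    }
    $page = "";
    while (!feof($handle) && strlen($page) < $max_bytes) {
        $page .= fread($handle, 8192);
    }
    fclose($handle);
    return $page;
}
```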

I'm surprised Apache ran at all after I added trackback a couple months ago. I was beginning to think the web server software was idiot-proof, but I've proven otherwise.

Loading Ad Javascript with PHP

I serve ads on the Drudge Retort using Blogads, a great ad broker that occasionally has trouble serving the ads. When this happens, pages on the Retort load more slowly because they can't fetch a Javascript program and CSS stylesheet required by Blogads.

I decided to fix this problem by writing Cache Remote File, a PHP script that performs three functions:

  • Save a cached copy of a remote file
  • Display the cached copy for 10 minutes before requesting the file again
  • Display the cached copy when the remote server is offline or slow

The script will give up trying to load the remote file after three seconds, which keeps it from hanging when the remote server is having difficulties. It can be customized to load any URL of any content type and requires PHP 4 or higher with cURL support. I've released it under the GPL. Let me know if you have any problems with it or can improve the script.
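The released script isn't reproduced in this post; a minimal sketch of the approach it describes, with illustrative constants and file handling, might look like:

```php
// illustrative sketch of the cache-then-fetch approach (not the released script)
define('CACHE_LIFETIME', 600);  // serve the cached copy for 10 minutes
define('FETCH_TIMEOUT', 3);     // give up on the remote server after 3 seconds

function cache_is_fresh($cache_file, $max_age) {
    return file_exists($cache_file) && (time() - filemtime($cache_file)) < $max_age;
}

function load_remote_file($url, $cache_file) {
    if (!cache_is_fresh($cache_file, CACHE_LIFETIME)) {
        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_TIMEOUT, FETCH_TIMEOUT);
        $data = curl_exec($curl);
        curl_close($curl);
        if ($data !== false) {
            // refresh the cache only on a successful fetch
            $fp = fopen($cache_file, "w");
            fwrite($fp, $data);
            fclose($fp);
        }
    }
    // fall back to the cached copy when the remote server is offline or slow
    return file_exists($cache_file) ? file_get_contents($cache_file) : "";
}
```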

Weblog Pinger Extended with MySQL Database

I've added a MySQL database to Weblog-Pinger, my weblog update notification class library for PHP, so that it can track ping attempts and keep from hitting the same server too often.

Some notification services reject pings sent too frequently. When I was the king of pings for six months in 2005, Weblogs.Com rejected pings sent more frequently than once per half-hour. If you try to ping Ping-O-Matic too often today, you get the error message "Pinging too fast. Slow down cowboy. Please ping no more than once every 5 minutes."

The Drudge Retort uses Weblog-Pinger to send a ping to Ping-O-Matic, Technorati and Weblogs.Com whenever a new story hits the front page or site members update their blogs. Until this release, I hadn't limited the frequency of pings.

Perhaps as a consequence, I'm having trouble getting Technorati to index the Retort. I ping the site and get back a successful response, but Technorati last accepted a Retort post 155 days ago and the Drudge Retort page on Technorati claims the blog doesn't exist.

It looks like Technorati is intentionally ignoring the pings in the mistaken belief that the Retort is a splog (spam blog). Technorati did some spring cleaning in March to detect and remove more splogs, which are a huge problem in the blogosphere. Technorati's a useful tool to find responses to your weblog posts and let other bloggers know you've linked to them. Being omitted from the index is a revoltin' development.

Weblog-Pinger, which is open source code released under the GPL, won't send a ping more than once per five minutes to any server for any URL.
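Stripped of the database plumbing, the rule reduces to a timestamp comparison; a sketch (the function name and schema hinted in the comment are mine, not part of Weblog-Pinger's API):

```php
// hedged sketch of a once-per-five-minutes rule; the real library tracks
// attempts in MySQL, e.g. selecting the last ping time by server and URL
define('PING_INTERVAL', 300);  // five minutes, in seconds

function ping_allowed($last_ping_time, $now) {
    // $last_ping_time is 0 when no previous attempt has been recorded
    return ($now - $last_ping_time) >= PING_INTERVAL;
}
```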

Defending WordPress MU from Splog Abuse

Over the weekend most of my new WordPress MU weblog servers were hit by splogs -- spam blogs created by bots and filled with links to commercial sites.

I added a WordPress hacker's unofficial patch that requires users to fill out a captcha to create a new blog. The patch modifies wp-signup.php and adds a new file, wp-valid.php, that generates the captcha graphic using code from the Quick Captcha PHP script.

The first two active blogs to spring up on these servers are Political Fretwork and the Ad Whisperers.

Update: I don't like how captchas break accessibility for visually impaired people, so I'm looking for a way to prevent that.

Detecting Weblog Spam with Comment Flak

Because I don't want to add captchas to Workbench, this weblog has been drowning in comment spam. Since I began accepting comments in September 2002, I've received 13,000 legitimate comments and 172,000 spams.

I'm trying a new technique this week that makes spam easy to detect by putting a bunch of bogus text areas on a weblog form, hiding them with Cascading Style Sheets, and checking them for input when the comment is submitted. I call these fields comment flak.

Spammers typically put their junk comment in every text area on a form. When text shows up in any of these flak fields, my blogging software treats it as spam.
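A sketch of the server-side check, with illustrative field names (the released Comment-Flak library may differ):

```php
// hedged sketch with illustrative field names; the form hides each flak
// text area with CSS, e.g. <textarea name="comment2" style="display: none">
function is_flak_spam($post, $flak_fields) {
    foreach ($flak_fields as $field) {
        // spammers typically fill every text area on the form
        if (isset($post[$field]) && trim($post[$field]) != "") {
            return true;
        }
    }
    return false;
}
```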

I've written a new Comment-Flak library for PHP that makes it easy to use this technique on any weblog published with PHP.

So far, 100 percent of the spam submitted to this weblog has been caught by this technique. This will drop if the technique becomes popular, but I'm hoping people will offer tips on how to make it harder to beat. The code has been released as open source under the GPL.

Adding Atom 1.0 Support to RSS Sites

I switched to Atom 1.0 on Workbench two months ago, a move that hasn't been as smooth as I'd like because of one popular aggregator that doesn't support the format.

This site is created using Wordzilla, a LAMP-based weblog publishing tool that I've developed over the last year. Writing code to generate Atom feeds in PHP was extremely simple, since most of the code used to generate RSS feeds could be applied to the task.

Atom uses a different format for date-time values than RSS, so I had to write new date-handling code:

// get the most recent entry's publication date (a MySQL datetime value)
$pubdate = $entry[0]['pubdate'];
// convert it to an Atom RFC 3339 date
$updated = date('Y-m-d\TH:i:s\Z', strtotime($pubdate));
// add it to the feed
$output .= "<updated>{$updated}</updated>\n";

This produces a properly formatted Atom date element:

<updated>2006-05-27T11:03:17Z</updated>

One thing I haven't been able to do with Really Simple Syndication is indicate an item's author, because RSS requires that an e-mail address be used for this purpose. Spammers snarf up e-mail addresses in syndicated feeds.

Atom supports author elements that can be a username instead:

<author>
  <name>rcade</name>
</author>

The most significant difference between RSS and Atom is the requirement that Atom text elements specify the type of content that they hold, which can be HTML, XHTML or text.

The content type must be identified with a type attribute:

<content type="html"><![CDATA[I own some Home Depot stock ...]]></content>

My Atom feed offers the text of weblog entries as HTML markup:

// get the entry's description (a MySQL text value)
$description = $e['description'];
// add it to the feed
$output .= "<content type=\"html\"><![CDATA[{$description}]]></content>\n";

Putting this text inside a CDATA block removes the need to convert the characters "<", ">", and "&" to XML entities.

When an Atom element omits the type attribute, it's assumed to be text.

The following PHP code creates XML-safe text for entry titles:

// get the entry's title
$title = $e['title'];
// convert the title to XML-safe text
$title = utf8_encode(htmlspecialchars($title));
// add it to the feed
$output .= "<title>$title</title>\n";

The last difference I had to deal with is Atom's requirement that each entry have a title. Because I haven't written titles for all entries on Workbench, I wrote a function that can create a title from the opening words of an entry's description (capped at 25 characters by default):

function get_text_excerpt($text, $max_length = 25) {
  $text = strip_tags($text);
  if (strlen($text) <= $max_length) {
    return $text;
  }
  $subtext = substr($text, 0, $max_length);
  $last_space = strrpos($subtext, " ");
  if ($last_space === false) {
    return $text;
  }
  return substr($subtext, 0, $last_space);
}

I switched to Atom whole hog, dropping the RSS feed and redirecting requests to the new Atom feed.

I quickly reinstated the RSS feed because I'm getting 4,000 requests a week from subscribers running Radio UserLand, which doesn't support Atom 1.0. Trying to subscribe in the current version, Radio 8.2.1, results in the error message "Can't evaluate the expression because the name 'version' hasn't been defined."

That's the only popular aggregator I've tested that doesn't support Atom 1.0, though I've read that the OPML Editor's River of News also can't handle these feeds.

I'm not going to support both formats on new programming projects just for Radio, because its users ought to nudge UserLand to upgrade Atom support to version 1.0. I'd like to redirect RSS requests to the Atom feed so that all subscribers are seeing the same thing and sites like Bloglines offer one subscription count. But dropping existing RSS support makes little sense.

Atom's content type requirement is a great improvement to syndication, allowing publishers to specify exactly what they're using a feed to carry. The RSS engine built into Microsoft's next version of Windows produces RSS 2.0 feeds that have an extra type attribute in each description, even though it's not defined in the spec.

Settlement Reached with Dave Winer

I've reached an agreement with Dave Winer regarding the Share Your OPML web application. I destroyed his original code and user data along with everything that was built from it and gave up my claim to a one-third stake in feeds.scripting.com. He gave up the claim that he's owed $5,000.

I originally hoped one of us would buy the other out and launch the application, but we found a much stronger basis for agreement in a mutual desire to stop working together as quickly as possible.

If Share Your OPML was a Java project I would've been heartsick to destroy it, but I coded the application in PHP. I've never written anything in PHP I didn't want to completely rewrite six months later.

Some people think I'm an asshat for taking this public, and I won't argue with that, but I don't have the resources to fight an intellectual property lawsuit against a millionaire. Winer knows this -- he's been a guest in my home -- and it's clear his attorney was acting from the same assumption throughout the settlement negotiation.

I decided the best way to avoid court was to show Winer what it would be like to sue a blogger.

I figured the publicity would be a stronger motivator to resolve the matter than anything I could say through an attorney. He's one of the most galvanizing figures in the technology industry. If he ever sues someone, the publication of the case's motions and depositions will put a blog in the Technorati Top 100. Since publishing the letter from Winer's attorney, my traffic's through the roof, I'm getting fan mail and I received three programming job offers.

I'm extremely grateful for the public support and the offers to contribute to a legal defense fund on my behalf, which I was hoping might lead to a Free Kevin-style sticker-based political movement.

Some programmers have said that I was foolish to write the app on the basis of a verbal agreement, and I'll concede that wholeheartedly. I won't even do the laundry now without something in writing.

I'm not going to close the book on this debacle with any Panglossian happy talk about how it all worked out for the best. This was a completely unnecessary sphincter-fusing legal dispute that could have been settled amicably months ago without benefit of counsel.

But I'm glad to stop pursuing an application so closely associated with OPML, because I don't share Winer's enthusiasm for the format.

I used to feel differently, but now that I've worked with it extensively, OPML's an underspecified, one-size-fits-all kludge that doesn't serve a purpose beyond the exchange of simple data. There's little need for an XML dialect to represent outlines. Any XML format is a hierarchy of parent-child relationships that could be editable as an outline with a single addition: a collapsed attribute that's either true or false.

Developers who build on OPML will encounter a lot of odd data because the format has been extended in a non-standard way. An outline item's type attribute has a value that indicates the other attributes which might be present. No one knows how many different attributes are in use today, so if you tell users that your software "supports OPML," you're telling them you support arbitrary XML data that can't be checked against a document type definition.

OPML's also the only XML dialect I'm aware of that stuffs all character data inside attributes. Now that OPML's being turned into a weblog publishing format, outline items will have ginormous attribute values holding escaped HTML markup like this:

<outline text="&lt;img src="http://images.scripting.com/archiveScriptingCom/2006/03/16/chockfull.jpg" width="53" height="73" border="0" align="right" hspace="15" vspace="5" alt="A picture named chockfull.jpg"&gt;&lt;a href="http://scobleizer.wordpress.com/2006/03/16/the-new-a-list/"&gt;Scoble laments&lt;/a&gt; all the flamers in the thread on &lt;b style="color:black;background-color:#ffff66"&gt;Rogers Cadenhead's&lt;/b&gt; site, but isn't it obvious that the &lt;i&gt;purpose&lt;/i&gt; of his post was to get a flamewar going? What non-flamer is going to post in the middle of a festival like that one? I'm not as worried about it as Scoble is, because I've seen better flamewars and I know how they turn out. In a few days he's still going to have to try to resolve the matter with me, and the flamers will have gone on to some other trumped-up controversy. The days when you could fool any number of real people with a charade like this are long past. And people who use pseudonyms to call public figures schoolyard names are not really very serious or threatening. &lt;a href="http://allied.blogspot.com/2006/03/lynch-mob-security.html"&gt;Jeneane Sessum&lt;/a&gt; is right in saying it's extreme to call this a lynch mob. It's just a bunch of &lt;a href="http://www.cadenhead.org/workbench/news/2881/letter-dave-winers-attorney#46458"&gt;anonymous comments&lt;/a&gt; on a snarky blog post. Big deal. Not.&nbsp;&lt;a href="http://www.scripting.com/2006/03/16.html#When:11:21:10PM"&gt;" created="Tue, 16 March 2006 11:21:10 GMT"/>

I'd be amazed if XML parsers can handle attribute values of any length, but that's what's being done today with OPML.

Now that an agreement has been reached, Winer doesn't have to share Share Your OPML and I can flee in terror before any border skirmishes lead to another XML specification war.

Maybe this is the best of all possible worlds.

Update: Winer appears to have launched a new PHP-based implementation of Share Your OPML with Dan MacTough.

Tracking Click Pings with PHP/MySQL

Earlier this week, Mozilla Firefox developer Darin Fisher announced that test builds of the browser include support for click pings, an experimental new HTML feature that makes it easier for web sites to track clicks on outgoing links:

I'm sure this may raise some eye-brows among privacy conscious folks, but please know that this change is being considered with the utmost regard for user privacy. The point of this feature is to enable link tracking mechanisms commonly employed on the web to get out of the critical path and thereby reduce the time required for users to see the page they clicked on.

Click pings work in web page markup by specifying one or more URLs in a link's ping attribute (an unofficial addition to HTML):

<a href="http://cnn.com" ping="http://drudge.com/receive-click-ping.php?url=http://cnn.com">Visit CNN</a>

When you click such a link using a development build of Firefox, the browser requests the ping link in the background as it loads the linked page. These pings can produce click usage reports.
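A receiving script boils down to validating the pinged URL and recording it somewhere; here's a hedged sketch that logs to a flat file rather than Poplink's MySQL tables:

```php
// hedged sketch of a click-ping receiver; Poplink's real code records to MySQL
function record_click_ping($url, $log_file) {
    // accept only absolute http URLs (the ping attribute carries plain links)
    $parts = @parse_url($url);
    if (!is_array($parts) || !isset($parts['scheme']) || $parts['scheme'] != 'http') {
        return false;
    }
    // append one line per ping; a report can tally the most popular links
    $fp = fopen($log_file, "a");
    fwrite($fp, date("Y-m-d H:i:s") . "\t" . $url . "\n");
    fclose($fp);
    return true;
}

// called from the ping URL as: record_click_ping($_GET['url'], '/path/to/pings.log');
```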

I've created a new PHP class library, Poplink, that can receive click pings and report on the most popular links. It's released under the GPL and requires MySQL.

Mozilla's being hammered by privacy advocates since Fisher broke the news -- Chris Messina of Flock writes, "I feel like a piece of me is dying as a result of this."

Don't believe the gripe: Click pings are an improvement on the present situation. Any web publisher already can track clicks using HTTP redirects, and many do -- all ad brokers use the technique to track clickthroughs. This is a clumsy process that puts a click-counting script between the originating page and the destination, causing links to point to local scripts rather than their real destinations.

If click pings are adopted by browser developers, web users desiring more privacy could turn off these pings like they turn off pop-ups and referrer tracking, gaining a measure of control that's not available to them today. This also has the side effect of improving Google, which gets more real links and fewer redirect scripts fed to its almighty algorithm.

Spammer Messes with My Headers

A few weeks ago, I mistakenly believed that I had closed a PHP mail form vulnerability that let spammers use my web server to send mail.

Another batch of penis enlargement and phentermine pitches were sent through my server last night, which I discovered when "rejected bulk e-mail" bounces found their way to me. A spammer exploited a mail script I had written that coded the recipient address like this:

$recipient = "info@ekzemplo.com";

I thought the script was secure because users couldn't change the recipient. As it turns out, there's another giant programming blunder in these lines of code:

$name = stripslashes($_REQUEST['name']);
$email = $_REQUEST['email'];
$subject = $_REQUEST['subject'];
$comments = $_REQUEST['comments'];

mail($recipient, $subject, $comments, "From: ".$name." <".$email.">\r\n" . "Reply-To: ".$email."\r\n");

As I learned last night, plugging user-generated fields into PHP's mail function leaves you susceptible to header injection, a technique that sends multi-line input to any field on a web form in the hope that each line will be interpreted as a mail header.

A spammer in Seoul, Korea, sent the following as the name field when calling the script:

to
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: transcribe
bcc: charleselegb@aol.com

The script interpreted each of those lines as a real e-mail header, so charleselegb@aol.com received an e-mail with the text "2977b873a006112f1567c66ac468690a", which I'm guessing is encrypted text that identifies my server and script. The spammer's running software that crawls the web and hits forms, noting any that successfully send mail back to a test account. A Google search for that e-mail address shows Charles has been busy.

I've written a function that removes multi-line input -- identified by newline and return characters -- and prevents a spammer from defining multiple e-mail recipients:

function sanitize_request_input($input) {
  if (eregi("\r", $input) || eregi("\n", $input)) {
    $input = "";
  }
  $input = str_replace(";", " ", $input);
  $input = str_replace(":", " ", $input);
  $input = str_replace(",", " ", $input);
  $fields = explode(" ", $input);
  return $fields[0];
}

The function's called on any single-line text field contained on a form -- using it on multi-line textarea input would wipe out the text. The fix seems to have deterred Charles, who thoughtfully tried the exploit a few more times after I uploaded the new script, so I am again declaring victory over net abuse.

This is part two in an ongoing series.

UserLand Frees Up Manila Servers

UserLand Software is discontinuing free Manila hosting, as I discovered last week when one of their users sought refuge on Buzzword.Com. Edit This Page shut down free service on Dec. 1 and ManilaSites will do the same Dec. 31.

I can offer free hosting on Buzzword, but webloggers who are committed to publishing with Manila should be advised that I'm migrating the server to new software by May 1, 2006. A better long-term option for those folks is to subscribe to Weblogger.Com or UserLand.

(As an aside, if you're a fan of a long-running blog on one of those servers, this would be the ideal time to donate a year's hosting. Moving a weblog in a hurry is a huge pain in the ass.)

I've found in 18 months of running Manila that I'm genuinely bad at it. Server uptime has been lousy, because you have to know enough to counter the enormous amount of abuse that comment and referral spammers dish out on a weblog server hosting 3,000 users. Every month or so, I get another "it's not you, it's me" letter from a Buzzword user who wants to break up but is afraid to sound ungracious. The most recent were David Golding and Julian on Software, and I'm pretty sure that Craig Jensen wants to start seeing other people.

I have a fighting chance against net abuse on a Linux box running Apache, MySQL and PHP, because I've been hacking away on one for more than five years. I knew I had reached a significant milestone in my quest for m4d sk1llz last spring when Workbench survived 500,000 hits in two days.

Next year, Buzzword will become an ad-supported free weblog host running entirely on Linux and other open source software. WordPress has a new multi-user version that's currently being beta tested. I suspect it will be the publishing tool that I choose.

Sites that are still on Buzzword at the time of the upgrade next year will be automatically migrated, so publishers can see whether they should stick around. I moved a Manila weblog to WordPress this weekend for a work project and it was easy -- the software supports RSS 2.0 as an import format.

In the meantime, Buzzword users may experience outages of unexplained origin for indeterminate length.

Closing a PHP Mail Form Vulnerability

I wrote a PHP script that accepts e-mail from web site visitors using a feedback form. The script works with different sites, routing mail to the right inbox with a hidden field on the form:
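The hidden field's markup appears to have been lost in formatting; it would have looked something like this (the value shown is illustrative):

```html
<input type="hidden" name="who" value="rogers" />
```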

The who field doesn't specify an e-mail address, because that would be easy pickings for spammers. They crawl the web looking for e-mail scripts that can be configured to send e-mail to any recipient they specify.

Instead, my script was written to send mail only to accounts on my server:

$recipient = $_REQUEST['who'];

if ($recipient == "") {
    $recipient = "rogers@ekzemplo.com";
} else {
    $recipient = $recipient . "@ekzemplo.com";
}

Recently, a spammer found a way to make my script send e-mail to anyone on any server, generating hundreds of spams on my machine over a space of four days.

I'm curious to see if any programmers spot the giant honking vulnerability in the preceding code that I missed for months.

My Reign as the King of Pings

I've been running Weblogs.Com since June for Dave Winer, who wanted to see if service performance could be improved as he began to receive seven-digit inquiries about selling it.

Weblogs.Com ran on Frontier for six years from its founding in 1999, handling the load reasonably well until the number of pings topped one million per day within the last year.

In a frenzied weekend, I recoded the site as an Apache/MySQL/PHP web application running on a Linux server, writing all of the code from scratch except for XML-Simple, an XML parsing library I adapted from code by Jim Winstead. Hosting was provided by ServerMatrix, which charges around $80-$140/month for a dedicated server running Red Hat Enterprise Linux 3 with a 1,200-gigabyte monthly bandwidth limit.

On an average day, my application served 34.65 gigabytes of data, took 1.1 million pings and sent 11,000 downloads of changes.xml, a file larger than 1 megabyte. The LAMP platform is ideal for running a high-demand web application for as little money as possible.

When Dave rerouted Weblogs.Com to my new server and it instantly deluged the box with more than a dozen pings per second, I felt like Lucy Ricardo pulling chocolates off the conveyor belt.

The server ran well, crashing only a few times over four months because of a spammer sending thousands of junk pings per minute. Every few days, I used the iptables firewall to block requests from the IP addresses of the worst abusers.

Business reporter Tom Foremski and others have suggested that the Weblogs.Com sale might reveal a lack of faith in blogging as a business.

I think the sale was motivated by the realization that the demands of running Weblogs.Com had become much too large for Dave's one-man company. He could either hire people and start pursuing revenue opportunities or sell the service.

VeriSign got a good deal acquiring it for a reported $2 million. The company's now at the center of the blogosphere, a giant web application and information network with more than 15 million users, and ought to be able to leverage those pings into new services built on XML, XML-RPC and RSS.

One thing I'd like to see is a real-time search engine built only on the last several hours of pings, which could be a terrific current news service if compiled intelligently. While I was running Weblogs.Com, I wanted to use my brief moment as the king of pings to extend the API, which VeriSign appears to be considering, but Dave didn't want to mess with things while companies were loading a truck with money and asking for directions to his house.

I want to pursue these ideas, either independently or in concert with VeriSign and Yahoo Blo.gs. No knock intended, but big companies tend to sit on purchases like this rather than implementing new features. Blogger still lacks category support two years after being purchased by Google, an omission so basic you have to wonder whether it's serious about fending off competition from Six Apart, UserLand, and WordPress.

Displaying XML Data with PHP

I recently finished writing Sams Teach Yourself Programming with Java in 24 Hours, the fourth edition of an introductory book for Java programmers, which comes out in around two weeks.

I've been given wide editorial license with the book, so it contains unusual projects like Lottorobics, a lottery simulation applet that demonstrates why "Win the Lotto" is a terrible retirement plan.

The new edition adds chapters on XML and XML-RPC that use XOM and Apache XML-RPC, two great open source class libraries for Java. Programming projects in these chapters enable user lottery results to be tracked -- the applet sends the data to an XML-RPC server, which stores them in an XML file. In the years I've been offering the applet, users have won the lottery once in 4,877 simulated years of playing at a cost of $5.4 million.

I wanted to display this lottery data on the web using PHP, so I've adapted some code for this purpose. The first release of the code, which I'm offering under the open source GNU General Public License, shows how to read simple XML data with PHP.
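I haven't reproduced the released code here, but reading simple XML data in PHP 4 typically goes through the bundled expat functions; a sketch with made-up element names:

```php
// hedged sketch: parse simple XML with PHP's bundled expat functions
// (the element names are illustrative, not the lottery data's real format)
function parse_simple_xml($xml) {
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
    xml_parse_into_struct($parser, $xml, $values);
    xml_parser_free($parser);
    return $values;
}

$xml = "<results><years>4877</years><cost>5400000</cost></results>";
foreach (parse_simple_xml($xml) as $element) {
    // each "complete" entry is an element with its character data
    if ($element['type'] == 'complete') {
        echo $element['tag'] . ": " . $element['value'] . "\n";
    }
}
```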

If you do anything interesting with the code, and you don't mind releasing your work under the GPL, I'd like to add a few more sample programs to this project.