Regular expressions are beautifully ugly

For several years, I've been unable to find a suitable Web server log statistics program for this server, which hosts several dozen virtual domains for myself and a few friends and relatives.

The commercial options such as WebTrends and Wusage cost more than I want to pay for a server-wide solution. The open-source and free-beer programs I have found are either skimpy on stats or can't handle sites that get millions of hits a year.

I've decided to write my own program in Java, a project I'm naming Logfreak. The initial goal is to write an application and class library that can read logs in Apache common and combined log formats and store statistics in a JDBC or ODBC database. Once that works, the stats can be retrieved for presentation on a JavaServer Pages or PHP front end.

The first thing I've learned is that I'll never deal with text again without using regular expressions. It isn't pretty to look at a pattern matching expression like this:

^(.+)\s(.+)\s(.+)\s\[(.+)\]\s"(.+)"\s(.+)\s(.+)\s"(.*)"\s"(.*)"$

However, when it pulls 11 elements out of server log entries looking like - - [11/Dec/2003:15:17:04 -0500] "GET /visit.php HTTP/1.1" 302 5 "" "Mozilla/3.0 WebTV/1.2 (compatible; MSIE 2.0)" without being hosed by goofy user-designated referral and user-agent strings, I can appreciate the beauty of such ugly syntax.
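For anyone curious how that looks in Java, here is a minimal sketch using java.util.regex with a pattern of the same shape as the one above. The class name CombinedLogLine and the parse method are made up for illustration and are not part of Logfreak, and the host 192.0.2.1 in the sample line is a placeholder documentation address I added so the line has a host field.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not Logfreak's actual code.
final class CombinedLogLine {

    // Nine capturing groups: host, identity, user, timestamp, request,
    // status, bytes, referrer, user agent. Backslashes and quotes are
    // doubled or escaped because this is a Java string literal.
    private static final Pattern COMBINED = Pattern.compile(
        "^(.+)\\s(.+)\\s(.+)\\s\\[(.+)\\]\\s\"(.+)\"\\s(.+)\\s(.+)\\s\"(.*)\"\\s\"(.*)\"$");

    /** Returns the captured fields in order, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match, so skip it
        }
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder host (192.0.2.1) is an assumption added for the example.
        String line = "192.0.2.1 - - [11/Dec/2003:15:17:04 -0500] "
            + "\"GET /visit.php HTTP/1.1\" 302 5 \"\" "
            + "\"Mozilla/3.0 WebTV/1.2 (compatible; MSIE 2.0)\"";
        String[] fields = parse(line);
        if (fields == null) {
            System.out.println("No match");
            return;
        }
        for (int i = 0; i < fields.length; i++) {
            System.out.println((i + 1) + ": " + fields[i]);
        }
    }
}
```

The pattern is compiled once and reused, which matters when a log has millions of lines. Greedy groups like (.+) lean on backtracking to find the field boundaries, so a stricter variant with \S+ for the space-free fields might be worth trying if parsing turns out to be slow.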


Analog didn't cut the mustard for you?

Ditto on Analog. It's worked on every log file I've thrown at it, including one that had over 4 million visitors in a month.

In comparison, WebTrends constantly causes me terrible headaches...

I know you are programming it in Java, but you threw up the regular expression, sooo...

There's a Perl module called Apache::ParseLog. It can do the heavy lifting for you.

Thanks for the tips. I hadn't heard of either one of these. A spot check of Analog makes me wonder if it breaks down hits on a page-by-page basis. I need this; I've been getting it from an out-of-date and Y2K-non-compliant copy of MkStats for years.
