Regular expressions are beautifully ugly

For several years, I've been unable to find a suitable Web server log statistics program for this server, which hosts several dozen virtual domains for myself and a few friends and relatives.

The commercial options such as WebTrends and Wusage cost more than I want to pay for a server-wide solution. The open-source and free-beer programs I have found are either skimpy on stats or can't handle sites that get millions of hits a year.

I've decided to write my own program in Java, a project I'm naming Logfreak. The initial goal is to write an application and class library that can read logs in Apache common and combined log formats and store statistics in a JDBC or ODBC database. Once that works, the stats can be retrieved for presentation on a JavaServer Pages or PHP front end.

The first thing I've learned is that I'll never deal with text again without using regular expressions. A pattern-matching expression like this isn't pretty to look at:

^(.+)\s(.+)\s(.+)\s\[(.+)\]\s"(.+)"\s(.+)\s(.+)\s"(.*)"\s"(.*)"$

However, when it pulls nine fields out of a server log entry like this one:

209.240.205.61 - - [11/Dec/2003:15:17:04 -0500] "GET /visit.php HTTP/1.1" 302 5 "http://www.uroulette.com/" "Mozilla/3.0 WebTV/1.2 (compatible; MSIE 2.0)"

without being hosed by goofy user-supplied referrer and user-agent strings, I can appreciate the beauty of such ugly syntax.
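Here is a minimal sketch of how the pattern could be tried out with java.util.regex. It assumes a single \s between the status and byte-count groups, and the class name CombinedLogLine and its parse method are made up for illustration; this is not Logfreak's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class CombinedLogLine {
    // The pattern from this post, escaped for a Java string literal.
    private static final Pattern COMBINED = Pattern.compile(
        "^(.+)\\s(.+)\\s(.+)\\s\\[(.+)\\]\\s\"(.+)\"\\s(.+)\\s(.+)\\s\"(.*)\"\\s\"(.*)\"$");

    // Returns the captured groups in order, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match, so skip it
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "209.240.205.61 - - [11/Dec/2003:15:17:04 -0500] "
            + "\"GET /visit.php HTTP/1.1\" 302 5 \"http://www.uroulette.com/\" "
            + "\"Mozilla/3.0 WebTV/1.2 (compatible; MSIE 2.0)\"";
        String[] fields = parse(line);
        if (fields == null) {
            System.out.println("No match");
            return;
        }
        // Labels follow the usual order of the Apache combined log format.
        String[] labels = {"host", "ident", "user", "time", "request",
                           "status", "bytes", "referrer", "user-agent"};
        for (int i = 0; i < fields.length; i++) {
            System.out.println(labels[i] + ": " + fields[i]);
        }
    }
}
```

The groups are greedy, so the matcher backtracks until the brackets and quotes line up; on the sample entry that leaves the spaces inside the request and user-agent strings where they belong.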

Comments

Analog didn't cut the mustard for you?

http://www.analog.cx/

Ditto on Analog, it's worked on every log file I've thrown at it, including one that had over 4 million visitors in a month.

In comparison, WebTrends constantly causes me terrible headaches...

I know you are programming it in Java, but you threw up the regular expression, sooo...

There's a Perl module called Apache::ParseLog. It can do the heavy lifting for you.

Thanks for the tips. I hadn't heard of either one of these. A spot check of Analog makes me wonder if it breaks down hits on a page-by-page basis. I need this; I've been getting it from an out-of-date and Y2K-non-compliant copy of MkStats for years.
