Recently, on a Java programming project, I was struggling with a huge XML file that was supposed to be encoded as UTF-8 but contained byte sequences that aren't valid UTF-8. This prevented it from being parsed by XOM, an open-source XML processing library that militantly rejects XML that isn't well-formed.

One of the things that helped me solve the problem was Kris Wehner's excellent weblog post on dealing with badly encoded character data using Java. His first suggestion is to use native2ascii, one of the lesser-known tools in Sun's Java 2 SDK:

native2ascii is the tool that converts from the binary UTF-8 encoding to an ASCII encoding with escapes, so it looks like \uXXXX whenever there should be a non-ASCII character.

I had forgotten about this program. With a single command, it turned 134 megabytes of data in an unknown character encoding into an escaped ASCII file, which is also valid UTF-8:

native2ascii -encoding "UTF-8" badOldFile goodNewFile
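Roughly speaking, that command decodes the input as UTF-8 and writes every non-ASCII character as a backslash-u escape. Here is a minimal Java sketch of the same idea, not native2ascii's actual source. The class and method names (`EscapeSketch`, `escapeNonAscii`) are made up for illustration, and the sketch assumes that undecodable bytes are replaced with U+FFFD, which is what `new String(bytes, UTF_8)` does.

```java
import java.nio.charset.StandardCharsets;

public class EscapeSketch {
    // Hypothetical helper: copy 7-bit ASCII through unchanged and write
    // every other char as a backslash-u escape with four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "caf", a valid two-byte UTF-8 sequence for e-acute, a space,
        // and one byte (0xFF) that can never appear in UTF-8.
        byte[] raw = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9, ' ', (byte) 0xFF};
        // The String constructor silently swaps malformed bytes for U+FFFD.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(escapeNonAscii(decoded));
    }
}
```

Running `main` prints `caf\u00e9 \ufffd`. The trailing `\ufffd` is the replacement character standing in for the bad byte, so the original byte value is gone from the output.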

-- Rogers Cadenhead

Comments

It was supposed to be UTF-8, according to some spotty documentation written by the programmers at Chef Moz, but it contained dozens if not hundreds of characters that weren't part of that set.

I couldn't figure out the actual character set. The data's filled with all kinds of bad characters -- control characters, nulls, and so on -- so I'm a bit mystified by the software used to collect the data from Web forms and create an XML output file.
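Control characters and nulls are a separate problem from the encoding: XML 1.0 only allows tab, line feed, carriage return, and the code points from U+0020 up (minus surrogates, U+FFFE and U+FFFF), so a null is illegal even in a correctly encoded file. A small sketch of a check for that, with hypothetical names (`XmlCharCheck`, `isLegalXml10`, `countIllegal`) and assuming the text has already been decoded into a Java string:

```java
public class XmlCharCheck {
    // The Char production from the XML 1.0 specification.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Count the code points that an XML 1.0 parser would have to reject.
    static int countIllegal(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isLegalXml10(cp)) {
                count++;
            }
            // Step by one or two chars so surrogate pairs stay together.
            i += Character.charCount(cp);
        }
        return count;
    }
}
```

Something like this could be run over each chunk of the decoded data to see how many offending characters there are before deciding whether to strip or replace them.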


 

This isn't just XOM. Any XML parser would choke on this. native2ascii might let you get the data in, but you'd want to check the results to find out where the bad characters were in the first place, and what they were doing there. You might have lost something. Are you sure the data was indeed UTF-8 and not something else?
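One way to find out where the bad bytes are is to decode with a `CharsetDecoder` set to report errors instead of replacing them. The sketch below is only an illustration: `BadByteFinder` and `findMalformedOffsets` are invented names, and it takes the whole input as a byte array for simplicity, which a very large file might not allow.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BadByteFinder {
    // Return the byte offset of every sequence that is not valid UTF-8.
    static List<Integer> findMalformedOffsets(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        List<Integer> offsets = new ArrayList<>();
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position sits at the start of the bad sequence.
                offsets.add(in.position());
                // Skip the bad bytes and keep scanning.
                in.position(in.position() + result.length());
            } else if (result.isOverflow()) {
                // Only the offsets matter here, so discard decoded text.
                out.clear();
            } else {
                break; // underflow: all input has been consumed
            }
        }
        return offsets;
    }
}
```

With the offsets in hand, the surrounding bytes can be inspected in a hex viewer to guess at the real encoding before anything is converted.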


 
