Recently, on a Java programming project, I was struggling with a huge XML file that was supposed to be encoded in UTF-8 but contained some byte sequences that are not valid UTF-8. That prevented it from being parsed by XOM, an open-source XML processing library that militantly rejects XML that is not well-formed.
One of the things that helped me solve the problem was Kris Wehner's excellent weblog post on dealing with badly encoded character data using Java. His first suggestion is to use native2ascii, one of the lesser-known tools in Sun's Java 2 SDK:
native2ascii is the tool that converts from the binary UTF-8 encoding to an ASCII encoding with escapes, so it looks like \uXXXX whenever there should be a non-ASCII character.
I had forgotten about this program, which turned 134 megabytes of data with an unknown character encoding into a plain ASCII file (and therefore a valid UTF-8 file) with a single command:
native2ascii -encoding "UTF-8" badOldFile goodNewFile
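
For anyone who would rather do the same kind of cleanup from inside a Java program, here is a minimal sketch of the idea. It is not native2ascii's actual implementation; the class name AsciiEscaper and its two methods are made up for illustration, and replacing undecodable bytes with U+FFFD is my assumption about a reasonable policy rather than something taken from the tool's documentation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class AsciiEscaper {

    // Decode bytes as UTF-8, substituting U+FFFD for anything malformed
    // instead of failing (assumed policy, not taken from native2ascii).
    static String decodeLeniently(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Should not happen when both error actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    // Keep 7-bit ASCII as is; write every other UTF-16 code unit as a
    // backslash-u escape with four lowercase hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "caf", then e-acute as valid UTF-8 (C3 A9), a space, and a stray FF byte.
        byte[] raw = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9, ' ', (byte) 0xFF};
        System.out.println(escapeNonAscii(decodeLeniently(raw)));
    }
}
```

Run as written, this should print caf\u00e9 \ufffd: the valid two-byte sequence becomes an escape for the accented letter, and the stray byte becomes an escape for the replacement character. For a file the size of the one I was dealing with, you would want to stream through a Reader and Writer rather than hold everything in memory, which this sketch does not attempt.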
