One of the things that helped me solve the problem was Kris Wehner's excellent weblog post on dealing with badly encoded character data using Java. His first suggestion is to use native2ascii, one of the lesser-known tools in Sun's Java 2 SDK:
native2ascii is the tool that converts from the binary UTF-8 encoding to an ASCII encoding with escapes, so it looks like \uXXXX whenever there should be a non-ASCII character.
I had forgotten about this program, which turned 134 megabytes of data in an unknown character encoding into a clean, ASCII-only file (and thus valid UTF-8) with a single command:
native2ascii -encoding "UTF-8" badOldFile goodNewFile
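The character-level transformation native2ascii performs is simple to picture. Here's a minimal sketch of the idea in plain Java; the class and method names are my own, not part of the JDK tool, and the real tool handles encodings and I/O that this sketch omits:

```java
// Sketch of native2ascii's core transformation: any character outside
// 7-bit ASCII is replaced with a \uXXXX escape. AsciiEscaper and
// asciiEscape are illustrative names, not JDK APIs.
public class AsciiEscaper {
    static String asciiEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                out.append(c);            // plain ASCII passes through
            } else {
                // non-ASCII becomes a backslash-u escape of the UTF-16 code unit
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // é is U+00E9, so it comes out as \u00e9
        System.out.println(asciiEscape("café"));
    }
}
```

The escaped output is pure ASCII, which is why it survives tools that mangle anything above code point 127.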
It was supposed to be UTF-8, according to some spotty documentation written by the programmers at Chef Moz, but it contained dozens if not hundreds of byte sequences that weren't legal UTF-8.
I couldn't figure out the actual character set. The data's filled with all kinds of bad characters -- control characters, nulls, and so on -- so I'm a bit mystified by the software used to collect the data from Web forms and create an XML output file.
This isn't just XOM. Any XML parser would choke on this. native2ascii might let you get the data in, but you'd want to check the results to find out where the bad characters were in the first place, and what they were doing there. You might have lost something. Are you sure the data was indeed UTF-8 and not something else?
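If you do want to hunt down the offenders before (or after) converting, the XML 1.0 Char production tells you exactly what a parser will reject: anything other than tab, LF, CR, and the ranges #x20-#xD7FF, #xE000-#xFFFD, and #x10000-#x10FFFF. A small scanner along these lines would flag the nulls and control characters; the class and method names here are mine, invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class XmlCharChecker {
    // True when cp is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Return the offset and code point of every character a parser would reject.
    static List<int[]> findBadChars(String s) {
        List<int[]> bad = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                bad.add(new int[] { i, cp });
            }
            i += Character.charCount(cp);
        }
        return bad;
    }

    public static void main(String[] args) {
        String data = "ok\u0000more\u0007text"; // embedded null and BEL
        for (int[] hit : findBadChars(data)) {
            System.out.printf("offset %d: U+%04X%n", hit[0], hit[1]);
        }
    }
}
```

Running something like this over the raw data, decoded with a few candidate encodings, is also a cheap way to test the "was it really UTF-8?" question: the encoding that produces the fewest illegal characters is usually the right guess.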