Imagine you’ve got some text you’ve been told is ASCII, and you’ve
told Java that it’s ASCII using:
Reader reader = new InputStreamReader(inputstream, "ASCII");
Imagine your surprise when it happily reads in non-ASCII bytes, say
UTF-8 or ISO-8859-1, and silently substitutes each bad byte with the
Unicode replacement character (U+FFFD).
import java.io.*;

public class Example1 {
    public static void main(String[] args) {
        try {
            FileInputStream is = new FileInputStream(args[0]);
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, args[1]));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
beebo david% java Example1 utf8file.txt ascii
I��t��rn��ti��n��liz��ti��n
beebo david% java Example1 utf8file.txt utf8
Iñtërnâtiônàlizætiøn
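The same silent substitution happens outside streams too. As a minimal sketch (using the `String(byte[], Charset)` constructor, which the javadoc says always replaces malformed input with the charset's default replacement string), decoding UTF-8 bytes as ASCII yields U+FFFD rather than an error:

```java
import java.nio.charset.Charset;

public class ReplaceDemo {
    public static void main(String[] args) {
        // 0xC3 0xB1 is the UTF-8 encoding of 'ñ'; as ASCII, each byte is malformed
        byte[] utf8Bytes = {'n', (byte) 0xC3, (byte) 0xB1};
        String asAscii = new String(utf8Bytes, Charset.forName("US-ASCII"));
        // Each malformed byte decodes to U+FFFD, the replacement character
        System.out.println((int) asAscii.charAt(1)); // 65533, i.e. U+FFFD
    }
}
```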
So, I hear you ask, how do you get Java to be strict about the
conversion? The answer is to look up a Charset object, ask it for a
CharsetDecoder, and set its onMalformedInput option to
CodingErrorAction.REPORT. The resulting code is:
import java.io.*;
import java.nio.charset.*;

public class Example2 {
    public static void main(String[] args) {
        try {
            FileInputStream is = new FileInputStream(args[0]);
            Charset charset = Charset.forName(args[1]);
            CharsetDecoder csd = charset.newDecoder();
            csd.onMalformedInput(CodingErrorAction.REPORT);
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(is, csd));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
This time when we run it, we get:
beebo david% java Example2 utf8file.txt ascii
java.nio.charset.MalformedInputException: Input length = 1
beebo david% java Example2 utf8file.txt utf8
Iñtërnâtiônàlizætiøn
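If the bytes are already in memory rather than in a stream, there's a shortcut: the one-shot CharsetDecoder.decode(ByteBuffer) convenience method reports malformed input by default, throwing a CharacterCodingException without any onMalformedInput() call. A sketch, reusing the UTF-8 bytes for 'ñ':

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] utf8Bytes = {'n', (byte) 0xC3, (byte) 0xB1};
        try {
            // decode(ByteBuffer) performs an entire decoding operation and
            // throws on malformed input instead of substituting U+FFFD
            String s = Charset.forName("US-ASCII").newDecoder()
                    .decode(ByteBuffer.wrap(utf8Bytes)).toString();
            System.out.println(s);
        } catch (CharacterCodingException e) {
            System.out.println("malformed input: " + e);
        }
    }
}
```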
On a slightly related note, if anyone knows how to get Java to decode
UTF32, VISCII, TCVN-5712, KOI8-U or KOI8-T, I would love to know.
Update: (2007-01-26) Java 6 has support for UTF32
and KOI8-U.
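To check which encodings a given JVM actually supports, Charset.isSupported(name) tests a single name (aliases included) and Charset.availableCharsets() returns the full map of canonical names; a quick sketch:

```java
import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // isSupported() returns false for legal-but-unavailable names
        for (String name : new String[] {"UTF-32", "KOI8-U", "VISCII"}) {
            System.out.println(name + ": " + Charset.isSupported(name));
        }
        // availableCharsets() maps canonical names to Charset objects
        System.out.println(Charset.availableCharsets().size()
                + " charsets available");
    }
}
```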