unicode - Why does the Java ecosystem use different character encodings throughout their software stack? - Stack Overflow



unicode - Why does the Java ecosystem use different character encodings throughout their software stack? - Stack Overflow

At the time Java was created, the class file format used UTF-8 and the runtime used UCS-2. Unicode had less than 64k codepoints, so 16 bits was enough. Later, when additional "planes" were added to Unicode, UCS-2 was replaced with the (pretty much) compatible UTF-16, and UTF-8 was replaced with CESU-8 (hence "Compatibility Encoding Scheme...").

In the class file format they wanted to use UTF-8 to save space. The design of the class file format (including the JVM instruction set) was heavily geared towards compactness.

In the runtime they wanted to use UCS-2 because it was felt that saving space was less important than being able to avoid the need to deal with variable-width characters. Unfortunately, this kind of backfired now that it's UTF-16, because a codepoint can now take multiple "chars", and worse, the "char" datatype is now sort of misnamed (it no longer corresponds to a character, in general, but instead corresponds to a UTF-16 code-unit).

share|edit|flag

Read full article from unicode - Why does the Java ecosystem use different character encodings throughout their software stack? - Stack Overflow


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts