| UTF-16 | |
|---|---|
| Name | UTF-16 |
| Alias | UCS-2 successor |
| Developed | Unicode Consortium |
| Introduced | 1996 (Unicode 2.0) |
| Classification | Variable-length encoding |
| Code units | 16-bit |
| Max code point | U+10FFFF |
| Status | Widely used |
UTF-16 is a character encoding form based on 16-bit code units, designed by the Unicode Consortium as part of the Unicode character encoding architecture. It encodes the repertoire defined by the Universal Coded Character Set (UCS) and interoperates with technologies developed by ISO/IEC JTC 1 and ECMA International; implementers include projects and organizations such as Microsoft, Apple Inc., Google, Oracle Corporation, and the Mozilla Foundation. UTF-16 balances compactness for many historic and modern scripts with the ability to represent additional characters introduced by later revisions of ISO/IEC 10646 and the Unicode Standard.
UTF-16 represents characters as one or two 16-bit code units drawn from the Unicode code space defined by the Unicode Standard, compatible with the Universal Coded Character Set managed by ISO/IEC JTC 1/SC 2. The encoding evolved from the fixed-width UCS-2 in response to the need to represent supplementary characters added through processes overseen by the Unicode Consortium and standards activities such as revisions of ISO/IEC 10646. Major software platforms and interfaces, including operating systems such as Windows NT, libraries like ICU (International Components for Unicode), and APIs from POSIX-related projects, support UTF-16 alongside other encodings.
UTF-16 maps each Unicode code point to either a single 16-bit code unit or a pair of 16-bit code units, using algorithms specified in the Unicode Standard annexes and ISO/IEC 10646. Single-unit encoding covers the Basic Multilingual Plane (BMP), which includes scripts and symbols cataloged by scholarly bodies and institutions such as the Library of Congress and UNESCO for cultural heritage. Supplementary characters beyond U+FFFF are encoded with surrogate pairs computed using arithmetic defined in the standard; implementers follow algorithms similar to those referenced by W3C specifications and programming language standards such as ECMA-262 and ISO/IEC 14882. The encoding preserves a direct numeric relationship with code points in the BMP, enabling relatively straightforward indexing in environments provided by companies like Sun Microsystems and tools like GDB.
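The encoding arithmetic can be sketched in Python; `to_utf16_units` is an illustrative helper, not part of any standard library:

```python
def to_utf16_units(cp: int) -> list[int]:
    """Encode a Unicode scalar value as a list of UTF-16 code units."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: U+{cp:04X}")
    if cp <= 0xFFFF:
        return [cp]                    # BMP: one 16-bit unit, numerically equal to the code point
    v = cp - 0x10000                   # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)          # top 10 bits select the high surrogate
    low = 0xDC00 + (v & 0x3FF)         # bottom 10 bits select the low surrogate
    return [high, low]

# U+1F600 (a supplementary-plane emoji) encodes as the pair [0xD83D, 0xDE00]
print([hex(u) for u in to_utf16_units(0x1F600)])
```

This makes the "direct numeric relationship" in the BMP concrete: a BMP code point is its own code unit, while supplementary code points require the two-step surrogate computation.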
Because UTF-16 uses 16-bit code units, byte order matters on processors from vendors such as Intel Corporation, ARM Holdings, and Motorola, and on architectures such as PowerPC. A Byte Order Mark (BOM), U+FEFF, is optionally placed at the start of a stream to signal endianness; authors and implementers must consider recommendations from organizations like the IETF and W3C regarding the BOM in protocols including HTTP and in file formats such as those of Microsoft Office and the OpenDocument Format. When files traverse networks or storage systems maintained by entities like Amazon Web Services or Google Cloud Platform, correct handling of little-endian and big-endian representations is essential for interoperability with platforms like Linux distributions and macOS.
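A minimal BOM check might look like the following sketch (the helper name is illustrative; real decoders also fall back to protocol metadata or heuristics when no BOM is present):

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Inspect the first two bytes of a stream for a UTF-16 BOM (U+FEFF)."""
    if data.startswith(b"\xff\xfe"):
        return "little-endian"
    if data.startswith(b"\xfe\xff"):
        return "big-endian"
    return "unknown"  # no BOM: the byte order must come from out-of-band information

# Build sample streams explicitly so the result does not depend on the host's byte order.
le_stream = "\ufeffA".encode("utf-16-le")   # bytes FF FE 41 00
be_stream = "\ufeffA".encode("utf-16-be")   # bytes FE FF 00 41
```

Note that U+FEFF serializes as `FF FE` in little-endian order and `FE FF` in big-endian order, which is what makes the mark usable as an endianness signal.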
Supplementary planes, including the Supplementary Multilingual Plane and the Supplementary Ideographic Plane established by proposals reviewed by the Unicode Consortium and ISO/IEC JTC 1/SC 2, contain historic, rare, and emoji characters whose code points exceed U+FFFF. UTF-16 represents these using surrogate pairs: a high surrogate from the range U+D800–U+DBFF followed by a low surrogate from U+DC00–U+DFFF, an approach standardized by the Unicode Technical Committee and discussed in implementations from Microsoft and IBM. Proper handling of surrogate pairs is critical in text processing libraries like libxml2 and glibc, and in language runtimes such as the Java Virtual Machine, the .NET Framework, and Python implementations.
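Decoding a surrogate pair reverses the encoding arithmetic; a sketch with a hypothetical helper name:

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high/low surrogate pair into a supplementary-plane code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Recover the 10-bit halves and add back the 0x10000 offset.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair 0xD83D, 0xDE00 decodes to U+1F600
print(hex(decode_surrogate_pair(0xD83D, 0xDE00)))
```

Because high and low surrogates occupy disjoint ranges, a decoder can tell from a single code unit whether it is at the start, middle, or outside of a pair, which is what makes UTF-16 self-synchronizing at the code-unit level.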
Compared with UTF-8, an encoding favored by the Internet Engineering Task Force and widely used in Linux and web ecosystems, UTF-16 offers different trade-offs: UTF-16 is often more compact than UTF-8 for East Asian scripts cataloged by institutions like the National Institute of Standards and Technology and the China National Information Center, while UTF-8 is byte-oriented and ubiquitous in protocols standardized by the IETF and on platforms run by Google and Facebook. UTF-32, which uses one 32-bit code unit per code point and was standardized in ISO/IEC 10646, is simpler for random access but less space-efficient; HPC environments and research projects sometimes prefer UTF-32 for unambiguous indexing. Standards bodies and vendors, including the Unicode Consortium and platform vendors such as Microsoft and Apple Inc., publish guidance on choosing encodings.
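The size trade-off can be observed directly with Python's built-in codecs; the sample strings below are illustrative:

```python
samples = {
    "ASCII": "hello",      # 1 byte/char in UTF-8, 2 bytes/char in UTF-16
    "CJK":   "文字符号",    # 3 bytes/char in UTF-8, 2 bytes/char in UTF-16 (BMP)
    "emoji": "😀",         # 4 bytes in UTF-8, UTF-16, and UTF-32 alike
}
for label, text in samples.items():
    print(label,
          len(text.encode("utf-8")),
          len(text.encode("utf-16-le")),   # the -le variant emits no BOM
          len(text.encode("utf-32-le")))
```

For BMP East Asian text, UTF-16 uses two bytes per character where UTF-8 uses three, which is the compactness advantage the comparison above refers to; for ASCII-heavy text the advantage reverses.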
UTF-16 has seen extensive adoption in major operating systems, application frameworks, and file formats—Windows NT APIs historically used 16-bit code units, while runtime environments like Java Platform, Standard Edition and .NET use UTF-16-based internal string representations. Document formats and protocols—from XML recommendations by W3C to office formats by Microsoft Office and OpenDocument Format—support UTF-16. Interoperability work by organizations such as IETF, W3C, and Unicode Consortium ensures that converters, libraries, and tools from vendors including Oracle Corporation, Google, and Mozilla Foundation can transcode between UTF-16 and encodings like UTF-8.
Implementations of UTF-16 appear in language runtimes (for example, the Java Virtual Machine and the .NET Framework), text libraries such as ICU, rendering engines like Blink and WebKit, and database systems including MySQL and PostgreSQL, which provide Unicode-capable collations. Common issues include incorrect handling of unpaired surrogates, mishandled BOMs, and indexing mistakes in APIs maintained by vendors such as Microsoft and projects like Boost. Security advisories from organizations including CERT and NIST document vulnerabilities arising from improper UTF-16 processing, prompting fixes in compilers, interpreters, and networking stacks from projects like GCC and OpenSSL.
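A validity scan for the unpaired-surrogate problem mentioned above might be sketched as follows (the helper name is hypothetical):

```python
def has_unpaired_surrogate(units: list[int]) -> bool:
    """Scan a sequence of UTF-16 code units for lone or misordered surrogates."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be immediately followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return True
            i += 2
            continue
        if 0xDC00 <= u <= 0xDFFF:
            return True  # low surrogate with no preceding high surrogate
        i += 1
    return False
```

Conforming processors must reject or replace such ill-formed sequences rather than pass them through, since smuggled lone surrogates are a recurring source of the security issues noted above.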