UTF-8
Name	UTF-8
Standard	Unicode
Classification	Variable-width encoding
Created	1992
Creators	Ken Thompson, Rob Pike
Based on	ASCII

Contents

History and development
Design and features
Encoding scheme
Adoption and use
Comparison with other encodings

UTF-8 is a variable-width character encoding for electronic communication, defined as part of the Unicode standard. It was designed by Ken Thompson and Rob Pike at Bell Labs in 1992 to remain backward compatible with the widely used ASCII encoding. UTF-8 has become the dominant character encoding for the World Wide Web, for operating systems such as Linux and macOS, and for many internationalized protocols.

History and development

The development of UTF-8 was driven by the need for a practical encoding for the expanding Unicode standard, which aims to encompass the characters of all writing systems, including the Latin alphabet, Cyrillic script, and Han characters. Earlier multi-byte encodings, such as UTF-1 and the original UTF-2 proposal, were considered inefficient or incompatible with existing software. The breakthrough came in September 1992 at a diner in New Jersey, where Ken Thompson sketched the design on a placemat with Rob Pike; the two implemented it in the Plan 9 operating system at Bell Labs within days. The design was presented publicly at the USENIX conference in January 1993, was adopted by the Unicode Consortium, and is now specified for Internet use in RFC 3629 (2003), which obsoleted RFC 2279. UTF-8 did not replace UTF-16 and UTF-32, which remain in use, but it is preferred over them for storage and interchange in many applications.

Design and features

A core design principle of UTF-8 is backward compatibility with the 7-bit ASCII character set: every code point from U+0000 to U+007F is encoded as a single byte with the same value, so any valid ASCII text is also valid UTF-8. Code points above U+007F use two to four bytes, none of which falls in the ASCII range. The encoding is self-synchronizing: lead bytes and continuation bytes have distinct bit patterns, so software that starts reading mid-stream can find the next character boundary by skipping at most three continuation bytes, a property that matters for robust data processing. Because UTF-8 is defined as a sequence of bytes, it has no byte-order ambiguity and does not need a byte order mark (BOM), unlike UTF-16 and UTF-32. A BOM (the bytes EF BB BF) is nevertheless permitted, and some Microsoft Windows software writes one at the start of files.
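The two properties described above can be checked with a few lines of Python. This is a minimal sketch rather than a reference implementation; the helper names `is_continuation` and `next_boundary` are hypothetical and chosen only for illustration.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the continuation pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos where a character starts (or len(data))."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# ASCII compatibility: pure ASCII text has identical bytes in both encodings.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Self-synchronization: start inside the 3-byte euro sign and resynchronize.
data = "naïve €5".encode("utf-8")  # 6E 61 C3 AF 76 65 20 E2 82 AC 35
start = next_boundary(data, 8)     # index 8 is a continuation byte of the euro sign
tail = data[start:].decode("utf-8")
print(same_bytes, start, tail)
```

Starting at index 8, the helper skips the two remaining continuation bytes of the euro sign and resumes decoding at the next character, without needing any state from earlier in the stream.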

Encoding scheme

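The encoding scheme is a prefix code in which the high bits of the first (lead) byte indicate how many bytes the character occupies, and every subsequent (continuation) byte begins with the bits `10`. A single-byte sequence begins with a `0` bit and matches ASCII (U+0000 to U+007F). A two-byte sequence has a lead byte beginning with `110` and covers U+0080 to U+07FF, which includes blocks such as the Greek alphabet, Cyrillic, and Arabic script. A three-byte sequence has a lead byte beginning with `1110` and covers U+0800 to U+FFFF, the rest of the Basic Multilingual Plane (BMP), including scripts like Devanagari and most Han characters. A four-byte sequence has a lead byte beginning with `11110` and covers U+10000 to U+10FFFF, the code points outside the BMP, such as most Emoji and historical scripts like Egyptian hieroglyphs. The bits of the code point are distributed, most significant first, over the positions not used by these prefixes. Surrogate code points (U+D800 to U+DFFF) and overlong forms that use more bytes than necessary are invalid. The scheme is defined in the Unicode Standard, in ISO/IEC 10646, and in RFC 3629.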
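The bit layout can be made concrete with a small hand-written encoder. The sketch below is illustrative only: `encode_code_point` is a hypothetical name, and in practice Python's built-in `str.encode("utf-8")` does this work. It is used here only to cross-check the result.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the prefix scheme."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample per sequence length: Latin A, Greek omega, Devanagari na, an emoji.
for ch in ["A", "\u03a9", "\u0928", "\U0001F600"]:
    encoded = encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")  # cross-check against the built-in codec
    print(f"U+{ord(ch):04X}", encoded.hex(" "))
```

The range checks run from smallest to largest, so each code point gets the shortest possible form and overlong encodings are never produced.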

Adoption and use

UTF-8 adoption was accelerated by its endorsement for use in HTML by the World Wide Web Consortium and by the requirement that every XML processor support it. IETF policy (RFC 2277) requires Internet protocols to be able to use UTF-8, and newer formats such as JSON (RFC 8259) mandate it for interchange; in HTTP and MIME the encoding is declared with a `charset=utf-8` parameter rather than assumed by default. Operating systems such as Google's Android, Apple's macOS, and most distributions of Linux use it as their primary encoding. Databases such as MySQL and PostgreSQL and programming languages including Python, Java, and Go provide robust support. Its use is nearly universal on platforms like Twitter and Wikipedia.

Comparison with other encodings

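Compared with other Unicode encodings, UTF-8 is the most space-efficient for text that is mostly ASCII, such as languages written in the Latin alphabet: such text takes about one byte per character, versus two in UTF-16 and four in UTF-32. UTF-16 is more compact for scripts like Japanese (a mix of Hiragana, Katakana, and Kanji), whose characters take three bytes in UTF-8 but two in UTF-16, while UTF-32 always uses four bytes per code point. UTF-8 avoids the byte-order (endianness) question that UTF-16 and UTF-32 must settle with a BOM or with explicit variants such as UTF-16BE and UTF-16LE. It is more complex to decode than single-byte legacy encodings like Windows-1252, but it can represent far more characters. Unlike stateful encodings such as ISO-2022-JP, UTF-8 is stateless: decoding never depends on a mode set by earlier escape sequences. It has largely supplanted older multi-byte encodings like Shift JIS and EUC-JP in international software development.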
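The size trade-off can be measured directly. The following sketch uses Python's standard codecs; the sample strings and the helper name `encoded_sizes` are illustrative assumptions, and the little-endian codec variants are used so that no BOM is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the text in each Unicode encoding form (no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


english = "Hello, world"      # 12 ASCII characters
japanese = "こんにちは世界"    # 7 characters, all in the BMP

print(encoded_sizes(english))   # UTF-8 is smallest for ASCII-only text
print(encoded_sizes(japanese))  # UTF-16 is smallest for this Japanese text
```

For mixed content such as HTML, where markup is ASCII, the balance may tip toward UTF-8 even when the visible text is Japanese; the result depends on the actual document.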

Category:Character encoding Category:Unicode Category:Internet standards