Shift JIS — LLMpedia

Shift JIS
Name	Shift JIS
Standard	JIS X 0208, JIS X 0213
Classification	Extended ASCII, Double-byte character encoding
Based on	JIS X 0201
Extended to	Windows code page 932, MacJapanese, Shift_JISX0213

Contents

History and development
Technical details
Character set
Usage and adoption
Compatibility and issues

Shift JIS is a character encoding for the Japanese language, originally developed by the ASCII Corporation and later standardized by the Japanese Standards Association. It was designed to remain backward compatible with the single-byte JIS X 0201 standard while adding the double-byte JIS X 0208 character set. The encoding became the basis for Microsoft's Japanese code page (code page 932) and saw widespread use on personal computers and in early internet systems.

History and development

The encoding was created in the early 1980s to address limitations in existing Japanese computing standards. Its development was driven by the need for a system that could mix half-width kana from JIS X 0201 with full-width kanji from JIS X 0208 in the same text without escape sequences. The underlying character sets were Japanese Industrial Standards, which at the time were issued under the Ministry of International Trade and Industry. A key motivation was compatibility with software and hardware designed for the American Standard Code for Information Interchange (ASCII), which dominated the global computer industry. The encoding was later formalized in an appendix to JIS X 0208:1997, and JIS X 0213 subsequently defined an extended form, Shift_JISX0213.

Technical details

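Shift JIS is a variable-width encoding in which each character occupies one or two bytes. A first byte in the range 0x81 to 0x9F or 0xE0 to 0xEF signals the start of a double-byte sequence representing a JIS X 0208 character; vendor extensions and Shift_JISX0213 also use first bytes from 0xF0 to 0xFC. The second byte ranges from 0x40 to 0x7E or from 0x80 to 0xFC. Single-byte characters occupy 0x00 to 0x7F for the JIS X 0201 Roman set, which matches ASCII except that 0x5C is the yen sign and 0x7E is the overline, and 0xA1 to 0xDF for half-width kana. This design creates challenges for parsers and text editors because the second byte can overlap with values used for single-byte codes, including ASCII values such as 0x5C (backslash), so a program cannot classify a byte without scanning from a known character boundary. The first-byte ranges were chosen to leave the C0 control codes and the half-width kana area untouched; as a result they fall in the 0x80 to 0x9F region that ISO/IEC 2022-based encodings reserve for C1 control codes, so Shift JIS does not conform to ISO/IEC 2022.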
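The byte ranges above can be illustrated with a minimal Python sketch. The function names (`is_lead_byte`, `is_trail_byte`, `split_units`, `kuten_to_sjis`) are hypothetical helpers written for this example, not part of any standard library. The scanner accepts first bytes up to 0xFC, which covers extended variants as well as the base encoding, and it passes unassigned single bytes through unchanged rather than rejecting them. The last function shows the arithmetic that "shifts" a JIS X 0208 row and cell number (each 1 to 94) into a two-byte code.

```python
def is_lead_byte(b: int) -> bool:
    """True if b opens a double-byte sequence (0xF0-0xFC occur only in extended variants)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_trail_byte(b: int) -> bool:
    """True if b is a valid second byte; note that 0x40-0x7E overlaps ASCII."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def split_units(data: bytes) -> list[bytes]:
    """Split a byte string into one- and two-byte character units, scanning from the start."""
    units = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            if i + 1 >= len(data) or not is_trail_byte(data[i + 1]):
                raise ValueError(f"truncated or invalid double-byte sequence at offset {i}")
            units.append(data[i:i + 2])
            i += 2
        else:
            # ASCII / JIS X 0201 Roman, half-width kana, or an unassigned byte.
            units.append(data[i:i + 1])
            i += 1
    return units


def kuten_to_sjis(row: int, cell: int) -> bytes:
    """Map a JIS X 0208 row/cell pair (1-94 each) to its two Shift JIS bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # Two rows share one lead byte; rows 63-94 jump over the half-width kana area.
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)
    if row % 2 == 1:
        # Odd rows use trail bytes 0x40-0x9E, skipping 0x7F.
        trail = cell + 0x3F + (1 if cell >= 64 else 0)
    else:
        # Even rows use trail bytes 0x9F-0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])


if __name__ == "__main__":
    # 'A', half-width kana 0xB1, then the kanji at row 20, cell 33.
    print(split_units(b"A\xb1\x8a\xbf"))   # [b'A', b'\xb1', b'\x8a\xbf']
    print(kuten_to_sjis(20, 33).hex())     # 8abf
```

Because a second byte can look like an ASCII character, `split_units` must always start at a known character boundary; beginning in the middle of a string can misread a second byte as a single-byte character.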

Character set

The primary repertoire is that of the JIS X 0208 standard, which includes more than 6,000 kanji along with hiragana, katakana, Cyrillic letters, Greek letters, and various symbols. It also fully incorporates the JIS X 0201 Roman character and half-width kana set. Vendor extensions, such as Microsoft's Code page 932, added characters in otherwise unused rows, including NEC special characters such as circled numbers and Roman numerals, and IBM extension kanji, many of them used in personal names. The JIS X 0212 supplement defined thousands more characters, but Shift JIS has no code space for it, so those characters were available only in other encodings such as EUC-JP, leading to fragmented adoption. The Universal Coded Character Set, ISO/IEC 10646, ultimately aimed to supersede this fragmented landscape.
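The gap between the base repertoire and a vendor extension can be observed with Python's built-in codecs, where `shift_jis` follows the JIS X 0208 repertoire and `cp932` follows Microsoft's Code page 932. The following is a small sketch; the helper name `can_encode` is an assumption made for this example.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    circled_one = "\u2460"  # CIRCLED DIGIT ONE, an NEC special character in Code page 932

    print(can_encode(circled_one, "cp932"))      # True
    print(can_encode(circled_one, "shift_jis"))  # False

    # The same two bytes can also map to different Unicode characters.
    wave = b"\x81\x60"
    print(hex(ord(wave.decode("shift_jis"))))    # 0x301c (WAVE DASH)
    print(hex(ord(wave.decode("cp932"))))        # 0xff5e (FULLWIDTH TILDE)
```

Text produced on Windows is therefore usually safer to decode as `cp932` than as `shift_jis`, although which variant a given file uses has to be established from its origin rather than from the bytes alone.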

Usage and adoption

It became the *de facto* standard encoding for the Japanese-language versions of the Microsoft Windows operating system, notably in Windows 95 and Windows 98. The classic Mac OS used its own variant, MacJapanese, and Japanese mobile carriers later assigned emoji to otherwise unused regions of the Shift JIS code space. It was extensively used in early Japanese web pages and bulletin board systems, where it gave its name to Shift JIS art, a form of text art also known as AA (ASCII art). Vendors such as IBM and NEC implemented their own variants, for example on NEC's PC-9800 series. The encoding was also prevalent in the video game industry for titles released on platforms like the Super Nintendo Entertainment System and Sony PlayStation.

Compatibility and issues

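A major issue is the existence of multiple, slightly incompatible variants, such as those from Microsoft, IBM, and Apple Inc.; text decoded with the wrong variant appears as mojibake (garbled text). The encoding's structure also makes it susceptible to security vulnerabilities: because a second byte can coincide with an ASCII value such as 0x5C (backslash), software that processes Shift JIS byte by byte can mistake part of a character for a path separator or escape character, a weakness exploited in attacks such as directory traversal. Conversion of the JIS X 0208 repertoire to and from EUC-JP and ISO-2022-JP is algorithmic, but vendor extension characters and, in the case of ISO-2022-JP, half-width kana have no counterpart in the target encoding, so conversion is often lossy in practice. The rise of Unicode, especially UTF-8, has largely supplanted Shift JIS on the modern World Wide Web and in internationalization efforts. Legacy systems, such as COBOL applications and mainframe databases, still face significant migration challenges.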
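The second-byte hazard can be reproduced in a few lines of Python. This is an illustrative sketch only: the file name and the helpers `naive_split` and `safe_split` are hypothetical, and the code does not model any specific historical exploit. The kanji 表 is encoded as 0x95 0x5C, so its second byte is identical to an ASCII backslash.

```python
def naive_split(raw: bytes) -> list[bytes]:
    """Byte-oriented path split that is unaware of double-byte characters (unsafe)."""
    return raw.split(b"\\")


def safe_split(raw: bytes, encoding: str = "shift_jis") -> list[str]:
    """Decode to text first, then split on real backslash characters."""
    return raw.decode(encoding).split("\\")


if __name__ == "__main__":
    name = "\u8868\u793a.txt"       # a file name beginning with the kanji 表
    raw = name.encode("shift_jis")

    print(b"\\" in raw)             # True, although the name contains no backslash
    print(len(naive_split(raw)))    # 2: the kanji was cut in half
    print(len(safe_split(raw)))     # 1: the name is kept intact
```

The general design choice is to decode bytes to Unicode text at the input boundary and to apply escaping, path handling, and validation only to the decoded text.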
