LLMpedia: The first transparent, open encyclopedia generated by LLMs

Unicode TR35

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Unicode TR35
Name: TR35
Title: Unicode TR35
Status: Published
Organization: Unicode Consortium
Domain: Text encoding and processing
First published: 2005
Latest revision: 2023

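Unicode TR35, formally Unicode Technical Standard #35 (UTS #35), is a specification published by the Unicode Consortium. It defines the Unicode Locale Data Markup Language (LDML), the XML format in which the Common Locale Data Repository (CLDR) expresses locale data, including locale-specific tailorings of the Unicode Collation Algorithm and of text segmentation rules. The algorithms themselves, such as collation and Unicode Normalization, are defined in separate reports; TR35 covers the data that parameterizes them. It serves as a bridge between the Unicode Standard and implementers in software projects, libraries, and vendors, and it is relevant to work at standards bodies such as W3C, IETF, ISO/IEC, ECMA International, and IEEE. Through CLDR-derived data, TR35 influences deployments across platforms including Microsoft Windows, Apple macOS, Google Android, Linux, and enterprise systems from Oracle Corporation and IBM.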

Overview

TR35 organizes the requirements, data structure conventions, and profiling mechanisms used when applying Unicode algorithms in products and specifications. It complements the core Unicode Standard by specifying profiles and conventions relied upon by stakeholders such as the World Wide Web Consortium (W3C), IETF working groups, and regional standards bodies such as ETSI and CEN. The report addresses interoperability topics relevant to projects including Mozilla Firefox, the Chromium Project, LibreOffice, Apache Software Foundation projects, and cloud platforms such as Amazon Web Services and Microsoft Azure.

Terminology and Scope

TR35 defines specialized terms and scope boundaries for implementers, building on terms established by other Unicode Consortium publications, ISO/IEC 10646, and related specifications such as those of the W3C Internationalization Working Group and the IETF language tag work (BCP 47). Its terminology covers entities such as collation elements, normalization forms, grapheme clusters, and segmentation rules, as applied in environments ranging from POSIX locales to Windows locale identifiers (LCIDs) and the locale data maintained by CLDR and the ICU Project. The scope explicitly excludes low-level protocol definitions maintained by IETF working groups and full linguistic annotation standards governed by TEI and ISO committees, while remaining aligned with internationalization practices at companies such as Facebook, Twitter, and LinkedIn.
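Two of the terms above, normalization forms and grapheme clusters, can be illustrated with a minimal Python sketch. It uses only the standard `unicodedata` module. The helper names `canonically_equivalent` and `approximate_clusters` are hypothetical, and the cluster grouping is a deliberate simplification that only attaches combining marks to the preceding character; it is not the full segmentation algorithm.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def approximate_clusters(text):
    """Group each character with the combining marks that follow it.

    Assumption: this is only a rough stand-in for grapheme cluster
    segmentation, which has many more rules than this.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                        # code point comparison
print(canonically_equivalent(composed, decomposed))  # comparison after NFD
print(len(approximate_clusters("e\u0301a")))         # number of approximate clusters
```

The two strings differ as code point sequences but compare equal once both are decomposed, which is why implementations usually normalize before comparing or indexing text.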

Algorithmic Specifications

TR35 details how locale data parameterizes core Unicode algorithms: the Unicode Collation Algorithm (UCA), Unicode Normalization Forms, Grapheme Cluster Boundaries, Word and Sentence Boundaries, and Bidirectional Algorithm behavior. The algorithms themselves are defined in their own reports: UTS #10 for collation, UAX #15 for normalization, UAX #29 for segmentation, and UAX #9 for the Bidirectional Algorithm. Implementations referenced include ICU Project routines, glibc locale layers, Java Development Kit string libraries, and the text processing stacks of Python and Node.js. The report prescribes parameterization mechanisms such as tailoring rules, weight tables, and versioning conventions that affect products such as Elasticsearch, PostgreSQL, MySQL, and Google's search engines. It also describes interaction with external libraries such as libxml2 and rendering systems such as HarfBuzz and Pango.
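The idea of a tailoring, a locale-specific adjustment to a default ordering, can be sketched in a few lines of Python. This is a toy model under stated assumptions: it assigns a single rank per letter, whereas the Unicode Collation Algorithm uses multi-level weights, and the names `make_sort_key` and `swedish_key` are hypothetical. The only linguistic fact relied on is that Swedish sorts å, ä, and ö after z, in that order.

```python
import unicodedata

BASE_ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def make_sort_key(tailored_alphabet):
    """Build a sort-key function from an ordered alphabet (toy tailoring)."""
    alphabet = unicodedata.normalize("NFC", tailored_alphabet)
    rank = {ch: i for i, ch in enumerate(alphabet)}
    fallback = len(rank)  # characters outside the alphabet sort last

    def sort_key(word):
        normalized = unicodedata.normalize("NFC", word).casefold()
        # Pair the rank with the character so unknown characters still
        # compare deterministically among themselves.
        return [(rank.get(ch, fallback), ch) for ch in normalized]

    return sort_key


# Assumed tailoring: a-z followed by a-ring, a-diaeresis, o-diaeresis.
swedish_key = make_sort_key(BASE_ALPHABET + "\u00e5\u00e4\u00f6")

# "öl", "zebra", "ärlig", "apa", "åka", written with escapes for robustness.
words = ["\u00f6l", "zebra", "\u00e4rlig", "apa", "\u00e5ka"]

code_point_order = sorted(words)
tailored_order = sorted(words, key=swedish_key)

print(code_point_order)
print(tailored_order)
```

Plain code point order places ä before å, while the tailored key places å first, so the two results differ for the same input. Production systems would normally delegate this to a collation library rather than hand-built rank tables.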

Conformance and Test Data

TR35 specifies conformance criteria and test-data recommendations used by vendors and standards bodies. The test suites it mentions are produced and consumed by the Unicode Consortium's own test repositories, ICU Project regression suites, and community datasets used by the Mozilla Foundation and Apache Software Foundation projects. Conformance provisions influence certification and compliance work in product lines from Oracle Corporation, SAP SE, and Red Hat. The report also references interoperability testing practices used in events organized by the W3C Internationalization Working Group and at IETF interoperability meetings, and it informs automated testing in continuous integration systems hosted on GitHub and GitLab.
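A data-driven conformance check of the kind described above might look like the following Python sketch. The line layout is modeled on the Unicode Character Database file `NormalizationTest.txt`, where five semicolon-separated fields of hexadecimal code points give a source string and its expected NFC, NFD, NFKC, and NFKD forms. That file belongs to the normalization specification rather than to TR35, so it serves here only as an illustration; the function names and the sample line's comment text are assumptions.

```python
import unicodedata


def parse_code_points(field):
    """Turn a field such as '0044 0307' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_normalization_line(line):
    """Check one test line against the local unicodedata implementation.

    Returns a dict mapping each normalization form to True or False.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    fields = data.split(";")[:5]          # ignore the empty field after the last ';'
    source, nfc, nfd, nfkc, nfkd = (parse_code_points(f) for f in fields)
    expected = {"NFC": nfc, "NFD": nfd, "NFKC": nfkc, "NFKD": nfkd}
    return {
        form: unicodedata.normalize(form, source) == value
        for form, value in expected.items()
    }


# Illustrative line for U+1E0A LATIN CAPITAL LETTER D WITH DOT ABOVE.
SAMPLE_LINE = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # sample entry"

results = check_normalization_line(SAMPLE_LINE)
print(results)
print(all(results.values()))
```

Running every line of a test file through such a function and reporting the failures is one simple way to wire conformance data into a continuous integration job.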

Implementation and Use Cases

Practical use cases for TR35 span text search, sorting, collation-sensitive databases, indexing, user-interface locale handling, and cross-platform data interchange. Implementers include database engines such as PostgreSQL and MySQL, search systems such as Apache Lucene and Elasticsearch, office suites such as LibreOffice and Microsoft Office, and the web platform engines Blink and Gecko. Mobile and embedded contexts include the Android Open Source Project and device firmware from vendors such as Samsung and Qualcomm. TR35-driven profiles are important for internationalized e-commerce at Alibaba Group and eBay, digital libraries at the Library of Congress and the British Library, and multilingual social platforms including YouTube and Instagram.
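For the text search use case, a common simplification is to index and query on a folded key that ignores case and accents. The Python sketch below is an assumption-laden approximation, not a method taken from TR35: it case-folds, decomposes, and drops combining marks, whereas collation-based search would compare strings at a chosen strength level. The name `search_key` is hypothetical.

```python
import unicodedata


def search_key(text):
    """Fold case and strip combining marks to build a loose match key."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    stripped = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) == 0
    )
    return unicodedata.normalize("NFC", stripped)


def matches(query, candidate):
    """Return True when the folded query occurs inside the folded candidate."""
    return search_key(query) in search_key(candidate)


print(search_key("R\u00e9sum\u00e9"))                    # precomposed accents
print(search_key("RE\u0301SUME\u0301"))                  # combining accents, upper case
print(matches("resume", "My R\u00e9sum\u00e9 (2023)"))
```

Stripping accents is not appropriate for every language, since some locales treat accented letters as distinct letters, which is one reason locale-specific data matters for search.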

History and Revisions

TR35 has evolved through multiple revisions to reflect updates in the Unicode Character Database and algorithmic refinements coordinated with the Unicode Consortium release cycle. Editorial and technical contributions have come from working groups involving participants from Apple Inc., Microsoft Corporation, Google LLC, IBM Corporation, and independent experts who are often also active in W3C and IETF discussions. Historical milestones align with major Unicode versions and with interoperability efforts driven by projects such as the ICU Project and CLDR; maintenance actions address issues raised in the public issue trackers used by the Unicode Consortium and in community repositories on GitHub. Recent revisions incorporate changes to normalization, collation tailoring, and segmentation compatibility that reflect practices across platforms including Android, iOS, Windows, and major Unix-like systems.

Category:Unicode