Corpus of Historical Japanese Language


Contents

Overview and Scope
Sources and Textual Content
Transcription, Annotation, and Encoding
Linguistic Features and Research Applications
Compilation Methodology and Editorial Principles
Accessibility, Licensing, and Tools

The Corpus of Historical Japanese Language is a scholarly electronic collection designed to support research in historical linguistics, philology, and literary studies by aggregating annotated texts from premodern and early modern Japan. The corpus integrates materials associated with texts such as the Man'yōshū, Kojiki, and Nihon Shoki and with periods such as the Heian period and Edo period, facilitating comparative work with archives like the National Diet Library, the libraries of Kyoto University, and projects associated with the International Research Center for Japanese Studies. It serves as a bridge between manuscript studies at repositories such as the Tokyo National Museum and computational analysis conducted at centers including RIKEN and the Max Planck Institute for Human Cognitive and Brain Sciences.

Overview and Scope

The corpus covers materials from the Asuka period through the Meiji Restoration, emphasizing canonical texts such as the Man'yōshū, Kokin Wakashū, and Genji Monogatari, as well as collections held by the Imperial Household Agency. It also incorporates documents produced under administrations like the Tokugawa shogunate and diplomatic exchanges recorded around the time of the Treaty of Kanagawa. It spans poetic, narrative, legal, and administrative genres represented in archives such as the Historiographical Institute of the University of Tokyo and the Kokugakuin University Library, and aligns with comparative datasets from institutions like the British Library and the Library of Congress. Scope decisions reflect precedents set by corpora and digital libraries like the Perseus Digital Library, Project Gutenberg, and the Corpus of Historical American English.

Sources and Textual Content

Primary sources include imperial compilations such as the Shoku Nihongi, religious tracts preserved at Kōfuku-ji, diaries like the Tosa Nikki, court literature exemplified by the Makura no Sōshi, and legal codes comparable to the Taihō Code. Manuscript witnesses derive from holdings at the Nara National Museum, private family archives such as the Fujiwara clan collections, and scholarly editions published by researchers affiliated with the Japan Academy and the Academia Sinica. The corpus also ingests marginalia and commentaries by figures like Kamo no Mabuchi, Motoori Norinaga, and Kenkō, alongside early modern commercial records linked to Edo and to domains such as Satsuma Domain.

Transcription, Annotation, and Encoding

Transcription follows paleographic standards used by the Japan Center for Asian Historical Records, while encoding adopts schemas influenced by the Text Encoding Initiative (TEI) and initiatives at the World Wide Web Consortium. Annotation layers include morphological tagging that references grammars by scholars like Samuel E. Martin, phonological notes consistent with reconstructions by Bjarke Frellesvig, and glosses citing classical studies from the International Research Center for Japanese Studies. Unicode normalization aligns with the code points used in National Institute of Information and Communications Technology projects, and TEI headers record provenance in a manner comparable to metadata practice at the Open Archives Initiative.
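A minimal sketch of the two encoding steps mentioned above, Unicode normalization and a TEI header that records provenance, might look like the following. It is an illustration only: the choice of NFC, the function names `normalize_text` and `build_tei_header`, and the repository and shelfmark values are assumptions, not the corpus's documented pipeline, and the output is not validated against the TEI schema.

```python
import unicodedata
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace


def normalize_text(text: str, form: str = "NFC") -> str:
    """Apply a Unicode normalization form (NFC is an assumption here)."""
    return unicodedata.normalize(form, text)


def build_tei_header(title: str, repository: str, shelfmark: str) -> ET.Element:
    """Build a bare-bones teiHeader recording where a witness is held."""

    def tei(tag: str) -> str:
        return f"{{{TEI_NS}}}{tag}"

    header = ET.Element(tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))

    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title

    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Illustrative record, not published."

    # Provenance: the holding repository and shelfmark of the manuscript witness.
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ms_desc = ET.SubElement(source_desc, tei("msDesc"))
    ms_id = ET.SubElement(ms_desc, tei("msIdentifier"))
    ET.SubElement(ms_id, tei("repository")).text = repository
    ET.SubElement(ms_id, tei("idno")).text = shelfmark
    return header


if __name__ == "__main__":
    # A kana followed by a combining voiced sound mark composes to one code point.
    decomposed = "\u304b\u3099"  # か + combining dakuten
    print(len(decomposed), len(normalize_text(decomposed)))

    ET.register_namespace("", TEI_NS)
    # Repository and shelfmark below are placeholders.
    header = build_tei_header("Tosa Nikki", "Example Repository", "MS-0000")
    print(ET.tostring(header, encoding="unicode"))
```

Normalizing before annotation keeps visually identical strings from being counted as different word forms. Whether a project prefers NFC or a compatibility form such as NFKC depends on whether variant characters need to be preserved.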

Linguistic Features and Research Applications

Researchers use the corpus to study syntactic change observable between texts like the Man'yōshū and the Tales of Ise, phonological developments discussed by Roy Andrew Miller, lexical shifts noted by Shōichi Kato, and morphosyntactic phenomena analyzed in dissertations from universities such as the University of Tokyo and Kyoto University. Applications include stylometric analysis in the tradition of John Burrows, diachronic frequency studies akin to those using the Corpus of Contemporary American English, and machine learning models trained in collaboration with labs at the Tokyo Institute of Technology and Osaka University. Interdisciplinary projects link to digital humanities platforms like the Humanities Commons and computational linguistics conferences such as ACL.
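Stylometric analysis in the Burrows tradition is often carried out with Burrows's Delta, which compares two texts by the z-scored relative frequencies of the most frequent words. The sketch below is a simplified version under stated assumptions: texts are already segmented into non-empty token lists (historical Japanese has no whitespace word boundaries, so segmentation must happen upstream), the population standard deviation is used, and the function names and toy data are hypothetical rather than drawn from the corpus.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens: list[str], vocab: list[str]) -> list[float]:
    """Relative frequency of each vocabulary item in one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocab]


def burrows_delta(texts: dict[str, list[str]], a: str, b: str,
                  n_features: int = 30) -> float:
    """Mean absolute difference of z-scores over the most frequent tokens."""
    pooled = Counter()
    for tokens in texts.values():
        pooled.update(tokens)
    vocab = [word for word, _ in pooled.most_common(n_features)]

    profiles = {name: relative_freqs(tokens, vocab)
                for name, tokens in texts.items()}

    diffs = []
    for i in range(len(vocab)):
        column = [profile[i] for profile in profiles.values()]
        sd = pstdev(column)
        if sd == 0:
            continue  # feature does not vary across texts, so it carries no signal
        mu = mean(column)
        z_a = (profiles[a][i] - mu) / sd
        z_b = (profiles[b][i] - mu) / sd
        diffs.append(abs(z_a - z_b))
    return mean(diffs) if diffs else 0.0


if __name__ == "__main__":
    # Toy token lists made up for illustration; real input would be full texts.
    toy = {
        "text_1": ["は", "の", "に", "の", "は"],
        "text_2": ["は", "の", "に", "の", "は"],
        "text_3": ["を", "を", "の", "に", "を"],
    }
    print(burrows_delta(toy, "text_1", "text_2"))
    print(burrows_delta(toy, "text_1", "text_3"))
```

A smaller Delta indicates more similar usage of high-frequency words. With only a handful of texts the z-scores are unstable, so the measure is meaningful mainly over larger comparison sets.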

Compilation Methodology and Editorial Principles

Editorial decisions adhere to practices championed by editorial projects like the Cambridge Histories, with provenance vetting comparable to procedures at the British Library and collation methods inspired by the stemma codicum tradition. Principles include transparent source citation, diplomatic transcription where appropriate, normalized lemma assignment, historical phonetic notation following standards promoted by the International Phonetic Association, and peer review processes modeled on journals such as Monumenta Nipponica. The editorial board typically includes specialists from institutions like the National Institute for Japanese Language and Linguistics, representatives from university presses such as the University of Tokyo Press, and curators from museum partners.
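Collation, the comparison of manuscript witnesses that precedes any stemma, can be illustrated with a short character-level alignment. The sketch below uses Python's standard `difflib` module as a stand-in for a dedicated collation tool. The function name `collate` is hypothetical, and the second witness is a constructed variant of the opening words of the Genji Monogatari, not a reading taken from an actual manuscript.

```python
from difflib import SequenceMatcher


def collate(witness_a: str, witness_b: str) -> list[tuple[str, str, str]]:
    """List the points where two witnesses differ.

    Returns (operation, reading_in_a, reading_in_b) tuples, where operation
    is 'replace', 'delete', or 'insert' as reported by difflib.
    """
    matcher = SequenceMatcher(None, witness_a, witness_b, autojunk=False)
    variants = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            variants.append((op, witness_a[a_start:a_end], witness_b[b_start:b_end]))
    return variants


if __name__ == "__main__":
    base = "いづれの御時にか"
    other = "いつれの御時にか"  # constructed variant for illustration
    for op, reading_a, reading_b in collate(base, other):
        print(op, reading_a, reading_b)
```

A diplomatic transcription would keep both readings as recorded, while a normalized layer would map them to a single lemma. Recording variants as explicit tuples keeps those two layers separate.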

Accessibility, Licensing, and Tools

Access policies balance the open scholarship ideals exemplified by Creative Commons licensing with rights management negotiated with holders like the National Diet Library and private collectors linked to the Tokugawa family archives. Tools include concordancers inspired by the AntConc interface, visualization modules comparable to those developed at the Center for Digital Humanities at Princeton University, and APIs designed to interoperate with platforms such as GitHub and Zenodo. Outreach and training occur through workshops at venues like the International Congress of Asian and North African Studies and summer schools hosted by the International Research Center for Japanese Studies.
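The core operation of a concordancer is the keyword-in-context (KWIC) view, which lists each occurrence of a search term with a few tokens on either side. The following is a minimal sketch, not the AntConc implementation or the corpus's own API. The function name `kwic` is hypothetical, and the example segments the opening sentence of the Tosa Nikki by hand purely for illustration.

```python
def kwic(tokens: list[str], keyword: str,
         window: int = 3) -> list[tuple[list[str], str, list[str]]]:
    """Return (left context, keyword, right context) for every match."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Hand-segmented for illustration; a real pipeline would use a
    # morphological analyzer suited to the period of the text.
    tokens = ["男", "も", "す", "なる", "日記", "と", "いふ", "もの", "を",
              "女", "も", "し", "て", "み", "む", "とて", "する", "なり"]
    for left, key, right in kwic(tokens, "も", window=2):
        print("".join(left), f"[{key}]", "".join(right))
```

Working on token lists rather than raw strings lets the same function search surface forms, lemmas, or part-of-speech tags, depending on which annotation layer is passed in.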
