ACL Special Interest Group on Linguistic Annotation

Name	ACL Special Interest Group on Linguistic Annotation
Abbreviation	SIGLA
Formation	1990s
Type	Special interest group
Headquarters	Philadelphia
Parent organization	Association for Computational Linguistics

The ACL Special Interest Group on Linguistic Annotation (SIGLA) is a community within the Association for Computational Linguistics dedicated to the development, evaluation, and dissemination of linguistic annotation standards, tools, and corpora. It coordinates efforts among researchers and practitioners affiliated with institutions such as Stanford University, Massachusetts Institute of Technology, University of Cambridge, University of Pennsylvania, and University of Edinburgh, and interfaces with projects at Google, Microsoft Research, Facebook AI Research, IBM Research, and Amazon Web Services. The group has influenced annotation work used in initiatives including Penn Treebank, OntoNotes, Universal Dependencies, PropBank, and WordNet.

History

The group traces its origins to early collaborations among scholars linked to the Brown Corpus, Treebank, DARPA, the ACL Anthology, and the European Language Resources Association (ELRA), and to workshops co-located with conferences such as ACL, COLING, EMNLP, NAACL, and LREC. Founders and early contributors included researchers from Carnegie Mellon University, Johns Hopkins University, University of Toronto, Princeton University, Yale University, and University of Texas at Austin, who built on annotations used in projects like the Susanne Corpus, CUS, and corpora developed at British National Corpus centers. The SIG's formation paralleled initiatives led by organizations such as ISO, ELRA, and the National Institute of Standards and Technology, as well as European Commission funding programs that prioritized shared linguistic resources.

The SIG's objectives include defining annotation schemas relevant to communities working with datasets produced by Google Books, Project Gutenberg, the Wikimedia Foundation, and Europarl, and with digital archives at the Library of Congress. It aims to standardize the annotation layers (morphology, syntax, semantics, discourse, and pragmatics) used by teams at Columbia University, University of California, Berkeley, University of Washington, University of Michigan, and University of Oxford. The scope explicitly covers interoperability across frameworks such as TEI, RDF, OWL, and JSON-LD, as well as links to resources maintained by LDC, ELRA, and CENDARI and to national corpora like the Corpus of Contemporary American English.
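As an illustration of what layered, interoperable annotation can look like in practice, the following minimal sketch stores labels from several of the annotation layers named above as standoff records over one text and serializes them as JSON. The record shape is loosely modelled on the W3C Web Annotation vocabulary (`TextualBody`, `TextPositionSelector`); that choice, the `layer` key, the `urn:example:doc1` identifier, and the sample labels are all assumptions made for this example and are not drawn from any SIG recommendation.

```python
import json
from collections import defaultdict

TEXT = "Dogs bark loudly"


def make_annotation(layer, label, start, end, source="urn:example:doc1"):
    """Build one standoff annotation over the character span [start, end)."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": label, "purpose": "tagging"},
        "target": {
            "source": source,
            "selector": {"type": "TextPositionSelector", "start": start, "end": end},
        },
        # Hypothetical extension key: it is not defined by the context above,
        # so a strict JSON-LD processor would ignore it.
        "layer": layer,
    }


def covered_text(annotation, text):
    """Return the substring of `text` that the annotation points at."""
    selector = annotation["target"]["selector"]
    return text[selector["start"]:selector["end"]]


def by_layer(annotations):
    """Group annotation labels by layer name."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["layer"]].append(ann["body"]["value"])
    return dict(grouped)


annotations = [
    make_annotation("morphology", "Number=Plur", 0, 4),
    make_annotation("syntax", "nsubj", 0, 4),
    make_annotation("syntax", "root", 5, 9),
    make_annotation("semantics", "Agent", 0, 4),
]

if __name__ == "__main__":
    print(json.dumps(annotations[0], indent=2))
    print(by_layer(annotations))
```

Because each record refers to the text only by character offsets, layers can be added, removed, or exchanged independently without altering the underlying text, which is the property that makes standoff formats convenient for interoperability.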

SIG activities include coordinated annotation campaigns; shared tasks influenced by competitions like SemEval, CoNLL, BioNLP, and the Text REtrieval Conference; and collaborative projects with labs at MIT CSAIL, Allen Institute for AI, SRI International, Max Planck Institute for Psycholinguistics, and INRIA. Projects have produced multilingual corpora drawing on efforts from Europarl, OPUS, Common Voice, and Global Voices, and on language documentation initiatives affiliated with UNESCO and the Endangered Languages Project. The SIG sponsors tool development used by teams at Universität Stuttgart, University of Leiden, University of Helsinki, and Universidad Nacional Autónoma de México, integrating software such as UIMA, GATE, Brat, INCEpTION, and ELAN.

SIG outputs include recommendations that informed standards like Universal Dependencies, ISO 24619, and ISOcat, and annotation models used in Penn Treebank, PropBank, FrameNet, OntoNotes, and WordNet. It curates resources deposited with the Linguistic Data Consortium, ELRA, and repositories on GitHub, as well as corpora cited in proceedings of ACL, NAACL, EMNLP, and LREC. The SIG advances metadata and licensing practices aligning with frameworks used by Creative Commons, Open Data Commons, ORCID, and institutional repositories at Harvard University, MIT Libraries, and the British Library.

Governance follows practices adopted by peer groups in the ACL ecosystem, with elected officers drawn from universities and labs such as Stanford University, Johns Hopkins University, University of Edinburgh, University of Maryland, Microsoft Research, and Google Research. Membership comprises academics, industry researchers, and curators affiliated with LDC, ELRA, NIST, Max Planck Institute for Informatics, and Facebook AI Research, and with consortia like CLARIN and META-NET. The SIG coordinates with committees and working groups modeled after governance at the Association for Computational Linguistics, IEEE, and ACM.

The SIG organizes special sessions and workshops co-located with major events such as ACL, EMNLP, COLING, LREC, NAACL, and ICASSP, and with tutorials at NeurIPS. It runs recurring tutorials taught by experts from Stanford University, MIT, Princeton University, Johns Hopkins University, and University of Cambridge, and sponsors shared tasks akin to the CoNLL Shared Task and SemEval tracks. Workshops have featured keynote speakers from Google Research, Microsoft Research, Facebook AI Research, IBM Research, and DeepMind.

The SIG's work has influenced annotation practices in major corpora and in toolchains such as Stanford CoreNLP, spaCy, NLTK, AllenNLP, and Hugging Face, as well as TensorFlow- and PyTorch-based pipelines. Its standards improved interoperability for European Commission research programs, archival projects at the Library of Congress, and multilingual efforts such as Common Voice and Wikidata. Alumni and contributors have held positions at Google, Facebook, Amazon, Apple, and Microsoft, and at universities including Harvard University, University of Oxford, University of Cambridge, and Stanford University, propagating best practices in annotation across industry and academia.