BioSchemas — LLMpedia

BioSchemas
Name	BioSchemas
Formation	2015
Type	Community-driven specification
Purpose	Improve life-sciences metadata for discovery and interoperability
Headquarters	Europe
Region served	Global

Contents

Overview
Schema specifications and components
Implementation and adoption
Tools and resources
Governance and community
Use cases and examples

BioSchemas is a community initiative that creates tailored metadata profiles for life-sciences resources, improving their discoverability, interoperability, and reuse across search engines, data repositories, and scholarly aggregators. It builds on web standards, chiefly the schema.org vocabulary, so that resources from laboratories, publishers, and data centers can be picked up by indexing services and knowledge graphs.

Overview

BioSchemas emerged from collaborations among actors including the European Bioinformatics Institute (EMBL-EBI), the National Center for Biotechnology Information, ELIXIR, Google, Wikidata, FAIRsharing, and the Research Data Alliance (RDA). The project builds on schema.org, a community-maintained vocabulary, and on World Wide Web Consortium standards such as JSON-LD, to define profiles for domain concepts like assays, datasets, software, and samples. Contributors have included staff from institutions such as the Wellcome Trust Sanger Institute, University of Oxford, University of Cambridge, Max Planck Society, and European Molecular Biology Laboratory, and from companies like Elsevier and Microsoft Research. The effort aligns with policy drivers from funders such as the European Commission and the National Institutes of Health, and with initiatives like GO FAIR and ELIXIR-CONVERGE, to make life-science outputs more FAIR (findable, accessible, interoperable, and reusable). BioSchemas interacts with repositories and platforms including Zenodo, Dryad, Figshare, GitHub, and ORCID, and with indexing systems such as Google Scholar and the wider Semantic Web ecosystem.

Schema specifications and components

The specifications define profiles that specialize schema.org types by constraining which properties are used, their cardinalities, and their expected value types for life-science use cases. Core components include entity types (e.g., Dataset, SoftwareSourceCode, Protocol), controlled properties (e.g., creator, license, version), and crosswalks to vocabularies and resources maintained by groups such as the UniProt Consortium, the Gene Ontology Consortium, the Sequence Read Archive, and the Protein Data Bank. Profiles reference persistent identifiers such as DOIs and accessions from registries such as BioSample, MGnify, and ArrayExpress to ensure unambiguous linkage. The specifications incorporate provenance elements compatible with the PROV-O model and align with metadata schemas from publishers including PLOS, Nature Research, and Science, enabling integration with citation infrastructure such as Crossref and DataCite.
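As a rough illustration of how a profile constrains a schema.org type, the sketch below checks a JSON-LD record against a small set of property rules. The rule table, the `check_against_profile` helper, and the DOI are hypothetical placeholders chosen for the example; they are not taken from any official BioSchemas profile, which should be consulted for the actual required and recommended properties.

```python
import json

# Hypothetical, simplified profile: property -> (requirement level, cardinality).
# This is an illustration only, not an official BioSchemas profile.
EXAMPLE_PROFILE = {
    "name": ("minimum", "one"),
    "description": ("minimum", "one"),
    "identifier": ("minimum", "many"),
    "license": ("recommended", "many"),
    "creator": ("recommended", "many"),
}


def check_against_profile(record, profile):
    """Return a list of human-readable problems found in a JSON-LD record."""
    problems = []
    for prop, (level, cardinality) in profile.items():
        value = record.get(prop)
        if value is None:
            # Only properties at the "minimum" level are treated as mandatory.
            if level == "minimum":
                problems.append(f"missing required property: {prop}")
            continue
        if cardinality == "one" and isinstance(value, list) and len(value) > 1:
            problems.append(f"property allows one value only: {prop}")
    return problems


# Example record using the schema.org Dataset type; the DOI is a placeholder.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example metagenomics dataset",
    "identifier": ["https://doi.org/10.1234/example"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

problems = check_against_profile(dataset, EXAMPLE_PROFILE)
print(json.dumps(dataset, indent=2))
print(problems)
```

Here the record lacks a description, so the check reports one missing required property, while the absent recommended property (creator) is not flagged. A real validator would also check value types and nested entities.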

Implementation and adoption

Adoption spans research infrastructures, journals, repositories, and software platforms. Implementers include digital archives like EMPIAR, sequence repositories such as the European Nucleotide Archive, institutional repositories at universities like University of Cambridge and Harvard University, and laboratory information management systems used at institutions including Cold Spring Harbor Laboratory and the Broad Institute. Publishers and platforms, including bioRxiv, PubMed Central, Wiley, and Springer Nature, have been engaged to improve article metadata. Search and indexing support from Google and Bing, and from knowledge graph initiatives like Wikidata and DBpedia, increases visibility. Community pilots have been run with funders, funding programmes, and infrastructures such as Wellcome Trust, Horizon 2020, and ELIXIR to assess impacts on discoverability and citation.

Tools and resources

A suite of tools supports authoring, validation, and harvesting of profiles. Validation utilities integrate with standards tooling from W3C and include command-line validators, JSON-LD editors, and plugins for platforms such as WordPress, GitLab, and Jupyter Notebook. Harvesters and converters map repository metadata to profiles used by Zenodo, Figshare, Dryad, and institutional repositories at Massachusetts Institute of Technology and Stanford University. Training materials and documentation have been produced with partners like FAIRsharing, ELIXIR-CONTRIBUTORS, and the Global Biodata Coalition. Software libraries in Python, JavaScript, and R facilitate integration with analytical platforms like Galaxy and workflow systems such as Nextflow.
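To make the harvesting step concrete, the following is a minimal sketch of how a harvester might pull embedded JSON-LD out of a landing page, using only the Python standard library. The `JsonLdExtractor` class, the `extract_jsonld` helper, and the sample page are assumptions made for illustration; they do not reproduce any particular BioSchemas tool.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed markup rather than abort the harvest.


def extract_jsonld(html):
    """Return every parseable JSON-LD block found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical landing page with one embedded schema.org record.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body><p>Landing page</p></body></html>
"""

records = extract_jsonld(SAMPLE_PAGE)
print([r["@type"] for r in records])
```

The sketch assumes each block holds a single JSON object; production harvesters typically also handle arrays of records, `@graph` containers, and markup injected by JavaScript.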

Governance and community

The initiative is community-driven, with governance via working groups, steering committees, and periodic hackathons organized alongside conferences like the Bioinformatics Open Source Conference, ISMB, and ELIXIR All Hands. Participating organizations include research infrastructures such as ELIXIR, INSDC, and EBI, funding bodies such as Wellcome Trust and the European Commission, and industry partners including Google and Elsevier. Contributions are coordinated through repositories on GitHub, and community discussion takes place on platforms like Slack and on mailing lists hosted by organizations such as EMBL. The project engages standards bodies like W3C and collaborates with registries such as Identifiers.org, alongside the work of its own community working groups.

Use cases and examples

Use cases cover machine-readable descriptions for datasets deposited in Dryad, software descriptions for code hosted on GitHub and archived via Zenodo, protocol metadata for workflows in Protocols.io, and sample descriptions linked to BioSample and ArrayExpress. Examples include enhanced metadata for structural biology entries in the Protein Data Bank, metagenomics datasets in MGnify, and variant datasets in projects like the 1000 Genomes Project and The Cancer Genome Atlas. Integration with author identifiers such as ORCID and citation services like Crossref improves attribution. Implementations have been reported to improve indexing in services like Google Dataset Search and to support semantic integration with knowledge bases like Wikidata and DBpedia.
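The software use case can be sketched as follows: a record for code hosted on GitHub and archived via Zenodo, with the author identified by an ORCID iD, serialized as a JSON-LD script element for a landing page. The helper names, repository URL, DOI suffix, and ORCID iD are all placeholders invented for this example, and the choice of properties is an assumption rather than a statement of what any BioSchemas profile requires.

```python
import json


def software_record(name, repo_url, archive_doi, author_name, orcid):
    """Build a schema.org SoftwareSourceCode description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "name": name,
        "codeRepository": repo_url,
        "identifier": f"https://doi.org/{archive_doi}",
        "author": {
            "@type": "Person",
            "@id": f"https://orcid.org/{orcid}",
            "name": author_name,
        },
    }


def as_script_tag(record):
    """Serialize a record as a JSON-LD <script> element for an HTML page."""
    # Escape "</" so the JSON cannot close the script element early.
    payload = json.dumps(record, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# All values below are placeholders, not real identifiers.
record = software_record(
    name="example-tool",
    repo_url="https://github.com/example-org/example-tool",
    archive_doi="10.5281/zenodo.0000000",
    author_name="A. Researcher",
    orcid="0000-0000-0000-0000",
)
snippet = as_script_tag(record)
print(snippet)
```

Using the ORCID URL as the author's `@id` lets consumers merge this description with other records about the same person, which is the attribution benefit described above.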
