LLMpedia: The first transparent, open encyclopedia generated by LLMs

BEIR

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: ICRP Hop 5
Expansion funnel: 64 extracted → 0 after dedup → 0 after NER → 0 enqueued
BEIR
Name: BEIR
Type: Benchmark
Founded: 2021
Focus: Information retrieval evaluation
Country: International

BEIR (Benchmarking IR) is a benchmark suite created to evaluate information retrieval models across diverse domains, tasks, and data distributions. It was introduced in 2021 to provide a standardized framework for comparing neural and non-neural retrievers on robustness, generalization, and cross-domain performance. Developed by researchers at the UKP Lab of TU Darmstadt together with academic and industrial collaborators, BEIR quickly became a reference point for work on dense retrieval, sparse retrieval, and transfer learning in retrieval research.

Background and purpose

BEIR was motivated by prior efforts to benchmark retrieval systems, such as the Text Retrieval Conference (TREC), the TREC Deep Learning Track, and MS MARCO, along with datasets from groups like NIST, Google Research, and Microsoft Research. The initiative sought to bring together heterogeneous corpora and tasks, influenced by applications in natural language processing, question answering, conversational AI, and biomedical literature mining, to test models beyond the training distributions exemplified by corpora like MS MARCO and SQuAD. Key goals included assessing the generalization reported by teams at venues such as ACL, EMNLP, and NeurIPS, and enabling reproducible comparisons similar to those achieved by evaluations like ImageNet for computer vision and GLUE for language understanding.

Contributors and adopters have included researchers affiliated with institutions such as Facebook AI Research, University of Amsterdam, University of Washington, Allen Institute for AI, and industrial labs like Huawei Noah's Ark Lab and Amazon Science. The design of BEIR aligns with the broader movement in the community exemplified by workshops at SIGIR, ECIR, and IUI to promote robust, cross-domain evaluation.

Dataset composition and benchmarks

BEIR aggregates multiple datasets spanning different domains, genres, and annotation styles. Included collections draw inspiration from, and sometimes directly include, datasets such as NQ (Natural Questions), HotpotQA, TREC-COVID, FiQA, BioASQ, SciDocs, and other corpora used in retrieval and QA research. The suite comprises passage-level and document-level corpora, queries originating from crowdsourcing initiatives, expert annotations from challenges like CLEF and BioNLP, and query-document relevance judgments generated in evaluation campaigns such as TREC and NIST competitions. BEIR also contains datasets reflecting tasks encountered in applied settings like e-commerce and social media, linked to sources such as Amazon, Twitter, and Reddit.

Each constituent dataset preserves its original train/validation/test splits where available, facilitating zero-shot evaluation when models are trained on external sources such as MS MARCO or ColBERT training sets. The benchmark provides standardized retrieval corpora, query sets, and qrels to ensure that experiments from groups at institutions like Stanford University, Carnegie Mellon University, and ETH Zurich remain comparable.
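In practice, each BEIR dataset distributes its relevance judgments as a tab-separated qrels file mapping query IDs to document IDs with graded relevance scores. The sketch below parses such a file into the nested-dictionary shape most evaluation code consumes; the snippet and its IDs are invented for illustration, not taken from any actual BEIR dataset:

```python
import csv
import io

# Hypothetical snippet of a BEIR-style qrels file: tab-separated, with a
# header row and one (query-id, corpus-id, relevance score) judgment per line.
QRELS_TSV = (
    "query-id\tcorpus-id\tscore\n"
    "q1\tdoc3\t2\n"
    "q1\tdoc7\t1\n"
    "q2\tdoc1\t1\n"
)

def load_qrels(tsv_text):
    """Parse qrels text into {query_id: {doc_id: relevance}}, the nested
    dictionary shape that most IR evaluation tooling expects."""
    qrels = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels

qrels = load_qrels(QRELS_TSV)
```

Keeping judgments in this per-query dictionary form makes zero-shot evaluation straightforward: a model trained elsewhere only needs to produce ranked document IDs per query to be scored against the qrels.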

Evaluation tasks and metrics

BEIR frames multiple retrieval tasks, including classical ad hoc retrieval, dense retrieval for passage ranking, retrieval for open-domain question answering (as used in retrieval-augmented systems), and retrieval for domain-specific information needs such as biomedical fact-finding and legal discovery. Representative tasks echo evaluation formats used in campaigns like TREC-COVID for pandemic-related search, BioASQ for biomedical retrieval, and CLEF eHealth for medical information retrieval.

Evaluation metrics provided by BEIR mirror those widely adopted in IR and NLP communities: normalized Discounted Cumulative Gain (nDCG) as used in TREC evaluations, recall@k common in rankings evaluated by groups at Microsoft Research, mean Reciprocal Rank (MRR) popularized in question answering tasks associated with SQuAD and Natural Questions, and Precision@k used in industrial settings at companies like Google and Amazon. BEIR also encourages reporting latency and index size to reflect practical trade-offs considered by teams attending venues such as SIGIR and WWW.
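The ranking metrics named above are compact enough to implement directly. A minimal sketch (function names are my own, not BEIR's API):

```python
import math

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the observed ranking divided by DCG of the ideal one.
    `relevances` are the graded judgments of returned docs, in rank order."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant set found among the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant document; 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```

A ranking already in ideal order scores an nDCG of 1.0, e.g. `ndcg_at_k([3, 2, 1], 3)`, while any misordering of graded judgments pulls the score below 1.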

Experimental results and findings

Early experiments with BEIR revealed systematic trends: dense retrievers based on BERT and variants such as RoBERTa and DistilBERT excel on in-domain datasets like MS MARCO but often degrade in zero-shot settings on out-of-domain collections such as BioASQ and TREC-COVID. Sparse methods leveraging term-weighting schemes, building on ideas from BM25 as implemented in systems such as Lucene and Elasticsearch, sometimes outperform dense models on datasets with high lexical overlap. Hybrid approaches that combine dense encoders with sparse representations, similar to techniques explored at Facebook AI Research and Google Research, frequently yield more robust performance across BEIR's heterogeneous corpora.
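One common way to realize such a hybrid is late fusion: normalize each retriever's scores into a shared range, then take a convex combination. A minimal sketch with invented helper names, assuming each retriever returns a `{doc_id: score}` map:

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} map into [0, 1] so that BM25-style and
    dense similarity scores become comparable before mixing."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(sparse, dense, alpha=0.5):
    """Late-fusion hybrid: convex combination of normalized sparse and dense
    scores; a doc missing from one retriever contributes 0 from that side."""
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    return {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in set(sparse_n) | set(dense_n)
    }
```

The mixing weight `alpha` is typically tuned per collection; the robustness reported for hybrids comes from either component being able to rescue queries the other handles poorly.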

Comparative studies involving methods such as DPR (Dense Passage Retrieval), ANCE, and re-ranking pipelines using cross-encoders fine-tuned on datasets like MS MARCO indicate that re-ranking with cross-attention models often improves top-k metrics at the cost of higher latency, a trade-off discussed in literature from NeurIPS and ACL proceedings. Results published by research groups at University College London, University of Cambridge, and Tsinghua University show that dataset-specific fine-tuning or domain adaptation can recover much of the performance gap on biomedical and scientific collections.
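The retrieve-then-rerank pattern behind these pipelines is itself simple to express; the latency cost lives entirely in the scorers. A sketch with a toy token-overlap scorer standing in for both BM25 and a cross-encoder (in a real pipeline the two scorers would differ sharply in cost and quality):

```python
def rerank_pipeline(query, corpus, first_stage_score, cross_score,
                    k_retrieve=100, k_return=10):
    """Two-stage retrieve-then-rerank: a cheap first-stage scorer prunes the
    corpus to k_retrieve candidates, then an expensive scorer (standing in
    for a cross-encoder) reorders them and keeps the top k_return."""
    candidates = sorted(corpus, key=lambda doc: first_stage_score(query, doc),
                        reverse=True)[:k_retrieve]
    return sorted(candidates, key=lambda doc: cross_score(query, doc),
                  reverse=True)[:k_return]

def token_overlap(query, doc):
    """Toy lexical scorer: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

corpus = [
    "deep learning for retrieval",
    "bm25 ranking",
    "dense retrieval models",
    "cats and dogs",
]
top = rerank_pipeline("dense retrieval", corpus,
                      first_stage_score=token_overlap,
                      cross_score=token_overlap,
                      k_retrieve=3, k_return=2)
```

Shrinking `k_retrieve` cuts the number of expensive cross-attention calls, which is exactly the top-k-quality-versus-latency trade-off discussed above.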

Limitations and criticisms

Critics note that BEIR, while broad, cannot encompass all domain-specific retrieval challenges and may reflect biases inherent in its constituent datasets, which are drawn from sources like Reddit or commercial platforms such as Twitter and Amazon. Some researchers from Princeton University and Harvard University have argued that reliance on automatic pairing of queries and documents, and on legacy qrels, limits evaluation of the conversational and multi-turn retrieval scenarios explored in venues like SIGDIAL and AAAI. Concerns have also been raised about evaluation granularity, such as the adequacy of metrics like MRR for the complex relevance definitions used in legal and clinical settings associated with bodies like the WHO and journals like The Lancet.

Additionally, engineering constraints such as indexing choices and the hardware variability reported by labs at Google Research and Microsoft Research can affect reproducibility. Ongoing work by scholars at ETH Zurich, Imperial College London, and industrial partners aims to extend BEIR with more interactive, temporally aware, and privacy-preserving datasets to address these critiques.

Category:Information retrieval benchmarks