| HumanEval | |
|---|---|
| Name | HumanEval |
| Type | Dataset |
| Creator | OpenAI |
| Introduced | 2021 |
| Domain | Code generation, Programming |
| License | MIT |
HumanEval
HumanEval is a benchmark dataset for assessing program synthesis and code-generation systems. It consists of 164 hand-written Python programming problems, each pairing a function signature and docstring with unit tests that measure functional correctness. The dataset has been used to evaluate large language models, code models, and automated programming systems across research institutions and industry labs.
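To make the format concrete, below is a hypothetical task in the HumanEval style; the function name, docstring, and tests are illustrative inventions, not items from the dataset. The model is shown the signature and docstring as a prompt and must produce a body that passes held-out unit tests.

```python
# Hypothetical HumanEval-style task (illustrative; not from the dataset).
# The prompt given to the model is the signature plus docstring.

def filter_below(numbers: list, threshold: int) -> list:
    """Return the elements of `numbers` that are strictly below `threshold`.

    >>> filter_below([1, 5, 10], 6)
    [1, 5]
    """
    return [n for n in numbers if n < threshold]  # one possible completion

# HumanEval tests follow a check(candidate) convention: the generated
# function is passed in and asserted against expected outputs.
def check(candidate):
    assert candidate([1, 5, 10], 6) == [1, 5]
    assert candidate([], 3) == []
    assert candidate([7, 8], 7) == []

check(filter_below)
```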
HumanEval was created to quantify how well generative models produce correct, executable Python code. It is cited in evaluations from organizations including OpenAI, Google Research, DeepMind, Microsoft Research, and Meta Platforms, and in academic work from Stanford University, MIT, Carnegie Mellon University, UC Berkeley, and the University of Washington. It is commonly reported alongside related code benchmarks such as CodeXGLUE, MBPP, and APPS, as well as contest-style evaluation tasks used in industry.
HumanEval consists of programming tasks that each combine a problem description embedded in a docstring, a function signature, a canonical solution, and a set of unit tests. Problems span a range of difficulty, from simple string and list manipulation to algorithmic puzzles in the style of contest platforms such as Codeforces and LeetCode. The dataset and its evaluation harness are distributed under the MIT License on GitHub, with the accompanying paper archived on arXiv.
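The released data is a gzipped JSONL file with one record per problem. The sketch below, with an assumed local file path, shows the record fields used by the openai/human-eval harness:

```python
import gzip
import json

def load_tasks(path="HumanEval.jsonl.gz"):  # file path is an assumption
    """Yield one dict per problem from the gzipped JSONL release.

    Each record carries: task_id, prompt (signature + docstring),
    entry_point (name of the function under test), canonical_solution,
    and test (source for a check() function run against completions).
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example usage: list the first few task ids and entry points.
for i, task in enumerate(load_tasks()):
    print(task["task_id"], task["entry_point"])
    if i == 2:
        break
```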
Evaluation with HumanEval runs model-generated code against the provided unit tests and reports pass@k: the probability that at least one of k sampled completions for a problem passes all of its tests. The metric was adapted from earlier program-synthesis work (Kulal et al., 2019) and given an unbiased estimator in OpenAI's Codex paper (Chen et al., 2021). Because candidate code is executed, evaluations are typically run in sandboxed environments, often on cloud platforms such as Amazon Web Services, Google Cloud Platform, or Microsoft Azure, and reporting follows standards seen at conferences like NeurIPS, ICML, ICLR, and EMNLP.
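The unbiased estimator computes, per problem, 1 − C(n−c, k) / C(n, k), where n completions are sampled and c of them pass the tests; the benchmark-level score is the mean over problems. A minimal sketch of the numerically stable form from the Codex paper (the sample counts below are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct.

    Evaluates 1 - C(n-c, k) / C(n, k) as a numerically stable running
    product rather than with factorials.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark-level score: average over problems (counts are illustrative).
correct_counts = [0, 3, 17]  # correct samples per problem, with n=200 each
scores = [pass_at_k(200, c, 10) for c in correct_counts]
print(sum(scores) / len(scores))
```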
Published results using HumanEval appear in papers from OpenAI, Google DeepMind, Meta Platforms, Microsoft Research, and academic groups at MIT, Stanford, and Berkeley, comparing model families including GPT-3, Codex, PaLM, LLaMA, and other Transformer-based designs. Comparative work often pairs HumanEval with CodeXGLUE, MBPP, and APPS, and reports exact-match rates, functional correctness, and pass@k under the sampling regimes used in industrial evaluations such as those of the GitHub Copilot teams.
Researchers at OpenAI, Stanford, MIT, CMU, ETH Zurich, and elsewhere have noted limitations of HumanEval: it covers only Python; its problem wording is synthetic, resembling exercises from sites like LeetCode and GeeksforGeeks; and its unit tests are incomplete specifications, so code can pass every test without meeting the intended behavior. These concerns echo critiques in the ACM literature and at venues such as NeurIPS, ICLR, and SIGPLAN, and bias analyses of dataset curation that emphasize representativeness, license provenance, and language- or culture-specific framing.
HumanEval has influenced model development and evaluation across industry and academia, informing products such as GitHub Copilot (from GitHub and Microsoft) and code-intelligence research at OpenAI, Google, Amazon Web Services, and Meta Platforms. It has been used to evaluate and compare systems in collaborations spanning Stanford, MIT, CMU, and corporate research labs, and HumanEval-derived metrics appear in papers at NeurIPS, ICLR, and ICML and in workshops associated with ACL and EMNLP.
HumanEval was introduced in the 2021 OpenAI paper "Evaluating Large Language Models Trained on Code" (Chen et al.), which also introduced the Codex model. It builds on earlier program-synthesis datasets from Google Research, Microsoft Research, Stanford, and MIT, and its release followed established open-source and preprint practices on GitHub and arXiv. Subsequent community work, including reproducibility studies and critiques, has come from researchers at ETH Zurich, Carnegie Mellon University, Harvard, Yale, and elsewhere.
Category:Datasets