| COCO Captions | |
|---|---|
| Name | COCO Captions |
| Type | Dataset |
| Subject | Image captioning |
| Released | 2014 |
| Creators | Microsoft Research, Visual Geometry Group |
| Languages | English |
COCO Captions is a widely used image-captioning dataset developed in 2014 for machine learning and computer vision research, enabling supervised training and evaluation of caption-generation models. The dataset has been cited in publications from institutions such as Microsoft Research and used in challenges associated with venues such as CVPR, ECCV, and NeurIPS. Researchers from groups including the Visual Geometry Group at the University of Oxford and teams at Facebook AI Research and Google Research have adopted it for benchmarking captioning, detection, and multimodal representation tasks.
COCO Captions originated as an extension of the Common Objects in Context (COCO) project and provides human-authored English captions for natural images to support tasks in computer vision, natural language processing, and multimodal learning. The dataset sits alongside resources such as ImageNet, PASCAL VOC, and the Open Images Dataset, and is closely related to captioning datasets such as Flickr30k and Visual Genome. COCO Captions has been integrated into leaderboards and workshops at conferences including ICCV, ICLR, and AAAI and has been used by teams from Stanford University, MIT, Carnegie Mellon University, and ETH Zurich.
The dataset comprises more than 120,000 color photographs drawn from everyday scenes, annotated with multiple (typically five) independent English captions per image; it complements the object instance annotations used in the broader COCO effort, which were produced by contributors coordinated through Amazon Mechanical Turk. Images depict settings familiar from datasets like the SUN Database and Places, with annotations capturing objects and interactions comparable to entries in Visual Genome and Flickr8k. The split into training, validation, and test sets follows conventions adopted in challenges organized by Microsoft Research and benchmarked by research groups at the University of California, Berkeley and the University of Toronto.
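As a rough illustration of how the caption annotations are typically accessed, the sketch below loads a 2014-split annotation file with the pycocotools COCO API and prints the captions attached to one image. The annotation path is an assumption about where a local copy of the dataset lives, not part of the dataset itself.

```python
# Minimal sketch: reading COCO Captions annotations with the pycocotools COCO API.
# The annotation path below is an assumption; point it at the
# captions_train2014.json file from your local COCO download.
from pycocotools.coco import COCO

ann_file = "annotations/captions_train2014.json"  # assumed local path
coco_caps = COCO(ann_file)

# Pick an arbitrary image id from the training split.
img_id = coco_caps.getImgIds()[0]
img_info = coco_caps.loadImgs(img_id)[0]
print("Image file:", img_info["file_name"])

# Each image carries several independently written captions.
ann_ids = coco_caps.getAnnIds(imgIds=img_id)
for ann in coco_caps.loadAnns(ann_ids):
    print("-", ann["caption"])
```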
Annotations were collected from crowd workers on Amazon Mechanical Turk, using instruction protocols and quality-control methodologies comparable to those used by teams at Labelbox and academic labs such as Princeton University and the University of Washington. Each image received multiple independent captions to capture variability akin to human descriptions in corpora created at the Stanford NLP Group and Johns Hopkins University. Quality assurance measures echo evaluation schemes from DARPA-funded projects and reproducibility standards advocated by researchers at the University of Pennsylvania and Harvard University.
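As a loose illustration of the kind of automated sanity check that complements such quality-control protocols (not the pipeline actually used to build COCO Captions), the sketch below flags captions that are suspiciously short or duplicated for a given image, assuming the standard COCO Captions JSON layout and a local annotation file.

```python
# Minimal sketch of simple caption sanity checks, assuming the standard COCO
# Captions JSON layout ({"annotations": [{"image_id": ..., "caption": ...}]}).
# This is an illustrative filter, not the quality-control process used when
# the dataset was collected.
import json
from collections import defaultdict

with open("annotations/captions_val2014.json") as f:  # assumed local path
    anns = json.load(f)["annotations"]

captions_by_image = defaultdict(list)
for ann in anns:
    captions_by_image[ann["image_id"]].append(ann["caption"].strip().lower())

for image_id, caps in captions_by_image.items():
    too_short = [c for c in caps if len(c.split()) < 5]
    duplicates = len(caps) - len(set(caps))
    if too_short or duplicates:
        print(image_id, "short:", len(too_short), "duplicate:", duplicates)
```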
Performance on the dataset is typically measured with automatic metrics such as BLEU, METEOR, ROUGE, CIDEr, and SPICE, all used in benchmark comparisons at conferences such as EMNLP and ACL. Leaderboards often report scores achieved by models including variants of Show and Tell, Show, Attend and Tell, transformer-based architectures from Google Research and OpenAI, and multimodal models from DeepMind. Comparative evaluations reference baselines established in papers from institutions such as the University of Montreal, Tsinghua University, and Peking University.
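As a simplified sketch of how such reference-based metrics work, the example below scores a generated caption against several reference captions using NLTK's corpus-level BLEU. The captions are invented placeholders, and real COCO Captions evaluations normally use the official coco-caption toolkit or evaluation server, which also reports METEOR, ROUGE-L, CIDEr, and SPICE.

```python
# Minimal sketch of reference-based caption scoring with BLEU (via NLTK).
# The candidate and reference captions are made-up placeholders, not actual
# COCO annotations; leaderboard numbers come from the coco-caption code.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[
    "a man riding a wave on top of a surfboard".split(),
    "a surfer rides a large wave in the ocean".split(),
    "a person on a surfboard riding an ocean wave".split(),
]]
candidates = ["a man is surfing on a big wave".split()]

smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(references, candidates,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.3f}")
```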
COCO Captions has been used to train and evaluate image captioning models, to pretrain multimodal encoders deployed in systems by Facebook AI Research and Google Research, and as a component in research on visual question answering pursued at Microsoft Research Asia and the Allen Institute for AI. Applied work leveraging the dataset appears in prototypes developed by teams at NVIDIA and Adobe Research for accessibility tools and automated image-description services used by organizations such as the National Federation of the Blind and initiatives connected to the World Wide Web Consortium. It also serves as a benchmark for transfer learning in multimodal models produced by labs at the University of Cambridge and Imperial College London.
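A common way of feeding the dataset into captioning or multimodal training pipelines is torchvision's CocoCaptions wrapper. The sketch below, with assumed local paths for the images and annotations, yields (image, captions) pairs ready for batching; it is one convenient entry point, not the only way the dataset is consumed.

```python
# Minimal sketch: wrapping COCO Captions as a PyTorch dataset via torchvision.
# The image root and annotation file paths are assumptions about a local
# download of the 2014 training split.
import torchvision.datasets as dset
import torchvision.transforms as transforms

coco_train = dset.CocoCaptions(
    root="train2014",                               # assumed image directory
    annFile="annotations/captions_train2014.json",  # assumed annotation file
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]),
)

print("Number of images:", len(coco_train))
image, captions = coco_train[0]   # captions is a list of strings for this image
print(image.shape, captions[:2])
```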
Analyses of COCO Captions identify limitations similar to those found in datasets studied by Joy Buolamwini and Timnit Gebru: representation skew, annotator bias, and cultural bias in caption content. Studies by research groups at the MIT Media Lab, the University of Oxford, and Cornell University highlight issues such as underrepresentation of demographic diversity and a tendency to encode stereotypical roles, echoing findings reported for datasets like ImageNet and the Open Images Dataset. These limitations affect downstream models developed by teams at Google DeepMind, Facebook AI Research, and Microsoft Research and motivate dataset-auditing practices promoted by organizations such as the Partnership on AI and initiatives such as the Data Nutrition Project.
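As a very rough sketch of the descriptive counting such audits often start from (assuming the standard captions JSON layout and a hand-picked word list, not reproducing any specific published methodology), the example below tallies how often a few gendered terms appear across the captions.

```python
# Minimal sketch of a descriptive caption audit: counting a few gendered terms
# across the annotation file. The word lists and file path are illustrative
# assumptions, not a reproduction of any particular audit study.
import json
from collections import Counter

with open("annotations/captions_train2014.json") as f:  # assumed local path
    anns = json.load(f)["annotations"]

male_terms = {"man", "men", "boy", "boys", "he", "his"}
female_terms = {"woman", "women", "girl", "girls", "she", "her"}

counts = Counter()
for ann in anns:
    tokens = ann["caption"].lower().split()
    counts["male"] += sum(t in male_terms for t in tokens)
    counts["female"] += sum(t in female_terms for t in tokens)

print(counts)  # a large skew is one crude signal of representation imbalance
```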
The dataset was released under terms that facilitate academic and noncommercial research use, similar to licenses applied to datasets from Stanford University and Berkeley Artificial Intelligence Research (BAIR), and is distributed via channels used by communities around GitHub, Kaggle, and institutional repositories maintained by Microsoft Research. Access policies have been updated in response to community concerns about privacy and misuse, following precedents set by datasets discussed in the Harvard Data Science Review and guidance from regulatory discussions involving the European Commission and the U.S. National Institutes of Health.
Category:Datasets