| YouTube-8M | |
|---|---|
| Name | YouTube-8M |
| Type | Video dataset |
| Creator | Google Research |
| Released | 2016 |
| Domain | Computer vision, Machine learning |
| Size | ~8 million videos |
| Format | TFRecord files (precomputed features, labels) |
YouTube-8M
YouTube-8M is a large-scale labeled video dataset released by Google Research in 2016 to accelerate research in large-scale video understanding. It provides precomputed audio-visual features and multi-label annotations drawn from public YouTube videos, allowing experiments in video classification and representation learning without requiring researchers to download or process the source videos themselves. The dataset has been used widely in academic and industrial research, and its benchmarks are referenced in work presented at venues including NeurIPS, ICCV, CVPR, ECCV, and ICML.
The dataset comprises publicly available YouTube videos mapped to a taxonomy of entities derived from the Google Knowledge Graph and YouTube metadata. The distributed files include frame-level and video-level feature vectors: visual features extracted with an Inception network pretrained on ImageNet, and audio features from a VGG-style acoustic model. The release was accompanied by baseline code implemented in TensorFlow.
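The features were distributed as TFRecord files. As an illustration, the following is a minimal sketch of reading the video-level records with TensorFlow's `tf.data` API; the field names (`id`, `labels`, `mean_rgb`, `mean_audio`) follow the published video-level schema, though exact names varied slightly across releases, and the file pattern is a placeholder.

```python
# Minimal sketch: parsing YouTube-8M video-level TFRecords with TensorFlow.
# Field names follow the published video-level schema; "train*.tfrecord"
# is a placeholder pattern, not an official path.
import tensorflow as tf

def parse_video_level(serialized):
    features = {
        "id": tf.io.FixedLenFeature([], tf.string),
        "labels": tf.io.VarLenFeature(tf.int64),
        "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
        "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
    }
    example = tf.io.parse_single_example(serialized, features)
    labels = tf.sparse.to_dense(example["labels"])
    return example["id"], example["mean_rgb"], example["mean_audio"], labels

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("train*.tfrecord"))
    .map(parse_video_level, num_parallel_calls=tf.data.AUTOTUNE)
)
for video_id, rgb, audio, labels in dataset.take(1):
    print(video_id.numpy(), rgb.shape, audio.shape, labels.numpy())
```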
The release consists of approximately eight million video identifiers paired with multi-label annotations drawn from roughly 4,800 entity classes in the Google Knowledge Graph, selected to reflect popular content on YouTube. For each video the project provides pre-extracted frame-level visual embeddings, computed with an Inception-v3 network pretrained on ImageNet, and audio embeddings computed with a VGG-style acoustic model. Because the videos are sampled from public uploads worldwide, the data covers many locales, including the United States, India, the United Kingdom, Brazil, and Japan, and its distribution skews toward popular content.
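Frame-level records are stored as `tf.train.SequenceExample` protos whose per-frame features are 8-bit quantized byte strings. Below is a hedged sketch of decoding and dequantizing them, using the [-2, 2] range from the reference starter code; the cap of 300 frames is an illustrative default.

```python
# Minimal sketch: decoding YouTube-8M frame-level SequenceExamples.
# Frame features were quantized to 8 bits in the published release; the
# [-2, 2] dequantization range follows the reference starter code.
import tensorflow as tf

def parse_frame_level(serialized, max_frames=300):
    context, sequences = tf.io.parse_single_sequence_example(
        serialized,
        context_features={
            "id": tf.io.FixedLenFeature([], tf.string),
            "labels": tf.io.VarLenFeature(tf.int64),
        },
        sequence_features={
            "rgb": tf.io.FixedLenSequenceFeature([], tf.string),
            "audio": tf.io.FixedLenSequenceFeature([], tf.string),
        },
    )
    # Each frame is a byte string of uint8 values; decode, then map
    # the quantized [0, 255] range back to floats in [-2, 2].
    rgb = tf.cast(tf.io.decode_raw(sequences["rgb"], tf.uint8), tf.float32)
    rgb = rgb * (4.0 / 255.0) - 2.0
    audio = tf.cast(tf.io.decode_raw(sequences["audio"], tf.uint8), tf.float32)
    audio = audio * (4.0 / 255.0) - 2.0
    labels = tf.sparse.to_dense(context["labels"])
    return context["id"], rgb[:max_frames], audio[:max_frames], labels
```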
Video selection used automated filters and heuristics over metadata signals from uploader channels, titles, and descriptions as indexed by YouTube systems and mapped to entity identifiers in the Knowledge Graph. Annotations came from YouTube's automated video annotation system rather than the exhaustive manual labeling of prior corpora such as ImageNet, COCO, and AVA, with label quality assessed by human raters on samples rather than by reviewing every video. Feature extraction relied on deep convolutional networks trained on ImageNet: visual features were taken from an Inception-v3 network, reduced by PCA to 1,024 dimensions, and quantized for compact distribution.
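As a rough illustration of that post-processing, the sketch below applies PCA with whitening to 1,024 dimensions followed by 8-bit quantization. The PCA basis here is fit on stand-in random activations rather than real Inception features, and the [-2, 2] clipping range is an assumption carried over from the published dequantization constants.

```python
# Sketch of the reported visual-feature post-processing: PCA with
# whitening down to 1024 dimensions, then 8-bit quantization.
# Stand-in activations replace real Inception-v3 pool features.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(10_000, 2048))  # stand-in feature matrix

# Fit PCA: center the data, take the top-1024 principal directions.
mean = activations.mean(axis=0)
centered = activations - mean
u, s, vt = np.linalg.svd(centered, full_matrices=False)
components, scales = vt[:1024], s[:1024]

def project(x):
    """PCA-project and whiten a feature vector, then quantize to 8 bits."""
    z = (x - mean) @ components.T / (scales / np.sqrt(len(activations) - 1))
    z = np.clip(z, -2.0, 2.0)                    # assumed quantization range
    return np.round((z + 2.0) * (255.0 / 4.0)).astype(np.uint8)

q = project(activations[0])
print(q.shape, q.dtype)  # (1024,) uint8
```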
Standard tasks defined for the dataset include multi-label video classification and, with the later YouTube-8M Segments release, temporal localization; public competitions were hosted on Kaggle and paired with workshops at CVPR, ECCV, and ICCV. Baseline evaluations follow community metrics such as Global Average Precision (GAP), mean Average Precision (mAP), Hit@1, and Precision at Equal Recall Rate (PERR). Baseline models compared include logistic regression, mixture-of-experts classifiers, LSTM-based sequence models, and learned temporal aggregation schemes such as deep bag-of-frames and attention-style pooling.
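GAP, the headline competition metric, pools each video's top-k predictions into one global list, sorts it by confidence, and computes average precision over the pooled list. A minimal NumPy sketch follows; `top_k` and the toy inputs are illustrative, and the denominator is a simplification of the official evaluation code.

```python
# Minimal sketch of Global Average Precision (GAP) as used in the
# YouTube-8M benchmark: pool top-k predictions per video, sort globally
# by confidence, and compute average precision over the pooled list.
import numpy as np

def global_average_precision(scores, labels, top_k=20):
    """scores, labels: arrays of shape (num_videos, num_classes)."""
    confidences, hits = [], []
    for s, y in zip(scores, labels):
        top = np.argsort(s)[::-1][:top_k]   # top-k predictions per video
        confidences.extend(s[top])
        hits.extend(y[top])
    order = np.argsort(confidences)[::-1]   # global sort by confidence
    hits = np.asarray(hits, dtype=np.float64)[order]
    precisions = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    # Denominator: total ground-truth positives (simplified).
    return float((precisions * hits).sum() / max(labels.sum(), 1))

# Toy usage: 2 videos, 5 classes.
scores = np.array([[0.9, 0.1, 0.8, 0.2, 0.3], [0.2, 0.7, 0.1, 0.6, 0.4]])
labels = np.array([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]])
print(global_average_precision(scores, labels, top_k=3))
```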
Researchers across industry and academia used the release to study large-scale representation learning, transfer learning, weak supervision, and scalable training systems. Follow-on work incorporated the dataset into studies of self-supervised learning, domain adaptation, and multimodal fusion, and the associated benchmarks are cited in papers presented at venues such as CVPR, ECCV, ICML, NeurIPS, and ICLR. The dataset also informed video-understanding research and production systems inside Google and YouTube.
Limitations noted by researchers include label noise from automated annotation, sampling bias toward popular and English-language content, demographic skew similar to critiques raised of ImageNet and other web-scraped corpora, and the nondeterministic availability of source videos as uploads are deleted or removed through takedowns and copyright claims enforced by systems such as Content ID. Ethical considerations emphasize consent, privacy, and representational harms of the kind examined at the ACM Conference on Fairness, Accountability, and Transparency and by policy groups such as the Electronic Frontier Foundation and AlgorithmWatch. Researchers advocate practices echoed by organizations like the Partnership on AI and Data & Society, including transparency about collection, deprecation notices for removed items, and evaluation protocols that account for biased outcomes.
Category:Video datasets