| UCI Machine Learning Repository | |
|---|---|
| Name | UCI Machine Learning Repository |
| Established | 1987 |
| Location | University of California, Irvine |
| Discipline | Machine learning, data mining |
| Type | Dataset archive |

The UCI Machine Learning Repository is a long-standing archive of datasets for empirical research in machine learning and data mining. It was created to support benchmarking and reproducible experimentation for researchers across institutions such as Stanford University, the Massachusetts Institute of Technology, Carnegie Mellon University, Harvard University and Princeton University. The Repository has been used by practitioners affiliated with organizations like Google, Microsoft, Facebook, IBM and Amazon, and it appears in courses taught at the University of Oxford, the University of Cambridge, ETH Zurich, the National University of Singapore and Tsinghua University.
The Repository was initiated during the late 1980s at a research center within the University of California, Irvine, alongside projects at DARPA and the National Science Foundation and collaborations with groups at AT&T Bell Laboratories, IBM Research and Sandia National Laboratories. Early curators drew comparisons to archives like the Columbia University Archive and to databases maintained by Los Alamos National Laboratory and CERN for sharing experimental artifacts. Over time, stewardship involved personnel connected to the Neural Information Processing Systems (NeurIPS) conference, contributors from the Association for Computing Machinery, and reviewers publishing in journals such as the Journal of Machine Learning Research and IEEE Transactions on Pattern Analysis and Machine Intelligence.
The Repository catalogs tabular, time-series, image, text and multi-modal datasets originating from sources such as experiments at Bell Labs, clinical studies at the Mayo Clinic, surveys run by the United States Census Bureau and observational collections from projects at NASA and NOAA. Each dataset is described with metadata that includes attribute types, the presence of missing values, and provenance linking to publications in venues such as Proceedings of the National Academy of Sciences, NeurIPS, ICML, KDD and AAAI. Collections are organized by task label (classification, regression, clustering) and by domain tags referencing studies from the Stanford Linear Accelerator Center, the Sloan Digital Sky Survey, repositories related to the Human Genome Project and epidemiological data tied to work at the Centers for Disease Control and Prevention. Popular entries include datasets that have been analyzed in papers by researchers at the MIT Media Lab, Caltech, the University of Toronto and Columbia University.
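The metadata fields and task labels described above can be pictured as a small catalog record. The following is a minimal sketch only: the class names, field names and the two example records are hypothetical illustrations, not the Repository's actual schema or real entries.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """One column of a dataset (hypothetical structure)."""
    name: str
    kind: str                  # e.g. "continuous", "categorical", "integer"
    has_missing: bool = False  # True if any value in this column is absent


@dataclass
class DatasetRecord:
    """Catalog entry holding the kinds of metadata named in the text."""
    name: str
    tasks: set                 # task labels: "classification", "regression", "clustering"
    domain: str                # free-form domain tag
    attributes: list = field(default_factory=list)
    source_citation: str = ""  # provenance: the publication to cite

    def has_missing_values(self) -> bool:
        # A dataset has missing values if any of its attributes does.
        return any(a.has_missing for a in self.attributes)


def filter_by_task(catalog, task):
    """Return the records whose task labels include `task`."""
    return [record for record in catalog if task in record.tasks]


# Two made-up records, used only to exercise the sketch.
catalog = [
    DatasetRecord(
        name="example-clinical",
        tasks={"classification"},
        domain="health",
        attributes=[
            Attribute("age", "integer"),
            Attribute("blood_pressure", "continuous", has_missing=True),
        ],
    ),
    DatasetRecord(
        name="example-survey",
        tasks={"regression", "clustering"},
        domain="social science",
        attributes=[Attribute("income", "continuous")],
    ),
]

classification_names = [r.name for r in filter_by_task(catalog, "classification")]
with_missing = [r.name for r in catalog if r.has_missing_values()]
```

Here `filter_by_task` mirrors browsing the collection by task label, and `has_missing_values` derives the dataset-level flag from per-attribute metadata rather than storing it twice.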
Access to the Repository is provisioned for academic researchers at institutions such as Yale University, Brown University, Duke University, the University of Michigan and the University of Illinois Urbana-Champaign, as well as industry practitioners from Intel, NVIDIA, Oracle and SAP. Use policies require citation practices that reference the original dataset sources, found in publications hosted by Springer, Elsevier and Wiley-Blackwell and in repositories like arXiv. They also mandate adherence to privacy regulations influenced by legislation such as the Health Insurance Portability and Accountability Act and by jurisdictional rules from bodies like the European Commission. Redistribution and derivative dataset sharing are guided by licenses similar to those published by Creative Commons and by the practices of institutional review boards at Johns Hopkins University and technology transfer offices at Columbia University.
The Repository influenced benchmarking traditions in the communities around NeurIPS, ICML, KDD, SIGMOD and VLDB, and it has been cited in theses from Imperial College London, McGill University, the University of Edinburgh and the Australian National University. It supported reproducibility efforts in studies led by researchers at Google Brain and DeepMind and in academic labs at the University of California, Berkeley, the University of Washington and Peking University. Community contributions include dataset submissions from teams at CERN, field studies organized by the World Health Organization, annotated corpora from British Library initiatives and image collections connected to projects at the Getty Research Institute.
Integrations link the Repository to toolchains and platforms such as scikit-learn, TensorFlow, PyTorch, WEKA and RapidMiner, and to notebook workflows in Project Jupyter, Google Colab and Binder. Programmatic access patterns have been demonstrated in packages maintained by groups at NumFOCUS, and connectors enable import into environments built on Docker containers and Kubernetes and into platforms such as Databricks and Amazon Web Services. Community tooling for dataset curation and versioning references standards advanced by organizations like OpenAI, the Linux Foundation and the Apache Software Foundation.
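As a rough illustration of the programmatic access mentioned above, the sketch below parses a small comma-separated data file using only the Python standard library. It assumes a header-less file in which `?` marks a missing value; that layout, the sample rows, the column names and the helper functions are all assumptions made for the example, and it does not reproduce the API of any specific package.

```python
import csv
import io

# Assumption: "?" marks a missing value in the file.
MISSING_MARKER = "?"

# Made-up sample rows standing in for a downloaded data file.
SAMPLE = """5.1,3.5,a
4.9,?,b
?,3.1,a
"""


def parse_rows(text, column_names, missing=MISSING_MARKER):
    """Parse header-less CSV text into dicts, mapping the marker to None."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        if not raw:  # skip blank lines
            continue
        values = [None if cell.strip() == missing else cell.strip() for cell in raw]
        rows.append(dict(zip(column_names, values)))
    return rows


def count_missing(rows):
    """Count None values per column across all parsed rows."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value is None else 0)
    return counts


rows = parse_rows(SAMPLE, ["x", "y", "label"])
missing_per_column = count_missing(rows)
```

Values are kept as strings here; converting them to numeric types would depend on the attribute types listed in the dataset's metadata.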
Category:Machine learning datasets