LLMpedia: The first transparent, open encyclopedia generated by LLMs

SymbolicData.org

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: ICERM Hop 4
Expansion Funnel: Raw 146 → Dedup 0 → NER 0 → Enqueued 0
SymbolicData.org
Name: SymbolicData.org
Type: Research dataset repository
Established: 2002
Founders: Marcus Speh, Michael Bronstein, Thomas R. Tuytelaars
Headquarters: ETH Zurich

SymbolicData.org is a curated repository for datasets and benchmarks in computer vision and pattern recognition used by researchers in machine learning, robotics, and computational geometry. The project supports reproducible evaluation and sharing among practitioners linked to institutions such as ETH Zurich, Max Planck Society, University of Oxford, Massachusetts Institute of Technology, and University of California, Berkeley. It has been cited in work associated with conferences like CVPR, ICCV, ECCV, NeurIPS, and ICML and interacts with projects at laboratories including Stanford Artificial Intelligence Laboratory, Google Research, Facebook AI Research, Microsoft Research, and Amazon Web Services.

Overview

SymbolicData.org provides standardized datasets, metadata schemas, and benchmark protocols that enable comparative evaluation of algorithms developed in research communities funded by bodies such as the European Research Council, NSF, DARPA, ERC Starting Grants, and Marie Skłodowska-Curie Actions. The repository emphasizes interoperability with toolchains from OpenCV, TensorFlow, PyTorch, scikit-learn, and MATLAB while aligning with data standards promoted by IEEE, ISO, W3C, ACM, and NIST. It is referenced in journal venues such as IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, Pattern Recognition Letters, Journal of Machine Learning Research, and Nature Machine Intelligence.
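The emphasis on interoperable metadata schemas can be illustrated with a hedged sketch: a minimal dataset record expressed as JSON-LD using the public Schema.org `Dataset` vocabulary. The field names follow Schema.org; the concrete values and the `validate_record` helper are hypothetical illustrations, not part of any published SymbolicData.org API.

```python
import json

# Minimal JSON-LD record using the public Schema.org "Dataset" type.
# The concrete values below are illustrative, not taken from SymbolicData.org.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "example-benchmark-v1",
    "description": "Annotated image collection for comparative evaluation.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["computer vision", "benchmark"],
}

def validate_record(rec):
    """Hypothetical helper: report fields a repository might require but that are absent."""
    required = {"@context", "@type", "name", "license"}
    return sorted(required - rec.keys())

print(json.dumps(record, indent=2))
print("missing fields:", validate_record(record))
```

A record in this shape can be embedded in a dataset landing page, which is how vocabularies like Schema.org are typically consumed by indexing services.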

History and Development

SymbolicData.org originated from collaborations among researchers at ETH Zurich, University of Bonn, University of Graz, and Queen Mary University of London, with early influence from datasets like Caltech 101, the MNIST dataset, PASCAL VOC, ImageNet, and the COCO dataset. Development milestones were presented at workshops hosted by IJCAI, ECCV Workshops, CVPR Workshops, and ACCV, with funding from agencies including the Swiss National Science Foundation, DFG, the European Commission, and Horizon 2020. Successive versions integrated contributions coordinated at institutes such as Imperial College London, University of Cambridge, University of Edinburgh, and University of Toronto, shaped by community feedback at Google Summer of Code, NeurIPS Workshops, SIGGRAPH, and ISMAR.

Data Content and Structure

Datasets in the repository include annotated image collections, 3D models, symbolically annotated object classes, and canonical benchmarks frequently used in studies referencing SIFT, SURF, ORB, Harris corner detector, and RANSAC. Metadata conforms to schemas interoperable with resources like PASCAL Visual Object Classes, Open Images Dataset, ShapeNet, ModelNet, and KITTI dataset and links to standards advocated by Dublin Core, Schema.org, PROV, and FAIR Principles. Content types cover categories familiar from corpora such as Caltech-256, Oxford Buildings Dataset, SUN Database, ADE20K, and Cityscapes, while annotations reference ontologies curated by WordNet, BabelNet, Wikidata, DBpedia, and YAGO.
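The linkage between symbolically annotated object classes and external ontologies described above can be sketched as a small data structure. The schema below is a hypothetical illustration, not the repository's actual format; the WordNet offset `n02084071` and Wikidata identifier `Q144` are the real public identifiers for "dog" in those resources.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ObjectClassAnnotation:
    """Hypothetical annotation record linking a dataset label to external ontologies."""
    label: str
    wordnet_synset: str                 # WordNet offset, e.g. "n02084071" ("dog")
    wikidata_id: Optional[str] = None   # Wikidata QID, e.g. "Q144" ("dog")
    bounding_boxes: list = field(default_factory=list)  # [x, y, w, h] per instance

# One annotated instance of the "dog" class in a hypothetical image.
ann = ObjectClassAnnotation(label="dog", wordnet_synset="n02084071", wikidata_id="Q144")
ann.bounding_boxes.append([12, 30, 64, 48])
print(ann.label, ann.wordnet_synset, len(ann.bounding_boxes))
```

Anchoring labels to stable identifiers in WordNet or Wikidata is what lets annotations from different corpora (e.g. ImageNet-style synset labels and Cityscapes-style class names) be reconciled mechanically.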

Tools and Services

The platform supplies tools for dataset conversion, benchmarking scripts, and evaluation metrics compatible with ecosystems like Caffe, Theano, MXNet, Keras, and ONNX. It integrates with continuous integration providers such as Jenkins, GitLab CI/CD, Travis CI, CircleCI, and Azure DevOps and supports version control workflows via GitHub, GitLab, Bitbucket, Zenodo, and Figshare. Visualization and annotation utilities draw upon projects like LabelImg, COCO Annotator, Meshlab, Blender, and ParaView, while reproducibility efforts align with initiatives like ReproZip, Binder, Docker, Singularity, and Anaconda.
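As an example of the kind of evaluation metric such benchmarking scripts compute, the sketch below implements intersection-over-union (IoU), the standard overlap criterion used by benchmarks such as PASCAL VOC and COCO. This is a standalone illustration, not code from the SymbolicData.org toolchain.

```python
def iou(box_a, box_b):
    """Intersection-over-union for axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extents along each axis (zero if the boxes are disjoint).
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes offset by 5 along x: intersection 50, union 150, IoU 1/3.
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
```

Detection benchmarks typically count a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5, so a shared, exact IoU implementation is a prerequisite for comparable scores.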

Community and Governance

Governance is managed by an editorial committee of academics from ETH Zurich, TU Graz, Technical University of Munich, University of Bologna, and Politecnico di Milano. Contributions are tracked via issue systems hosted on GitHub and coordinated through workshops at CVPR, ECCV, ICCV, and NeurIPS, as well as community meetings at ICLR and ECAI. The project accepts dataset contributions, pull requests, and benchmark proposals from researchers at Princeton University, Columbia University, Yale University, Cornell University, and Johns Hopkins University under licensing aligned with Creative Commons, the MIT License, the Apache License, and the GPL. Outreach and training have been conducted in collaboration with summer schools like MLSS, programs by DeepLearning.AI, and hackathons organized by the Open Data Institute and Kaggle.

Applications and Impact

Datasets and benchmarks have been used in applications spanning object recognition, robotics grasping, augmented reality, and autonomous driving in studies affiliated with Boston Dynamics, Waymo, Tesla Autopilot, NVIDIA Research, and ABB Robotics. The resource influenced methodologies cited in patents filed at USPTO, evaluations in industry reports by McKinsey & Company, Gartner, and Forrester Research, and standards discussions at IEEE Standards Association and ISO/IEC JTC 1. Its datasets underpin academic theses at Harvard University, Princeton University, University of Michigan, University of California, Los Angeles, and Duke University and continue to be referenced in projects funded by Wellcome Trust, Bill & Melinda Gates Foundation, and Chan Zuckerberg Initiative.

Category:Computer vision datasets