ARC (software) — LLMpedia

Name	ARC
Developer	Hyperion Solutions; Broad Institute; OpenAI
Released	2008
Latest release version	3.2
Programming language	C++, Python, JavaScript
Operating system	Windows, macOS, Linux
License	Proprietary, open-source components

Contents

History
Features
Architecture and Design
Development and Community
Use Cases and Adoption
Licensing and Availability

ARC is a modular application platform for automating data integration, analysis, and workflow orchestration in bioinformatics, enterprise analytics, and scientific computing. It combines scheduling, plugin extensibility, and a graphical interface to connect heterogeneous tools and datasets, including those from projects associated with institutions such as the Broad Institute and the Massachusetts Institute of Technology and with corporations such as Microsoft Corporation and IBM. ARC has been used alongside Apache Software Foundation projects, NumPy, and TensorFlow in pipelines that span clinical research, genomics, and cloud computing.

History

ARC emerged in the late 2000s amid a proliferation of pipeline managers and workflow systems influenced by initiatives at Lawrence Berkeley National Laboratory and the European Bioinformatics Institute. Early work drew on concepts from the Human Genome Project era and on collaborations involving contributors at the Broad Institute and the Howard Hughes Medical Institute. Later development added integrations with cloud services from Amazon Web Services, Google Cloud Platform, and Microsoft Azure. Community-led forks and extensions paralleled the growth of Apache Airflow and Nextflow as reproducible research and continuous integration practices matured.

Features

ARC provides a visual workflow designer, a job scheduler, and a plugin system for orchestrating tasks that invoke tools such as BLAST, BWA, and GATK, as well as custom scripts written in Python, R, or Java. It supports data provenance tracking in line with the FAIR principles and integrates authentication through protocols such as OAuth and LDAP. Monitoring integrates with Prometheus and Grafana, while artifact storage interoperates with object stores such as Amazon S3 and Google Cloud Storage. Enterprise features include role-based access control, used by organizations such as the National Institutes of Health and the European Molecular Biology Laboratory.
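As a rough illustration of how a plugin system of this kind can expose interchangeable tools to a workflow, the following Python sketch registers callables under names and dispatches to them. It does not reproduce ARC's interfaces: `PLUGINS`, `register`, `run_step`, and the two example plugins are hypothetical names chosen for this example, and the plugin bodies are stand-ins that only manipulate file names rather than calling real tools.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a plugin name to a callable (not ARC's API).
PLUGINS: Dict[str, Callable[..., str]] = {}


def register(name: str):
    """Decorator that records a function in the registry under `name`."""
    def decorator(func):
        PLUGINS[name] = func
        return func
    return decorator


@register("align")
def align(sample: str) -> str:
    # Stand-in for invoking an external aligner; returns an output file name.
    return f"{sample}.bam"


@register("call_variants")
def call_variants(bam: str) -> str:
    # Stand-in for invoking a variant caller on the aligned reads.
    return bam.replace(".bam", ".vcf")


def run_step(name: str, *args):
    """Look up a plugin by name and run it, failing clearly if it is unknown."""
    if name not in PLUGINS:
        raise KeyError(f"no plugin registered under {name!r}")
    return PLUGINS[name](*args)
```

With this layout a workflow step only needs a plugin name and its arguments, for example `run_step("call_variants", run_step("align", "sample1"))`, so tools can be added or swapped without changing the code that runs the steps.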

Architecture and Design

ARC's architecture typically separates a controller, worker agents, and a persistence layer, a pattern reminiscent of distributed systems pioneered at Google and discussed in literature from ACM conferences. The controller implements a directed acyclic graph (DAG) executor, in which each task runs only after the tasks it depends on have completed; the design is influenced by Apache Airflow and by academic descriptions from Stanford University research groups. Worker agents run on nodes managed by orchestration engines such as Kubernetes or by resource managers such as the Slurm Workload Manager. Persistent metadata is stored in relational databases such as PostgreSQL or MySQL, while object artifacts are kept in storage systems of the kind used by services such as Dropbox and Box. Security models reference guidelines from the National Institute of Standards and Technology and compliance regimes followed by laboratories regulated by the Food and Drug Administration.
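The idea behind a DAG executor can be sketched in a few lines of Python. This is a minimal, single-process illustration built on the standard library's `graphlib` module (Python 3.9 or later), not ARC's controller: the function names `execution_order` and `run_workflow`, the task names, and the example pipeline are all assumptions made for this example, and a real controller would dispatch tasks to worker agents rather than call them in-process.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return task names in an order that respects the dependencies.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, actions):
    """Run each task's action once all of its prerequisites have finished."""
    results = {}
    for task in execution_order(dependencies):
        # Pass the outputs of the direct prerequisites to the task.
        upstream = {dep: results[dep] for dep in dependencies.get(task, ())}
        results[task] = actions[task](upstream)
    return results


# Hypothetical three-step pipeline: align -> call_variants -> report
deps = {
    "align": set(),
    "call_variants": {"align"},
    "report": {"call_variants"},
}
actions = {
    "align": lambda up: "reads.bam",
    "call_variants": lambda up: up["align"].replace(".bam", ".vcf"),
    "report": lambda up: f"summary of {up['call_variants']}",
}
```

Calling `run_workflow(deps, actions)` runs the three tasks in dependency order and returns a dictionary of their results. If the dependency graph contains a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which is the property that makes the "acyclic" requirement enforceable before any task runs.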

Development and Community

Development of ARC has involved contributors from academic labs, commercial vendors, and open-source projects associated with foundations such as the Apache Software Foundation and the Linux Foundation. Mailing lists, issue trackers, and continuous integration pipelines have been hosted on platforms similar to GitHub and GitLab. Conferences and workshops where ARC implementations have been presented include Bioinformatics Open Days, sessions at EMBL-EBI meetings, and tracks at NeurIPS and ISMB where workflow reproducibility is discussed. User communities include bioinformatics cores at institutions such as the Wellcome Trust Sanger Institute and data engineering teams at companies such as Intel Corporation and NVIDIA.

Use Cases and Adoption

ARC has been adopted for high-throughput sequencing pipelines at facilities such as the Broad Institute cores, for clinical genomics services connected to the Centers for Disease Control and Prevention, and for environmental sequencing studies coordinated by groups such as the United Nations Environment Programme. Other use cases include machine learning model training orchestration for teams at OpenAI and financial analytics prototypes at firms associated with Goldman Sachs and JPMorgan Chase. Integrations with laboratory information management systems used by Thermo Fisher Scientific and Illumina illustrate cross-vendor workflows, while collaborations with cloud providers facilitate scalable deployments for projects with the European Space Agency and national supercomputing centers.

Licensing and Availability

ARC distributions have historically combined proprietary components, community-maintained open-source modules, and permissively licensed connectors under terms modeled on the Apache License and the MIT License. Commercial editions offered enterprise support, subscription services, and certified integrations with systems from Red Hat and Oracle Corporation. Community editions and third-party forks have been available under open-source terms through repositories on platforms resembling GitHub, while commercially packaged installers have been distributed through channels such as the Microsoft Store and through enterprise procurement at organizations such as Siemens.

Category:Workflow management systems