File Information Tool Set (FITS)

Name	File Information Tool Set
Developer	Harvard Library
Released	2009
Programming language	Java
Operating system	Cross-platform
Genre	Digital preservation, file characterization

Contents

Overview
Features and Components
Architecture and Workflow
File Format Identification and Metadata Extraction
Use Cases and Applications
Development and Community
Licensing and Availability

File Information Tool Set (FITS) is a software package designed to assist cultural heritage institutions, archives, libraries, and museums with file format identification, metadata extraction, and digital preservation workflows. It aggregates multiple open-source tools to produce normalized technical metadata for files, enabling interoperability among repositories, preservation systems, and cataloging platforms. FITS functions as an orchestration layer that integrates well-known utilities to provide a consolidated output useful to practitioners in digital stewardship.

Overview

FITS was originally developed by staff at Harvard Library, the Library of Congress, the University of Virginia, and the Digital Preservation Coalition to address needs identified by initiatives such as the National Digital Information Infrastructure and Preservation Program, the Keepers Registry, and the Open Preservation Foundation. The project drew on expertise from institutions such as the Smithsonian Institution, the British Library, the New York Public Library, the Bibliothèque nationale de France, and the California Digital Library. FITS became part of a broader ecosystem that includes format registries like PRONOM and tools such as DROID, ExifTool, Apache Tika, and JHOVE, enabling cultural heritage organizations such as the Wellcome Library, Stanford University Libraries, and Library and Archives Canada to standardize technical metadata extraction.

Features and Components

FITS provides a modular framework that wraps multiple characterization tools, including DROID from The National Archives, JHOVE (originally developed by JSTOR and Harvard University Library), ExifTool by Phil Harvey, Apache Tika from the Apache Software Foundation, and MediaInfo, which is used by broadcasting archives and the Internet Archive. The package normalizes each tool's results into a single XML output format that repository platforms like Fedora Commons, DSpace, Islandora, and Archivematica can consume. FITS also supports scripting and integration with workflow engines used by projects at institutions such as the Digital Public Library of America, Europeana, and the Council on Library and Information Resources.
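As an illustration of how a downstream system might consume the normalized XML, the following Python sketch parses a small FITS-style document. The sample document is hand-written and heavily simplified, the namespace URI and element names are assumptions that should be checked against the output of the FITS version in use, and `summarize` is a hypothetical helper rather than part of FITS.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for FITS output documents; verify against your FITS version.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Hand-written, simplified sample; real output carries many more elements.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="Portable Network Graphics" mimetype="image/png">
      <tool toolname="Droid"/>
      <tool toolname="Exiftool"/>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="OIS File Information">1024</size>
  </fileinfo>
</fits>"""


def summarize(xml_text):
    """Return the identified formats, the tools that agreed, and the file size."""
    root = ET.fromstring(xml_text)
    identities = []
    for identity in root.findall("fits:identification/fits:identity", FITS_NS):
        identities.append({
            "format": identity.get("format"),
            "mimetype": identity.get("mimetype"),
            # Each <tool> child names a tool that reported this identity.
            "tools": [t.get("toolname") for t in identity.findall("fits:tool", FITS_NS)],
        })
    size = root.findtext("fits:fileinfo/fits:size", namespaces=FITS_NS)
    return {"identities": identities, "size": int(size) if size else None}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A repository ingest script could call a helper like this on each output file and store the result alongside the object record.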

Architecture and Workflow

FITS employs a pipeline architecture: it calls external signature-based and heuristic analyzers, consolidates their outputs, and maps tool-specific metadata fields to a common representation. The orchestration design mirrors patterns used in preservation systems developed by the National Digital Stewardship Alliance and the Open Preservation Foundation, and in software stacks from entities like Adobe Systems and Microsoft Research that influence file format handling. FITS configurations enable point-and-click and command-line operation on platforms ranging from Debian and Red Hat distributions to macOS and Windows Server installations, and FITS can be embedded in containerized deployments that use Docker or Kubernetes in institutional IT environments.
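The consolidation step can be sketched in Python as follows. This is a minimal illustration of the general pattern, not the FITS implementation: the three tool functions, their field names, the mapping tables, and the "OK"/"CONFLICT" labels are all hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict


# Hypothetical stand-ins for wrapped tools; each returns its own field names.
def tool_a(path):
    return {"FileType": "PNG", "ImageWidth": 640}


def tool_b(path):
    return {"format_name": "PNG", "width_px": 640, "height_px": 480}


def tool_c(path):
    return {"FileType": "JPEG"}


TOOLS = {"tool_a": tool_a, "tool_b": tool_b, "tool_c": tool_c}

# Illustrative per-tool mapping from native field names to a common vocabulary.
FIELD_MAPS = {
    "tool_a": {"FileType": "format", "ImageWidth": "imageWidth"},
    "tool_b": {"format_name": "format", "width_px": "imageWidth",
               "height_px": "imageHeight"},
    "tool_c": {"FileType": "format"},
}


def characterize(path, tools=TOOLS, field_maps=FIELD_MAPS):
    """Run every tool, map its fields to common names, and flag disagreements."""
    # reported[common_field][value] -> list of tools that reported that value
    reported = defaultdict(lambda: defaultdict(list))
    for name, run in tools.items():
        for native_field, value in run(path).items():
            common = field_maps[name].get(native_field)
            if common is not None:  # unmapped fields are dropped in this sketch
                reported[common][value].append(name)

    consolidated = {}
    for field, values in reported.items():
        consolidated[field] = {
            "values": {value: sorted(names) for value, names in values.items()},
            "status": "CONFLICT" if len(values) > 1 else "OK",
        }
    return consolidated


if __name__ == "__main__":
    for field, info in characterize("example.png").items():
        print(field, info)
```

Recording which tools reported each value, rather than picking a single winner, leaves the decision about disagreements to the person or system reading the consolidated record.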

File Format Identification and Metadata Extraction

FITS delegates format identification to tools that rely on registries and signature files, such as PRONOM via DROID; to magic number libraries like libmagic, used by Unix-like systems; and to format validation modules like JHOVE for TIFF, PDF/A, and WAV. For image files, FITS leverages tools similar to ImageMagick and ExifTool to extract EXIF, IPTC, and XMP metadata. For audio and video, it integrates MediaInfo and FFmpeg-style utilities to capture codec, bitrate, and container metadata consistent with standards promoted by the Moving Picture Experts Group and the International Federation of Film Archives. The normalized FITS output maps extracted fields to XML elements that can be transformed into PREMIS, METS, MODS, and Dublin Core metadata records for ingestion into preservation repositories used by institutions such as the Museum of Modern Art (MoMA), the Getty Research Institute, and national archives.
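Signature-based identification of the kind performed by the wrapped tools can be illustrated with a short Python sketch. It checks a handful of widely documented leading-byte signatures; the table, the function names, and the MIME type strings are choices made for this example, and registries such as PRONOM describe formats in far more detail (offsets, versions, priorities).

```python
# A few well-known leading-byte signatures, for illustration only.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
]


def identify(data):
    """Guess a MIME type from the first bytes of a file, or return None."""
    for magic, mimetype in SIGNATURES:
        if data.startswith(magic):
            return mimetype
    # RIFF containers store a form type at bytes 8-11; "WAVE" marks WAV audio.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/x-wav"
    return None


def identify_file(path, sniff=16):
    """Read only the first `sniff` bytes of the file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(sniff))


if __name__ == "__main__":
    print(identify(b"%PDF-1.7\n"))
```

A match on leading bytes says nothing about whether the rest of the file is well formed, which is why identification is paired with validation modules.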

Use Cases and Applications

Institutions apply FITS in accessioning, appraisal, format risk assessment, and migration planning workflows employed by archives like the National Archives and Records Administration, the Public Record Office, and municipal archives. Libraries and museums use FITS to generate technical surrogates for digital exhibits at institutions such as the Smithsonian American Art Museum, the Victoria and Albert Museum, and the Museum of Modern Art. FITS is also used in scholarly communication infrastructures, such as institutional repositories at Harvard University, Yale University, and the University of Oxford, to support research data management policies from funders like the National Science Foundation and the Wellcome Trust.

Development and Community

The FITS project evolved through contributions from developers affiliated with Harvard Library, Johns Hopkins University, and the Library of Congress, and from open-source contributors connected to foundations such as the Apache Software Foundation and the Open Preservation Foundation. Community activity has taken place on code hosting platforms, with participants from universities, cultural heritage networks like the Digital Preservation Coalition, and bodies that maintain standards, including the International Organization for Standardization and the Library of Congress. Documentation, issue tracking, and community discussions have been influenced by practices from projects like Archivematica, BitCurator, and the OpenRefine community.

Licensing and Availability

FITS has been distributed under an open-source license that supports adoption by academic, government, and private-sector institutions and allows redistribution and embedding within proprietary and public-domain preservation systems, including those used by enterprises such as IBM and Hewlett Packard Enterprise and by community projects supported by the Andrew W. Mellon Foundation. The software and its source code have historically been made available through code hosting platforms such as GitHub and SourceForge for download and local deployment by archivists and technologists.
