LLMpedia: The first transparent, open encyclopedia generated by LLMs

StringTie

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
StringTie
Name: StringTie
Developer: Pertea Lab
Released: 2015
Latest release: 2.2.2
Programming language: C++
Operating system: Linux, macOS
License: GNU General Public License

StringTie is a transcriptome assembly and quantification tool for RNA sequencing (RNA-seq) data that assembles and quantifies full-length transcripts from short-read alignments. It combines read alignments with optional reference annotation to reconstruct splice variants, estimate transcript abundances, and produce output compatible with downstream differential-expression and visualization tools. Originally introduced to improve the precision and sensitivity of transcript reconstruction, StringTie has been adopted in workflows described in journals such as Genome Research and Nature Methods and in the pipelines of large consortia such as the ENCODE Project.

Background

StringTie was developed by researchers in the Pertea Lab to address limitations in transcript assembly evident in earlier tools such as Cufflinks, Trinity, and Oases. It arose alongside improvements in high-throughput sequencing instruments from companies like Illumina and the growing scope of projects such as The Cancer Genome Atlas and the 1000 Genomes Project. The tool takes its input from splice-aware aligners such as HISAT2, TopHat2, and STAR, and it can use annotation sets from GENCODE, RefSeq, and Ensembl as a guide. The work was supported by funders including the National Institutes of Health, and related methodological advances were published in journals such as Bioinformatics.

Algorithm and features

StringTie implements a network flow algorithm to model read coverage and splice junction support, inspired by graph-based formulations used in assemblers such as Velvet and by the maximum flow problem from computer science. It constructs a splice graph for each locus from the input alignments, then applies a flow-optimization step to identify a small set of transcripts that explains the observed fragments and to estimate their coverage, similar in principle to techniques used in isoform sequencing analyses. Key features include reference-guided assembly using annotation from RefSeq or GENCODE, a multi-sample merge capability, fragment length and bias correction, and support for stranded and unstranded libraries. Output is written in GTF format, with abundance estimates reported as transcripts per million (TPM) and fragments per kilobase of transcript per million mapped reads (FPKM); these results can be passed to statistical packages such as DESeq2, edgeR, and Ballgown. The C++ implementation emphasizes memory efficiency and parallel processing, which eases integration with workflow managers such as Snakemake and Nextflow.
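The maximum-flow idea mentioned above can be illustrated with a minimal sketch. This is not StringTie's implementation: it is a generic Edmonds-Karp routine applied to a hypothetical toy splice graph, where the node names (`exon1`, `exon2`, `exon3`) and the edge capacities standing in for read support are assumptions made up for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph."""
    # Build residual capacities, adding zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount of flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy splice graph: capacities stand in for read support.
# exon2 is a cassette exon that the exon1 -> exon3 junction skips.
toy_splice_graph = {
    "source": {"exon1": 30},
    "exon1": {"exon2": 20, "exon3": 10},
    "exon2": {"exon3": 20},
    "exon3": {"sink": 25},
}

if __name__ == "__main__":
    print(max_flow(toy_splice_graph, "source", "sink"))  # prints 25
```

In this toy graph the flow is limited by the 25 units of support leaving `exon3`, even though 30 units enter at `exon1`. A real assembler works on far richer graphs and read-to-path constraints than this sketch models.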

Usage and workflows

Typical StringTie workflows begin with read alignment using a splice-aware tool such as HISAT2 or STAR, followed by transcript assembly for each sample. The per-sample assemblies can then be combined with StringTie's merge mode (stringtie --merge), which produces a single unified annotation. The merged annotation can be used for guided quantification across experimental cohorts from projects like GTEx or clinical studies using The Cancer Genome Atlas. Users often combine StringTie output with quality-control and visualization tools such as the Integrative Genomics Viewer (IGV) and the UCSC Genome Browser, and with expression-analysis pipelines that use limma and edgeR. Containerization with Docker, or reproducible environments built with Conda, eases deployment on high-performance computing clusters managed by schedulers such as SLURM or PBS Professional.
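The align, assemble, merge, and quantify steps described above can be sketched as a small Python helper that only builds the command lines and does not run them. The function name `build_commands`, the file naming scheme, and the `ballgown/` output layout are assumptions for illustration; paired-end reads and the presence of `hisat2`, `samtools`, and `stringtie` on the PATH are also assumed.

```python
import shlex


def build_commands(samples, index, annotation, threads=4):
    """Return command lines for an align, assemble, merge, quantify run.

    samples maps a sample name to a (reads_1, reads_2) pair of FASTQ paths.
    """
    commands = []
    assembled = []
    for name, (reads_1, reads_2) in samples.items():
        sam = f"{name}.sam"
        bam = f"{name}.sorted.bam"
        gtf = f"{name}.gtf"
        # Align with HISAT2; --dta tailors alignments for transcript assemblers.
        commands.append(["hisat2", "--dta", "-p", str(threads), "-x", index,
                         "-1", reads_1, "-2", reads_2, "-S", sam])
        # StringTie expects coordinate-sorted alignments.
        commands.append(["samtools", "sort", "-@", str(threads), "-o", bam, sam])
        # Reference-guided assembly for this sample.
        commands.append(["stringtie", bam, "-G", annotation,
                         "-p", str(threads), "-o", gtf])
        assembled.append(gtf)

    # Merge the per-sample assemblies into one unified annotation.
    commands.append(["stringtie", "--merge", "-G", annotation,
                     "-o", "merged.gtf", *assembled])

    for name in samples:
        # -e limits estimation to the transcripts given with -G;
        # -B writes Ballgown table files next to the output GTF.
        commands.append(["stringtie", f"{name}.sorted.bam", "-e", "-B",
                         "-G", "merged.gtf",
                         "-o", f"ballgown/{name}/{name}.gtf"])
    return commands


if __name__ == "__main__":
    example = {"s1": ("s1_1.fq.gz", "s1_2.fq.gz"),
               "s2": ("s2_1.fq.gz", "s2_2.fq.gz")}
    for cmd in build_commands(example, "genome_index", "annotation.gtf"):
        print(shlex.join(cmd))
```

Returning argument lists rather than shell strings lets each command be passed directly to `subprocess.run` or written out as a rule in a workflow manager.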

Performance and benchmarking

Benchmarks comparing StringTie with assemblers such as Cufflinks, Scallop, and Trinity report trade-offs between precision and recall in reconstructing known isoforms, with StringTie often achieving higher specificity while maintaining competitive sensitivity. Evaluations have drawn on publications and community challenges such as the RNA-seq Assembly (RAMP), synthetic spike-in experiments with ERCC spike-in control mixes, and benchmarking datasets from SEQC/MAQC-III. These studies report that StringTie's abundance estimates agree well with qPCR measurements and with long-read sequencing from Pacific Biosciences and Oxford Nanopore Technologies. Performance scales with read depth and library complexity. Computational profiling shows favorable memory usage relative to some graph-based assemblers and efficient multithreading on multicore servers from vendors like Intel and AMD.
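How sensitivity and precision are scored in such benchmarks can be sketched as follows. This minimal example assumes a common convention, matching multi-exon transcripts by their exact intron chain on the same chromosome and strand; it is not the scoring code of any particular benchmark, and the function names and toy coordinates are hypothetical.

```python
def intron_chain(exons):
    """Return the introns implied by a transcript's exon intervals."""
    exons = sorted(exons)
    return tuple((left_end, right_start)
                 for (_, left_end), (right_start, _) in zip(exons, exons[1:]))


def chain_set(transcripts):
    """Collect (chrom, strand, intron chain) keys, skipping single-exon models."""
    keys = set()
    for chrom, strand, exons in transcripts:
        chain = intron_chain(exons)
        if chain:  # single-exon transcripts have no introns to match
            keys.add((chrom, strand, chain))
    return keys


def sensitivity_precision(reference, assembled):
    """Score an assembly against a reference at the intron-chain level."""
    ref = chain_set(reference)
    asm = chain_set(assembled)
    matched = ref & asm
    sensitivity = len(matched) / len(ref) if ref else 0.0
    precision = len(matched) / len(asm) if asm else 0.0
    return sensitivity, precision


# Hypothetical toy data: (chromosome, strand, [(exon_start, exon_end), ...]).
reference = [
    ("chr1", "+", [(100, 200), (300, 400), (500, 600)]),
    ("chr1", "+", [(100, 200), (500, 600)]),
]
assembled = [
    ("chr1", "+", [(150, 200), (300, 400), (500, 650)]),  # same introns, different ends
    ("chr1", "+", [(100, 200), (350, 400), (500, 600)]),  # one shifted splice site
    ("chr1", "+", [(900, 1000)]),                         # single exon, ignored
]

if __name__ == "__main__":
    print(sensitivity_precision(reference, assembled))  # prints (0.5, 0.5)
```

Matching on intron chains tolerates differences in transcript start and end positions, which short-read assemblies rarely recover exactly, while still penalizing any splice site that is off by even one base.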

Applications and case studies

StringTie has been applied to diverse biological questions: isoform discovery in cancer transcriptomes from The Cancer Genome Atlas, allele-specific expression studies in human population cohorts linked to the 1000 Genomes Project, developmental transcriptomics in model organisms such as Mus musculus and Drosophila melanogaster, and host–pathogen interaction profiling in studies involving Mycobacterium tuberculosis. Case studies reported in journals such as Nature Communications and Genome Biology demonstrate its utility in identifying novel splice variants associated with clinical phenotypes. Integrative analyses that combine StringTie-derived assemblies with long-read validation from Pacific Biosciences or Oxford Nanopore Technologies have refined gene models submitted to GENCODE and RefSeq.

Development and availability

StringTie development is led by the Pertea Lab, with contributions from collaborators at institutions such as Johns Hopkins University. Source code and releases are distributed as open-source software under the GNU General Public License; the code is hosted on GitHub, packaged through Bioconda, and available on community platforms such as Galaxy for accessible genomics workflows. Training materials and workshops are presented at conferences including ISMB and RECOMB, and the tool is widely cited in literature indexed by PubMed Central.

Category:Bioinformatics software