| ROUGE | |
|---|---|
| Developer | Chin-Yew Lin (Information Sciences Institute, University of Southern California) |
| Released | 2004 |
| Latest release version | ROUGE-1.5.5 (reference Perl implementation) |
| Programming language | Perl; Python; Java |
| Operating system | Linux; macOS; Microsoft Windows |
| Genre | Automatic evaluation metric |
ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of automatic evaluation metrics for assessing the quality of summaries and generated text by comparing system outputs to human-written reference texts. It has been used extensively in research venues including ACL, NAACL, EMNLP, NIPS, and ICML, and has influenced evaluation practices at institutions such as Stanford University, the Massachusetts Institute of Technology, Carnegie Mellon University, the University of Edinburgh, and Tsinghua University.
ROUGE was introduced to provide repeatable, quantitative comparisons for tasks in document summarization, machine translation, question answering, text generation, and automatic speech recognition. Early adopters included researchers from Columbia University, the University of Washington, the University of Pennsylvania, New York University, and the University of California, Berkeley, and it has been applied in shared tasks organized by the Document Understanding Conference (DUC), the Text Analysis Conference (TAC), SemEval, CLEF, and CoNLL. ROUGE is commonly used alongside systems and toolkits such as Moses, OpenNMT, TensorFlow, PyTorch, spaCy, and NLTK.
ROUGE originated in work by Chin-Yew Lin at the Information Sciences Institute of the University of Southern California, in the context of evaluation campaigns such as DUC and TAC. Early implementations were discussed alongside efforts at IBM Research and Microsoft Research, with comparisons to BLEU, the machine translation metric developed at IBM Research. Development evolved through collaborations among teams at the University of Toronto, Johns Hopkins University, SRI International, the University of Oxford, University College London, the Max Planck Institute for Informatics, and the Chinese Academy of Sciences. Successive workshops at ACL, NAACL, EMNLP, and IJCAI shaped the metric's adoption, and follow-up research from Facebook AI Research, DeepMind, OpenAI, and Baidu Research expanded its use in evaluating neural text generation.
ROUGE compares automatic outputs to reference summaries using overlapping units such as n-grams, the longest common subsequence, and skip-bigrams. The core measures were formalized in Chin-Yew Lin's 2004 paper "ROUGE: A Package for Automatic Evaluation of Summaries" and were benchmarked against baselines from Duke University, the University of Maryland, the University of Michigan, the Georgia Institute of Technology, and Purdue University. Implementations typically compute precision, recall, and F-measure and are integrated into pipelines built on libraries such as scikit-learn, pandas, NumPy, SciPy, and Matplotlib. Specific variants include n-gram overlap measures (ROUGE-N), longest common subsequence (ROUGE-L), and skip-bigram co-occurrence (ROUGE-S), each contrasted in evaluations at TAC and DUC and cited alongside work from Peter F. Brown and groups at AT&T Labs and Microsoft Research Asia.
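The following minimal sketch illustrates how ROUGE-N and ROUGE-L scores are computed. It assumes whitespace tokenization, a single reference, and a balanced F1; the official ROUGE-1.5.5 Perl toolkit additionally supports stemming, stopword removal, multiple references, and a recall-weighted F-measure.

```python
# Minimal ROUGE-1 and ROUGE-L sketch (single reference, whitespace tokens,
# balanced F1). Not the official ROUGE-1.5.5 implementation.
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: precision, recall, and F1 from clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_ngrams, ref_ngrams = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
    precision = overlap / max(sum(cand_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rouge_l(candidate, reference):
    """ROUGE-L: precision, recall, and F1 from the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming LCS table over the two token sequences.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if c == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    precision, recall = lcs / max(len(cand), 1), lcs / max(len(ref), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: five of six unigrams overlap, and the LCS "the cat on the mat"
# has length 5, so both scores are 5/6.
print(rouge_n("the cat sat on the mat", "the cat was on the mat", n=1))
print(rouge_l("the cat sat on the mat", "the cat was on the mat"))
```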
Researchers have extended ROUGE to address lexical, semantic, and syntactic gaps using resources such as WordNet, FrameNet, and PropBank, embedding models such as word2vec from Google Research, and contextual models such as BERT from Google AI Language. Scoring adaptations incorporate semantic similarity methods built on GloVe from Stanford University, paraphrase resources from Microsoft Research and the University of Illinois Urbana-Champaign, and alignment techniques used by groups at Carnegie Mellon University and Johns Hopkins University. Hybrid metrics combine ROUGE with proposals from the METEOR authors, TER maintainers, and neural evaluation frameworks advanced by Facebook AI Research, OpenAI, DeepMind, and Salesforce Research, as sketched below.
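As a hypothetical illustration of such a hybrid, the sketch below interpolates a ROUGE-1-style unigram F1 with the cosine similarity of averaged word vectors. The toy EMBEDDINGS table stands in for pretrained word2vec or GloVe vectors, and the `hybrid_score` function and its 0.5 mixing weight are illustrative assumptions, not a published metric.

```python
# Hypothetical hybrid metric: unigram-overlap F1 blended with embedding
# cosine similarity. EMBEDDINGS is a toy stand-in for pretrained vectors.
import math
from collections import Counter

EMBEDDINGS = {
    "car": [0.9, 0.1], "automobile": [0.88, 0.12],
    "stopped": [0.2, 0.8], "halted": [0.22, 0.78], "the": [0.5, 0.5],
}

def unigram_f1(cand, ref):
    """ROUGE-1-style balanced F1 over token lists."""
    overlap = sum((Counter(cand) & Counter(ref)).values())
    p, r = overlap / max(len(cand), 1), overlap / max(len(ref), 1)
    return 2 * p * r / (p + r) if p + r else 0.0

def avg_vector(tokens):
    """Average the vectors of known tokens (bag-of-embeddings)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hybrid_score(candidate, reference, alpha=0.5):
    """Interpolate lexical overlap with semantic similarity."""
    cand, ref = candidate.split(), reference.split()
    return alpha * unigram_f1(cand, ref) + (1 - alpha) * cosine(avg_vector(cand), avg_vector(ref))

# "the automobile halted" paraphrases "the car stopped": unigram F1 is only
# 1/3, but high embedding similarity lifts the hybrid score to about 0.67.
print(hybrid_score("the automobile halted", "the car stopped"))
```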
ROUGE has been applied to benchmark summarization systems from research teams at Google Research, Facebook AI Research, OpenAI, Microsoft Research, and Amazon Web Services, and in industrial projects at Reuters, the BBC, The New York Times, The Guardian, and Bloomberg. It remains central to academic competitions hosted by DUC, TAC, SemEval, the Workshop on Machine Translation (WMT), and TREC. Its limitations have prompted comparisons with human evaluation protocols, including recommendations from the International Organization for Standardization (ISO), and critiques from scholars at Cornell University, Princeton University, Yale University, and Brown University. A known weakness is sensitivity to surface-form overlap, which motivated semantic-aware extensions with contributions from the Stanford NLP Group, the Allen Institute for AI, MIT CSAIL, and Berkeley NLP.
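The surface-form weakness follows directly from the definitions above: an illustrative candidate such as "profits rose sharply" shares no unigrams with the reference "earnings increased markedly", so ROUGE-1 precision, recall, and F-measure are all zero despite near-identical meaning, while a candidate that copies reference words without preserving their meaning can still score well.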
Category:Evaluation metrics