Flame Graphs

Name	Flame Graphs
Invented by	Brendan Gregg
First appeared	2011
Domain	Performance analysis, profiling, observability
Implemented in	Linux, macOS, FreeBSD, Windows
Common tools	perf, DTrace, eBPF, SystemTap

Contents

Overview
History and Development
Construction and Interpretation
Implementations and Tools
Use Cases and Applications
Limitations and Criticisms
Related Visualization Techniques

Flame graphs are a visualization technique for sampled stack traces, designed to reveal performance hotspots in software systems. Each stack frame is drawn as a box, callees are stacked on top of their callers, and the width of a box reflects how often that frame appeared in the samples. They were introduced to help engineers inspect profiling data from environments such as Linux, macOS, and FreeBSD, and have been adopted in observability stacks used alongside projects like Prometheus, Grafana, and Kubernetes. Flame graphs are routinely discussed at conferences such as USENIX, Velocity Conference, and Linux Plumbers Conference and taught in courses at institutions like MIT and Stanford University.

Overview

Flame graphs present aggregated stack traces. The y-axis represents call depth, with each box sitting on top of its caller, and the width of a box is proportional to the number of samples in which that frame appears at that position in the stack. The x-axis does not represent the passage of time: sibling frames are typically sorted alphabetically so that identical frames merge into a single box. This layout enables comparison across functions like those in glibc, OpenSSL, or nginx. Each box denotes a function, usually identified by a symbol emitted by toolchains such as GCC or Clang, and can be annotated with source line numbers in code bases like the Linux kernel, Apache HTTP Server, and PostgreSQL. Visualization libraries including D3.js and integrations with platforms like Grafana and the Elastic Stack render flame graphs for workloads running on cloud providers such as AWS, Google Cloud Platform, and Microsoft Azure.
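
The relationship between sample counts and box widths can be illustrated with a minimal Python sketch. The sample data and the helper name `box_widths` are hypothetical, and the input is assumed to be a mapping from semicolon-separated stacks (root first) to sample counts.

```python
from collections import Counter

def box_widths(folded):
    """Map each call path (tuple of frames, root first) to its sample count.

    Every prefix of a sampled stack is one box in the graph, so a sample
    contributes its count to each of its prefixes.
    """
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            widths[tuple(frames[:depth])] += count
    return widths

# Hypothetical profile: "root;caller;callee" -> number of samples.
folded = {
    "main;parse;read": 30,
    "main;parse;tokenize": 50,
    "main;render": 20,
}

widths = box_widths(folded)
total = sum(folded.values())

# Print boxes grouped by depth and sorted alphabetically within a depth.
for path in sorted(widths, key=lambda p: (len(p), p)):
    share = 100 * widths[path] / total
    print(f"depth {len(path) - 1}: {path[-1]:<10} {share:.0f}% of samples")
```

In this toy profile the box for `parse` spans the combined width of `read` and `tokenize`, which is why wide boxes low in the graph indicate expensive call paths rather than expensive functions.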

History and Development

Brendan Gregg created flame graphs in 2011, drawing on earlier work on performance tools at Sun Microsystems, and continued to develop and promote them later at Netflix. The technique grew from earlier ideas in stack profiling used at organizations like Oracle Corporation and research groups at University of California, Berkeley and Carnegie Mellon University, and was influenced by projects such as gprof, oprofile, and DTrace from Sun Microsystems. Early adopters included teams at Facebook, Twitter, and Instagram, who integrated flame graphs into continuous profiling initiatives.

Construction and Interpretation

Construction begins by collecting samples via profilers such as perf, DTrace, SystemTap, or eBPF-based tools like those developed by Brendan Gregg and contributors from Netflix. Addresses are resolved to function names using symbol tables and DWARF debug information generated by GCC or Clang, with the help of utilities such as those in Binutils. The stack traces are then aggregated into hierarchical nodes, akin to the structures used in Callgrind output from Valgrind: identical stacks are merged and their sample counts summed. Interpreting a flame graph involves scanning for wide boxes, which indicate hotspots, and tracing the stack beneath them to understand the calling context in applications like Redis, MySQL, and the Java Virtual Machine. A wide box at the top of a stack marks code that was itself running when samples were taken, whereas a wide box lower down marks a caller whose descendants account for the time. Colors are usually drawn from a warm palette and often carry no meaning of their own, although some tools use hue to distinguish code types such as kernel, user, and runtime frames.
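
The aggregation step can be sketched as a small prefix tree built from raw samples. This is an illustrative sketch, not the code of any particular tool; the `Node` class, the function names, and the sample data are hypothetical.

```python
from collections import Counter

class Node:
    """One box in the graph: a frame reached through a specific call path."""
    def __init__(self, name):
        self.name = name
        self.total = 0        # samples whose stack passes through this node
        self.self_count = 0   # samples where this frame was on top of the stack
        self.children = {}

def build_tree(samples):
    """Merge root-first stacks into a tree, summing counts on shared prefixes."""
    root = Node("all")
    for stack in samples:
        root.total += 1
        node = root
        for frame in stack:
            node = node.children.setdefault(frame, Node(frame))
            node.total += 1
        node.self_count += 1
    return root

def self_counts(node, acc=None):
    """Sum on-top-of-stack samples per function name across all call paths."""
    acc = Counter() if acc is None else acc
    if node.self_count:
        acc[node.name] += node.self_count
    for child in node.children.values():
        self_counts(child, acc)
    return acc

# Hypothetical samples, each a root-first list of frame names.
samples = [
    ["main", "handle", "parse"],
    ["main", "handle", "parse"],
    ["main", "handle", "send"],
    ["main", "idle"],
]

tree = build_tree(samples)
hot = self_counts(tree).most_common(1)
print(hot)
```

Here `total` corresponds to the width of a box and `self_count` to the part of that width not covered by any child, which is the portion visible as a plateau at the top of the graph.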

Implementations and Tools

Open-source implementations include the scripts originally published by Brendan Gregg, along with viewers maintained by other contributors, hosted on platforms like GitHub and GitLab. Flame graphs are also integrated into products from observability vendors such as Datadog, New Relic, and Splunk, and into continuous profiling suites like Pyroscope and Parca. System-level tools whose output can be turned into flame graphs include perf on Linux, DTrace on Solaris and macOS, and eBPF-based tools built on frameworks such as bcc and libbpf. Language-specific profilers (gperftools for C++, JVM Tool Interface-based profilers for Java, py-spy for Python) often emit formats that can be converted to flame graphs, commonly a "folded" text format with one line per unique stack: semicolon-separated frames followed by a sample count.
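
As a rough illustration of format conversion, the following Python sketch parses folded-stack text (semicolon-separated frames, then a space and a count) into a nested dictionary. The `name`/`value`/`children` layout is an assumption chosen because hierarchical viewers commonly accept something similar; the exact schema expected by any given viewer should be checked against its documentation. The function names and input text are hypothetical.

```python
import json

def parse_folded(text):
    """Parse lines like 'main;foo;bar 12' into (frames, count) pairs."""
    stacks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace only: frame names may contain spaces.
        stack, count = line.rsplit(None, 1)
        stacks.append((stack.split(";"), int(count)))
    return stacks

def to_nested(stacks, root_name="root"):
    """Merge stacks into a nested {name, value, children} hierarchy."""
    root = {"name": root_name, "value": 0, "children": []}
    for frames, count in stacks:
        root["value"] += count
        node = root
        for frame in frames:
            for child in node["children"]:
                if child["name"] == frame:
                    break
            else:
                # No existing child with this name: create one.
                child = {"name": frame, "value": 0, "children": []}
                node["children"].append(child)
            child["value"] += count
            node = child
    return root

FOLDED = """\
main;a;b 3
main;a 2
main;c 5
"""

nested = to_nested(parse_folded(FOLDED))
print(json.dumps(nested, indent=2))
```

In this layout `value` is the inclusive count for a node, so a parent's value is at least the sum of its children's values, with any difference being samples taken in the parent itself.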

Use Cases and Applications

Flame graphs are used in performance tuning of web servers such as nginx and Apache HTTP Server, database engines including PostgreSQL and MongoDB, and runtime systems like the JVM and Node.js. They support incident response workflows at companies like Netflix, Airbnb, and Dropbox by rapidly highlighting regressions introduced in commits tracked with GitHub, GitLab CI/CD, or Jenkins. In cloud-native environments orchestrated by Kubernetes and monitored with Prometheus and Grafana, continuous profiling with flame graphs enables capacity planning and cost optimization for infrastructure provided by AWS, GCP, and Azure.
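
One simple way to spot a regression between two builds is to compare how the share of samples per function changes between profiles. The sketch below is a hypothetical illustration of that idea rather than the method of any specific tool: it assumes both profiles are folded-stack mappings, and the names `leaf_shares` and `share_deltas` are invented for the example.

```python
from collections import Counter

def leaf_shares(folded):
    """Fraction of samples in which each function was on top of the stack."""
    total = sum(folded.values())
    shares = Counter()
    for stack, count in folded.items():
        shares[stack.split(";")[-1]] += count / total
    return shares

def share_deltas(before, after):
    """Per-function change in on-top share, largest increase first."""
    b, a = leaf_shares(before), leaf_shares(after)
    deltas = {name: a.get(name, 0.0) - b.get(name, 0.0)
              for name in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical profiles taken before and after a commit.
before = {"main;parse": 50, "main;send": 50}
after = {"main;parse": 80, "main;send": 20}

deltas = share_deltas(before, after)
for name, delta in deltas:
    print(f"{name}: {delta:+.0%}")
```

Normalizing to shares rather than raw counts keeps the comparison meaningful when the two profiles contain different numbers of samples, although it cannot by itself show whether total CPU use went up or down.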

Limitations and Criticisms

Critics note that flame graphs depend on sampling granularity and symbol resolution quality, issues familiar from tools like gprof and Valgrind. With optimized builds from GCC or Clang, or with the stripped binaries common in distributions like Debian and Red Hat Enterprise Linux, boxes can be hard to interpret because frames are inlined, elided, or shown only as unresolved addresses, echoing concerns raised in academic venues including ACM and IEEE conferences. There are also challenges integrating flame graphs with distributed tracing systems such as Zipkin and Jaeger, where correlating end-to-end latency with CPU hotspots requires tying profiling data to traces captured by services like Envoy and Istio.
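
The effect of sampling granularity can be made concrete with a back-of-the-envelope estimate. The sketch below assumes samples are independent and uses the normal approximation to the binomial distribution, which is a simplification; the function name `share_interval` and the numbers are illustrative only.

```python
import math

def share_interval(hits, total, z=1.96):
    """Approximate 95% interval for a function's true share of samples.

    Uses the normal (Wald) approximation, which is rough for very small
    counts or shares close to 0 or 1.
    """
    p = hits / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)

# A function seen in 5% of samples, at three hypothetical profile sizes.
for total in (100, 1_000, 10_000):
    low, high = share_interval(total // 20, total)
    print(f"{total:>6} samples: {low:.3f} to {high:.3f}")
```

Under these assumptions the uncertainty shrinks with the square root of the sample count, so narrow boxes in a short profile are much less trustworthy than wide ones.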

Related Visualization Techniques

Related techniques include variants of the flame graph itself as well as complementary tools: stacked area charts used in Tableau and Power BI, call graphs produced by Callgrind and visualized with KCachegrind, and timeline-based tracers like Chromium Tracing and Perfetto. Distributed tracing systems such as OpenTelemetry, Zipkin, and Jaeger offer a different perspective by instrumenting RPC boundaries, while heatmaps and Sankey diagrams in platforms like Grafana and Kibana provide alternative aggregations for operational data.
