gflops and GFLOPS: The Definitive Guide to Floating‑Point Performance


In the world of modern computing, the term gflops—often written GFLOPS in technical documentation—serves as a shorthand for a system’s capability to perform floating‑point operations per second. While it sounds simple, the practical meaning of gflops is nuanced. This long, thorough guide unpacks what GFLOPS measure, how to compare numbers across CPUs, GPUs and accelerators, and how to translate raw figures into real‑world performance gains. Whether you are sizing a new workstation, benchmarking a data centre, or planning an AI project, understanding GFLOPS helps you make informed decisions.

What does gflops actually measure? GFLOPS explained

GFLOPS stands for giga floating‑point operations per second. In other words, it is a rate: the number of floating‑point additions, multiplications, and fused multiply–adds a processor can execute every second. A single FLOP is one floating‑point operation; a GFLOP is one billion (10^9) floating‑point operations. When a vendor or benchmark quotes GFLOPS, the figure may refer either to peak theoretical capability or to observed, sustained performance under real workloads, so it is worth checking which is meant.

Peak versus sustained performance

Peak GFLOPS describe the absolute maximum operations a device could perform under ideal conditions. Sustained GFLOPS reflect real performance over a representative workload, which is typically lower due to memory bandwidth limitations, cache misses, branching, and other overheads. For meaningful comparisons, it is essential to distinguish the two: a device with a high peak GFLOPS rating is not always the fastest in practice if its memory system or software stack cannot keep the cores fed with data.

GFLOPS, FLOPS and related units

Beyond GFLOPS, you will encounter TFLOPS (teraFLOPS, 10^12), PFLOPS (petaFLOPS, 10^15) and variations such as FP32, FP64, FP16, and BF16. These refer to floating‑point precisions used during calculations. The choice of precision affects both the achievable GFLOPS and the accuracy of results. In AI workloads, FP16 and BF16 are common, often delivering higher GFLOPS due to simpler arithmetic and wider SIMD utilisation, while scientific computing frequently relies on FP64 for numerical stability.

How GFLOPS are calculated and interpreted

In its simplest form, GFLOPS can be estimated with the formula:

GFLOPS ≈ (Number of floating‑point operations) / (Elapsed time in seconds)

To obtain a meaningful figure, you must count the exact number of operations the workload performs. A matrix multiply of two n×n matrices, for example, performs roughly 2n^3 floating‑point operations (n^3 multiplications and n^3 additions). If you run this operation in a given time window, you can compute the achieved GFLOPS. The challenge is that actual operation counts depend on the algorithm, data layout, and hardware microarchitecture.
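As a concrete illustration of the formula, here is a minimal sketch using the convention above, which counts each multiply and each add as one operation (a counter that treats a fused multiply–add as a single operation would report half this figure):

```python
def matmul_gflops(n, elapsed_seconds):
    """Achieved GFLOPS for an n x n matrix multiply, counting each
    multiply and each add separately (roughly 2*n**3 operations)."""
    flops = 2 * n ** 3
    return flops / elapsed_seconds / 1e9

# A 4096 x 4096 multiply finishing in 2.0 s corresponds to
# 2 * 4096**3 = 137,438,953,472 operations -> ~68.7 GFLOPS.
print(round(matmul_gflops(4096, 2.0), 1))
```

The same arithmetic works for any kernel whose operation count you can write down; the hard part, as noted above, is knowing that count precisely.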

Counting operations in real workloads

Different libraries and benchmarks may count operations differently. Some count every multiply and add separately; others count fused multiply–adds as a single operation. When you compare GFLOPS numbers, ensure you understand the counting convention. Consistency matters, especially when benchmarking across CPUs, GPUs, and specialised accelerators. If you cannot access the exact operation count, rely on standard benchmarks whose counting methods are documented and widely accepted.

GFLOPS and precision: FP32, FP64, FP16, and beyond

The precision used in a calculation changes both performance and output accuracy. High‑precision calculations (FP64, or double precision) require more storage and typically use fewer arithmetic units per cycle, reducing the GFLOPS that can be achieved for a given workload. Lower precision (FP16, BF16, or INT8/INT4 in certain neural networks) can unlock higher GFLOPS by enabling wider vector units and more parallelism, albeit with trade‑offs in numerical precision.

Single precision versus double precision

FP32 (single precision) is the workhorse for many workloads. It offers a good balance of speed and accuracy and is supported broadly across CPUs and GPUs. FP64 (double precision) is essential for numerical simulations where tiny errors can propagate and accumulate. The GFLOPS achievable in FP64 mode is typically lower than FP32 because more bits per value require more resources. When benchmarking, always specify the precision used, as a device’s FP32 GFLOPS can be several times higher than its FP64 GFLOPS.

Emerging formats: FP16, BF16 and tensor cores

For machine learning and AI inference, FP16 and BF16 formats are common. They reduce memory bandwidth requirements and increase throughput. Many modern GPUs include tensor cores that accelerate certain mixed‑precision operations, delivering dramatically higher GFLOPS for eligible workloads. If your software can exploit these units, your observed GFLOPS may soar, but only for specific patterns of computation such as large matrix multiplications typical of neural networks.

Where GFLOPS live: CPUs, GPUs and specialised accelerators

GFLOPS are not tied to a single kind of hardware; they are a way to quantify the performance potential of a system’s arithmetic units. The architecture determines how that potential is realised. Here is a snapshot of the typical landscape in 2020s technology:

CPUs

Central Processing Units (CPUs) deliver a balanced mix of single‑thread performance, multi‑thread scalability, memory bandwidth, and latency. Modern CPUs boast wide SIMD units (such as AVX‑512 in some Intel chips or AVX2 in others) and substantial cache hierarchies. Their peak GFLOPS are often impressive thanks to high clock speeds and broad instruction sets, but sustained performance depends heavily on memory access patterns and vectorisation efficiency.

GPUs

Graphics Processing Units (GPUs) unite thousands of lightweight cores designed for massive data parallelism. They are particularly strong for high‑throughput linear algebra, such as dense matrix multiplications and convolutions used in AI and scientific computing. The GFLOPS rating on a GPU often dwarfs CPUs for the same precision, primarily due to large SIMD widths, high memory bandwidth, and specialised units like tensor cores on modern architectures.

Specialised accelerators

Beyond general CPUs and GPUs, there are accelerators tailored for specific tasks—FPGA implementations, AI inference cards, and ASICs designed for deep learning workloads. These devices may deliver extraordinary GFLOPS for narrow workloads but can require effort to programme efficiently. When evaluating GFLOPS, consider how well the workload matches the hardware’s strengths and the availability of mature software stacks.

Benchmarking GFLOPS: practical approaches

To assess GFLOPS meaningfully, you should combine synthetic benchmarks with real‑world tests. Synthetic benchmarks can reveal theoretical capabilities and raw throughput, while real workloads expose end‑to‑end performance, including memory access, data transfer times, and software efficiency.

Classic benchmarks: LINPACK and HPCG

LINPACK is a longstanding benchmark that solves a system of linear equations and reports the resulting GFLOPS. HPCG (High Performance Conjugate Gradient) complements LINPACK by emphasising memory bandwidth and the interaction between computation and memory. Both are useful in HPC contexts, but they measure different aspects of system performance, so report them together to avoid misinterpretation.

AI and ML benchmarks

For AI workloads, benchmarks such as matrix multiplications of large sizes, as well as full training and inference pipelines, provide a realistic view of GFLOPS potential. Frameworks that leverage GPU accelerators—like CUDA libraries for matrix multiplies and convolutions—help you estimate achievable GFLOPS in practice. When comparing devices for ML, ensure the benchmark uses comparable precision and data shapes to avoid skewed conclusions.

Microbenchmarks and workloads you can run

If you want a quick, repeatable sanity check, perform a large matrix multiply in FP32 and measure the time to completion, repeating with FP16 on hardware that supports mixed precision. For CPUs, try a dense linear algebra kernel from your favourite linear algebra library, noting the time and reported throughput. For GPUs, use a standard CUDA or ROCm routine with attention to memory coalescing and kernel occupancy. Document the precision, problem size, and software stack to enable apples‑to‑apples comparisons.

The Roofline model: a framework for interpreting GFLOPS

The Roofline model provides a visual and conceptual way to understand the limits on performance. It plots attainable GFLOPS against arithmetic intensity (the ratio of computations to memory traffic). The model defines two primary ceilings: peak compute performance and memory bandwidth. When your workload’s arithmetic intensity lies to the left of the bandwidth ceiling, memory bandwidth is the bottleneck; to the right of the compute ceiling, the compute units are the limiting factor. The goal is to move operations higher up the “roof” by increasing arithmetic intensity or improving data locality and vector utilisation. This framework helps decision‑makers interpret GFLOPS figures with nuance rather than treating them as standalone numbers.

How to increase GFLOPS in practice

If you want higher observed GFLOPS for your workload, you can pursue several avenues. The best approach depends on the bottleneck in your system, but common strategies include optimising algorithms, improving data locality, and selecting hardware with architectures that fit your problem class.

Software optimisations

  • Improve data layout and memory access patterns to maximise cache hits and reduce memory bandwidth pressure.
  • Vectorise hot loops to exploit SIMD units and achieve wider data paths per instruction.
  • Leverage fused multiply–add (FMA) instructions where available to reduce instruction counts and improve throughput.
  • Choose libraries and frameworks that are optimised for your hardware and precision requirements.
  • Profile and tune kernels using profiling tools to identify stalls, such as cache misses or branch mispredictions.

Hardware choices

  • Match problem size to device memory and compute capabilities; for large problems, GPUs with high memory bandwidth can outperform CPUs by wide margins.
  • Consider precision needs: if your application tolerates lower precision, FP16 or BF16 can unlock higher GFLOPS due to wider vector units and faster arithmetic.
  • Explore accelerators with specialised units (tensor cores, matrix engines) that accelerate the most common operations in your workload.
  • Assess energy efficiency alongside raw GFLOPS; sometimes a device with lower peak GFLOPS but better efficiency delivers better performance per watt in real workloads.

GFLOPS in real projects: case studies and practical guidance

To illustrate how GFLOPS translate into tangible outcomes, consider a few representative scenarios drawn from high‑performance computing and data science. These examples highlight the importance of context when interpreting GFLOPS numbers and remind us that “more is not always better” without considering memory, precision, and software efficiency.

Case study: dense matrix multiplication on a GPU

For a large n×n matrix multiply, the theoretical operation count is 2n^3. On a modern GPU with FP32 capability, you may achieve several tens of TFLOPS in sustained performance for well‑tuned kernels. The actual performance depends on matrix size, device memory bandwidth, kernel occupancy, and how well the code uses shared memory and registers. A well‑tuned kernel can approach a significant fraction of the theoretical GFLOPS, while a naïve implementation may linger well below it.

Case study: scientific simulation on a CPU cluster

In a CPU‑based simulation that relies on FP64 arithmetic, the observed GFLOPS is often constrained by memory bandwidth and latency. CPUs with large caches and careful vectorisation can deliver strong sustained GFLOPS, but the gap to GPU performance for large linear algebra workloads is non‑trivial unless the problem scales well across many cores and the data stays local. This underscores the importance of holistic benchmarking rather than relying solely on peak GFLOPS figures.

Interpreting GFLOPS: what matters for your decision

When comparing devices, it is not enough to look at a single GFLOPS figure. You should consider:

  • Precision needs: FP32 versus FP64; AI workloads often thrive on lower precision and higher throughput.
  • Arithmetic intensity: workloads with high computational demands relative to data movement tend to benefit more from faster compute units.
  • Memory bandwidth and latency: data must reach the compute units quickly for sustained performance.
  • Software ecosystem: availability of optimised libraries, compilers, and profiling tools can dramatically influence real‑world GFLOPS.
  • Scale and parallelism: how well the problem splits across cores or accelerators, and the efficiency of interconnects in a cluster.

Glossary: key terms you’ll encounter with GFLOPS

  • GFLOPS: giga floating‑point operations per second; a measure of throughput for floating‑point compute.
  • FLOP: floating‑point operation; a single addition or multiplication (or fused multiply–add in some contexts).
  • FMA: fused multiply–add; a single instruction that performs a multiplication and an addition in one operation.
  • FP32/FP64/FP16/BF16: floating‑point precisions—single, double, half, and bfloat16, respectively.
  • Roofline model: a framework for visualising the limits imposed by compute and memory bandwidth on achievable GFLOPS.
  • Arithmetic intensity: the ratio of total floating‑point operations to bytes moved from memory; a measure of computation to memory balance.

Practical tips for reporting and comparing GFLOPS

To communicate GFLOPS clearly and fairly, consider the following best practices:

  • State the precision used (e.g., FP32, FP64, FP16) and the operation‑counting convention behind the GFLOPS figure.
  • Specify the workload type (synthetic kernel, real‑world application, neural network layer, etc.).
  • Indicate the platform details: CPU/GPU model, memory configuration, software stack versions, and compiler optimisations.
  • When possible, provide both peak GFLOPS and sustained GFLOPS for transparency.
  • Use consistent problem sizes across devices to ensure meaningful comparisons, and where possible, present per‑operation efficiencies to complement raw numbers.

Conclusion: turning GFLOPS from a number into a performance advantage

GFLOPS are a foundational metric for evaluating floating‑point performance, but they are not a stand‑alone verdict on overall system capability. A high GFLOPS rating is alluring, yet real‑world performance depends on the harmony between computation, memory, software, and workload characteristics. By understanding peak and sustained GFLOPS, the impact of precision, and the role of memory bandwidth, you can make smarter hardware choices, optimise software effectively, and interpret benchmarking results with greater sophistication. In short, gflops and GFLOPS are not merely numbers on a spec sheet; they are a language for describing how fast your problem gets solved in practice.

Further reading and practical next steps

If you are ready to dive deeper, here are practical steps you can take today to improve your understanding and your systems’ GFLOPS performance:

  • Run representative benchmarks on your current hardware, documenting precision, problem size, and software stack.
  • Experiment with different precisions to discover the sweet spot between accuracy and throughput for your workload.
  • Profile memory usage and data movement to identify bottlenecks limiting sustained GFLOPS.
  • Explore compiler optimisations and library routines tuned for your architecture to maximise SIMD utilisation.
  • Consider architectural choices aligned with your workload’s arithmetic intensity and data locality requirements.

With a solid grasp of GFLOPS, you can translate technical specifications into actionable performance insights, and transform raw numbers into tangible improvements for research, development, and production systems.