Running Performance Benchmarks Effectively
Dec 24, 2024
General Advice on Benchmarking
Accept the Risk of Misleading Results:
- Any experiment, including a benchmark, carries some risk of producing misleading results. Even with perfect execution there is perhaps a one-in-ten chance of being misled, and the risk grows for anyone not following best practices.
Benchmark Large Code, Not Small Snippets:
- Small snippets of code behave differently in isolation than in real-world usage, because JIT and AOT compilers optimize based on heuristics and surrounding context (the sketch after this list shows one way a JIT can skew a tiny loop).
- Measure performance in an end-to-end scenario to better account for real-world behavior, such as how your changes impact overall application performance.
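To make this concrete, here is a minimal TypeScript sketch for Node.js; the `sumSquares` helper is a made-up stand-in for real code. When a benchmark discards the result, the JIT may elide the work entirely; consuming the result helps, but the timing still reflects an artificial hot loop rather than your application.

```ts
import { performance } from "node:perf_hooks";

// Hypothetical function under test.
function sumSquares(n: number): number {
  let total = 0;
  for (let i = 0; i < n; i++) total += i * i;
  return total;
}

// Naive micro-benchmark: the result is discarded, so a JIT is free to
// treat the call as dead code and report a meaninglessly small time.
let t0 = performance.now();
for (let i = 0; i < 1_000_000; i++) {
  sumSquares(100);
}
console.log(`discarded: ${(performance.now() - t0).toFixed(2)} ms`);

// Consuming the result blocks that elision, but the number still only
// describes this loop, not the function's cost inside a real application.
let sink = 0;
t0 = performance.now();
for (let i = 0; i < 1_000_000; i++) {
  sink += sumSquares(100);
}
console.log(`consumed: ${(performance.now() - t0).toFixed(2)} ms (sink=${sink})`);
```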
Use A/B (Baseline) Testing for Comparisons:
- Test alternative implementations by benchmarking the entire application with one version versus the other, so you observe the real impact; a sketch of such a harness follows.
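Here is a minimal sketch of such a harness, assuming the workflow can be wrapped in a plain function. `runAppWithVersionA` and `runAppWithVersionB` are placeholders for your own end-to-end entry points; the idea is to time whole runs and compare medians, not to time the changed function in isolation.

```ts
import { performance } from "node:perf_hooks";

type Workflow = () => void;

// Time `runs` full executions of a workflow and return the samples.
function time(fn: Workflow, runs: number): number[] {
  return Array.from({ length: runs }, () => {
    const t0 = performance.now();
    fn();
    return performance.now() - t0;
  });
}

// Median is more robust than mean against one-off hiccups (GC, OS noise).
function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

// Placeholder end-to-end entry points: each should run the whole
// application workflow wired to one candidate implementation.
const runAppWithVersionA: Workflow = () => { /* full workflow using A */ };
const runAppWithVersionB: Workflow = () => { /* full workflow using B */ };

console.log("A median:", median(time(runAppWithVersionA, 10)).toFixed(1), "ms");
console.log("B median:", median(time(runAppWithVersionB, 10)).toFixed(1), "ms");
```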
Supporting Opinions and Additional Insights
Testing in Larger Contexts:
- Large-scale tests reduce the likelihood of misleading results, though they don’t eliminate the risk entirely.
- Iterative optimization using small benchmarks can guide early improvements but should always be validated in a larger context.
Micro-Benchmarking Risks:
- Micro-benchmarks can produce inaccurate results if you don’t understand the underlying system. Frameworks or harnesses used in such benchmarks can sometimes introduce performance issues unrelated to real-world usage.
Two-Phase Approach:
- Phase 1: Use small-scale benchmarks during initial optimization.
- Phase 2: Validate the changes in a larger context to ensure they scale correctly before finalizing.
Multiplying Call Counts:
- Running code multiple times in a loop can amplify performance patterns, lifting them above timer resolution and making trends easier to identify. However, repetition also lets caches and the JIT warm up in ways a single real call never would, which can create false signals (see the sketch below).
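A small sketch of the technique, using a hypothetical `parseSmallJson` as the operation under test; the caveat in the comments is the false-signal risk mentioned above.

```ts
import { performance } from "node:perf_hooks";

// Hypothetical fast operation under test; a single call is far below
// timer resolution, so we repeat it to amplify the signal.
const parseSmallJson = () => JSON.parse('{"a": 1, "b": [2, 3]}');

function meanCostPerCall(fn: () => void, reps: number): number {
  const t0 = performance.now();
  for (let i = 0; i < reps; i++) fn();
  return (performance.now() - t0) / reps;
}

// Caveat: by iteration 100,000 the JIT has fully optimized the hot
// loop and every cache is warm, a state a real caller may never see.
console.log(`~${(meanCostPerCall(parseSmallJson, 100_000) * 1e6).toFixed(0)} ns/call`);
```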
Deterministic Testing:
- Avoid accidental duplication of work and keep test cases deterministic. Verify expected invocation counts so the observed behavior matches what you think you measured; a counting wrapper like the one sketched below makes this cheap.
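One lightweight way to enforce this, with illustrative names: wrap the function under test in a counter and assert the expected call count before trusting the numbers. A mismatch usually means work was duplicated or skipped, which invalidates the timing.

```ts
// Wrap a function so the benchmark can check how many times it actually ran.
function counted<T extends unknown[], R>(fn: (...args: T) => R) {
  let calls = 0;
  const wrapped = (...args: T): R => {
    calls++;
    return fn(...args);
  };
  return { wrapped, count: () => calls };
}

// Hypothetical function under test.
const { wrapped: render, count } = counted((id: number) => `item-${id}`);

for (let i = 0; i < 1000; i++) render(i);

// Verify the expected invocation count before trusting any timings.
if (count() !== 1000) {
  throw new Error(`expected 1000 calls, saw ${count()}`);
}
```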
Garbage Collection (GC) Sensitivity:
- GC behavior can vary between environments and heavily influence benchmark results. Custom benchmark runners, or tools like js-framework-benchmark and WebKit's Speedometer, are recommended for realistic measurements; one way to reduce run-to-run GC noise is sketched below.
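As a rough illustration for Node.js (run with the `--expose-gc` flag, which makes `global.gc` available), you can force a collection before each sample so leftover garbage from one run doesn't bleed into the next. Note the trade-off: this hides GC pressure that your change may itself be causing, so treat it as a noise-reduction tool, not a final verdict.

```ts
// Run with: node --expose-gc bench.js (global.gc exists only under that flag).
import { performance } from "node:perf_hooks";

const gc = (globalThis as { gc?: () => void }).gc;

function measureWithGcIsolation(fn: () => void, runs: number): number[] {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    gc?.(); // collect before each run so prior garbage doesn't skew this sample
    const t0 = performance.now();
    fn();
    samples.push(performance.now() - t0);
  }
  return samples;
}

// Hypothetical allocation-heavy workload.
const workload = () => Array.from({ length: 50_000 }, (_, i) => ({ i }));
console.log(measureWithGcIsolation(workload, 5).map((ms) => ms.toFixed(2)));
```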
Caching Effects:
- Caching can obscure performance signals. Be mindful of how it interacts with benchmarks and verify that results reflect actual improvements rather than cache artifacts; the sketch below separates cold and warm timings.
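A small sketch of the trap, using a memoized `expensive` function as an illustrative example: after the first call the benchmark is timing a cache lookup, not the computation, so cold and warm runs should be reported separately.

```ts
import { performance } from "node:perf_hooks";

// Illustrative memoized function: the second call with the same input
// hits the cache and costs almost nothing.
const cache = new Map<number, number>();
function expensive(n: number): number {
  const hit = cache.get(n);
  if (hit !== undefined) return hit;
  let acc = 0;
  for (let i = 0; i < n * 1000; i++) acc += Math.sqrt(i);
  cache.set(n, acc);
  return acc;
}

const time = (fn: () => void): number => {
  const t0 = performance.now();
  fn();
  return performance.now() - t0;
};

// Quoting only the warm number would overstate the improvement
// for first-time callers; report both.
console.log("cold:", time(() => expensive(5000)).toFixed(2), "ms");
console.log("warm:", time(() => expensive(5000)).toFixed(2), "ms");
```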
Recommendations for Reliable Benchmarking
Focus on Realistic Workloads:
- Benchmark entire workflows or applications instead of isolated functions.
Understand Your Environment:
- Be aware of how language runtimes, JITs, and GC implementations affect performance.
Iterate and Validate:
- Use small benchmarks for quick feedback, but always verify changes in a real-world context.
Use Reliable Tools:
- Tools like js-framework-benchmark and WebKit's Speedometer are designed to avoid common pitfalls in benchmarking.
By following these principles, you can reduce the risk of misleading results and ensure your benchmarks provide meaningful insights.