Running Performance Benchmarks Effectively

Natarajan Santhosh
2 min read · Dec 24, 2024


General Advice on Benchmarking

Accept the Risk of Misleading Results:

  • Every experiment, benchmarking included, carries a risk of producing misleading results. Even with careful execution, assume roughly a one-in-ten chance of being misled; that risk grows quickly when best practices are skipped.

Benchmark Large Code, Not Small Snippets:

  • Small snippets of code behave differently in isolation than they do in real-world usage, because JIT and AOT compilers optimize based on heuristics and surrounding context.
  • Measure performance in an end-to-end scenario to capture real-world behavior, such as how your change affects overall application performance (see the sketch below).
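
As a rough illustration, the sketch below times a representative workflow rather than a single hot function. `renderReport` and its input shape are hypothetical stand-ins for whatever path your application actually exercises.

```ts
// Sketch: time a whole workflow rather than one isolated function.
// `renderReport` and its inputs are illustrative, not from any real codebase.
interface Row { id: number; total: number }

function renderReport(rows: Row[]): string {
  // Representative work: the same filtering and formatting the application does.
  return rows
    .filter((r) => r.total > 0)
    .map((r) => `#${r.id}: ${r.total.toFixed(2)}`)
    .join("\n");
}

const rows: Row[] = Array.from({ length: 50_000 }, (_, i) => ({
  id: i,
  total: Math.random() * 100 - 10,
}));

const start = performance.now();
const report = renderReport(rows); // end-to-end path, not a single hot loop
const elapsed = performance.now() - start;
console.log(`report length ${report.length}, took ${elapsed.toFixed(1)} ms`);
```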

Use A/B (Baseline) Testing for Comparisons:

  • Test alternative implementations by benchmarking the entire application with one version versus another to observe the real impact.
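
A minimal sketch of this idea, stripped down to a single function for brevity: `baselineImpl` and `candidateImpl` are placeholders for the two versions under comparison, and in practice you would exercise the whole application rather than one function.

```ts
// Sketch of an A/B (baseline vs. candidate) comparison over the same workload.
type Impl = (input: number[]) => number;

const baselineImpl: Impl = (xs) => xs.reduce((a, b) => a + b, 0);
const candidateImpl: Impl = (xs) => {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum;
};

function timeRun(impl: Impl, input: number[]): number {
  const start = performance.now();
  impl(input);
  return performance.now() - start;
}

const input = Array.from({ length: 1_000_000 }, () => Math.random());

// Interleave A and B runs so machine-level drift affects both sides equally.
const samples = { baseline: [] as number[], candidate: [] as number[] };
for (let i = 0; i < 10; i++) {
  samples.baseline.push(timeRun(baselineImpl, input));
  samples.candidate.push(timeRun(candidateImpl, input));
}

const median = (xs: number[]) =>
  [...xs].sort((a, b) => a - b)[Math.floor(xs.length / 2)];
console.log(
  `baseline ~${median(samples.baseline).toFixed(2)} ms, ` +
    `candidate ~${median(samples.candidate).toFixed(2)} ms`
);
```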

Supporting Opinions and Additional Insights

Testing in Larger Contexts:

  • Large-scale tests reduce the likelihood of misleading results, though they don’t eliminate the risk entirely.
  • Iterative optimization using small benchmarks can guide early improvements but should always be validated in a larger context.

Micro-Benchmarking Risks:

  • Micro-benchmarks can produce inaccurate results if you don’t understand the underlying system. Frameworks or harnesses used in such benchmarks can sometimes introduce performance issues unrelated to real-world usage.

Two-Phase Approach:

  • Phase 1: Use small-scale benchmarks during initial optimization.
  • Phase 2: Validate the changes in a larger context to ensure they scale correctly before finalizing.

Multiplying Call Counts:

  • Running code multiple times in a loop can amplify performance patterns, making it easier to identify trends. However, this approach can also create false signals, since a tight loop keeps caches warm and shows the JIT an unrealistically uniform workload.
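
A sketch of loop-based amplification, assuming a hypothetical `parsePayload` function; the comments note the caveats about warm caches and uniform inputs.

```ts
// Sketch: repeat a call many times to amplify the signal, report per-iteration
// cost, and keep the result live so the work is not optimized away.
// `parsePayload` is a hypothetical function under test.
function parsePayload(json: string): number {
  return JSON.parse(json).items.length;
}

const payload = JSON.stringify({ items: Array.from({ length: 100 }, (_, i) => i) });
const iterations = 10_000;

let sink = 0; // consuming results helps prevent dead-code elimination
const start = performance.now();
for (let i = 0; i < iterations; i++) {
  sink += parsePayload(payload);
}
const perCallMs = (performance.now() - start) / iterations;
console.log(`~${(perCallMs * 1000).toFixed(2)} µs per call (sink=${sink})`);

// Caveat: a tight loop over the same input can look faster than real traffic,
// because caches stay warm and the JIT only ever sees one shape of data.
```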

Deterministic Testing:

  • Avoid accidental duplication of work and keep test cases deterministic. Verify expected invocation counts so that the measured behavior matches what you intended to measure.
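
One way to do this is to wrap the code under test in a counter and assert the count afterwards. The sketch below uses a hypothetical `transform` function; the wrapper is illustrative, not taken from any particular framework.

```ts
// Sketch: wrap the function under test so the benchmark can assert it ran
// exactly the expected number of times.
function transform(x: number): number {
  return x * 2 + 1;
}

function countCalls<T extends (...args: any[]) => any>(fn: T) {
  let calls = 0;
  const wrapped = ((...args: any[]) => {
    calls++;
    return fn(...args);
  }) as T;
  return { wrapped, calls: () => calls };
}

const { wrapped, calls } = countCalls(transform);
const input = Array.from({ length: 1_000 }, (_, i) => i);

const start = performance.now();
const out = input.map(wrapped);
const elapsed = performance.now() - start;

// If the count differs, the benchmark did more (or less) work than intended.
console.assert(calls() === input.length, `expected ${input.length} calls, got ${calls()}`);
console.log(`${out.length} results in ${elapsed.toFixed(2)} ms, ${calls()} invocations`);
```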

Garbage Collection (GC) Sensitivity:

  • GC behavior can vary between environments and heavily influence benchmark results. Custom benchmark runners, or tools like js-framework-benchmark and WebKit’s Speedometer, are recommended for realistic measurements.
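
For Node-based code, a rough way to keep GC effects visible is to start each run from a collected heap and record heap growth alongside the timing. The sketch below assumes Node run with `--expose-gc` (which makes `gc()` available) and uses a hypothetical allocation-heavy `buildIndex` step.

```ts
// Sketch: start each run from a comparable heap state and record allocation
// pressure, so GC pauses aren't silently folded into one variant's numbers.
// Assumes Node with --expose-gc; `buildIndex` is a hypothetical workload.
const maybeGc = (globalThis as unknown as { gc?: () => void }).gc;

function buildIndex(n: number): Map<string, number[]> {
  const index = new Map<string, number[]>();
  for (let i = 0; i < n; i++) {
    index.set(`key-${i % 100}`, Array.from({ length: 10 }, () => i));
  }
  return index;
}

maybeGc?.(); // collect before measuring, if --expose-gc was passed
const heapBefore = process.memoryUsage().heapUsed;
const start = performance.now();
const index = buildIndex(200_000);
const elapsed = performance.now() - start;
const heapAfter = process.memoryUsage().heapUsed;

console.log(
  `${index.size} keys in ${elapsed.toFixed(1)} ms, ` +
    `heap grew ${((heapAfter - heapBefore) / 1024 / 1024).toFixed(1)} MB`
);
```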

Caching Effects:

  • Caching can obscure performance signals. Be mindful of how it interacts with benchmarks and verify that results reflect actual improvements rather than artifacts of caching.
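
A small sketch of the cold-versus-warm distinction, using a hypothetical memoized `expensiveLookup`; reporting only the warm number would hide the real cost.

```ts
// Sketch: separate the cold (first) run from warm runs so a cache hit is not
// mistaken for a real speedup. `expensiveLookup` and its memoization are hypothetical.
const cache = new Map<string, number>();

function expensiveLookup(key: string): number {
  if (cache.has(key)) return cache.get(key)!; // warm path: cache hit
  let value = 0;
  for (let i = 0; i < 1_000_000; i++) value += (i ^ key.length) % 7; // cold path
  cache.set(key, value);
  return value;
}

function timeOnce(label: string): void {
  const start = performance.now();
  expensiveLookup("feature-flags");
  console.log(`${label}: ${(performance.now() - start).toFixed(2)} ms`);
}

timeOnce("cold"); // pays the full cost
timeOnce("warm"); // mostly measures the cache, not the computation

// If only warm numbers are reported, the benchmark can show an "improvement"
// that disappears whenever the cache is empty in production.
```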

Recommendations for Reliable Benchmarking

Focus on Realistic Workloads:

  • Benchmark entire workflows or applications instead of isolated functions.

Understand Your Environment:

  • Be aware of how language runtimes, JITs, and GC implementations affect performance.

Iterate and Validate:

  • Use small benchmarks for quick feedback, but always verify changes in a real-world context.

Use Reliable Tools:

  • Tools like js-framework-benchmark and WebKit’s Speedometer are designed to avoid common pitfalls in benchmarking.

By following these principles, you can reduce the risk of misleading results and ensure your benchmarks provide meaningful insights.
