Synthetic Load/Performance Testing

6 min read · Mar 5, 2020

“If you can’t measure it, you can’t manage it.” Peter Drucker

Database capacity

Increasing load to a database in this manner can reveal the effects load has on your resources, and hopefully expose the point at which your load will begin to affect replication lag.

Caching — disks are the slowest pieces of your infrastructure, which makes accessing them expensive in terms of time. Most large-scale sites alleviate the need for making these expensive operations by caching data in various locations.

Summary

How to create load that is as close as possible to real-world production load

100% agree. I've also run across many cases where no-one bothered to even attemp... | Hacker News


news.ycombinator.com

Is this an existing customer who is going to see increased usage, e.g. 4x?

Is this a new customer whose usage is unlike that of any current user, e.g. more concurrent users or many more subaccounts than current customers?

Is there a known ramp-up for usage? That is, a time range when users go from a low number to peak, e.g. 1k to 89k users between 5AM and 8AM.

Is there a known steady-state concurrent user level? e.g. 89k users on the system from 8AM to 2PM.

System architecture. How do the web servers, db servers, and redis scale? Are they behind an auto scaling group (ASG)? What are the ASG rules? Have any known scaling issues been noticed in the past? Are there any documented concerns?

Ask how a bottleneck in a downstream system affects the user-facing web application. e.g. if db locks or avg query times go up on the database server, how does that affect the user-facing web app? Does it increase error rates?

Ask/find out what the different kinds of load/traffic on the system are during ramp-up or peak load. What are the top api endpoints and background jobs? Which endpoints/controllers handle the most load, e.g. (request count of each endpoint * avg_time of each) order by desc?

Understand the database schema. Look at the row counts of the tables in production and use a ratio to seed an appropriate amount of data. e.g. if the courses table has 1.2 million records and we expect the customer to experience 4x usage, we should seed 4.8 million courses, and not just courses but all related tables. Is there an import feature that the system allows, e.g. SIS import? Use the api to seed data so that it is closer to the real use case. Using ActiveRecord is another way (an understanding of the table/model relations is critical if you decide to go down this path). Once you have identified the endpoints for seeding data, use a tool like JMeter to seed data at scale.

Before starting the load test, ensure that the perf test environment is scaled and sized as close to production as possible: at least 2 application servers under the ASG, a prod-sized db, and redis rings.

Understand the latency breakdown for the rails controller, postgres, redis, and rails cache, e.g. using a flame graph from Datadog APM.

Are there any known critical integrations to be aware of? e.g. a kinesis stream, or a service for checking feature flags. Does a bottleneck in a downstream system have an impact on core application performance, e.g. the user-facing web application?

Part of being a performance engineer is knowing about:

peak traffic hours
user actions
data size and growth rate
key metrics

We use Datadog, Splunk and CloudHealth to track these.

Think Time

Blue bars represent transaction response time plus think time (the whitespace between individual blue bars).

Gaps represent the pacing delay between 2 iterations (the whitespace before the start of the next blue bar).

Increasing think time and pacing can help achieve max user concurrency and production-like behavior.

Little’s Law:

The long-term average number of customers in a stable system N is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: N = λW.

For performance testing,

N = Throughput * (Response time + Think time + Pace time)

pace time is the delay before the arrival of the next iteration

think time is the time taken by user interaction

response time is the system/api level response time (add multiple api times together to simulate a transaction, e.g. ordering an item on an ecommerce app, or navigating to the home page + logging in + performing an action)
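As a rough illustration of the formula above, the sketch below works out how many concurrent virtual users a test needs in order to hold a target throughput. The function name and the sample numbers are made up for illustration; they do not come from any particular load-testing tool.

```python
def concurrent_users(throughput_per_s, response_s, think_s, pace_s):
    """Little's Law for load tests: N = throughput * (response + think + pace)."""
    return throughput_per_s * (response_s + think_s + pace_s)


# Hypothetical target: 100 transactions/s, 2 s response, 7 s think, 1 s pace.
users = concurrent_users(100, 2, 7, 1)
print(users)  # 1000
```

Read the other way round, the same relation says that a fixed pool of virtual users with shorter think and pace times will push more throughput than production users would.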

If we can look at splunk/datadog or another APM tool to figure out the total concurrent users on the system, the avg response time for the transaction, and the arrival rate (pace time), we can calculate think time:

Think time = N/Throughput - (Response time + Pace time)

e.g.

Business case: A user logs in and edits a blog, which takes around 6 secs. The total number of concurrent users at steady state is 80k for 1 hour. The total number of edit-blog calls is 8.3 million an hour. The delay between iterations is 1s.

N=80k concurrent users

Throughput per second = 8.3m/3600 ≈ 2300/s

Think time = 80k/2.3k - (6 + 1) = 34.7 - 7 = 27.7s
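The same arithmetic can be checked with a few lines of Python. This is only a sketch of the rearranged Little's Law formula using the business-case numbers; the function and variable names are illustrative.

```python
def think_time(n_users, throughput_per_s, response_s, pace_s):
    """Think time = N / throughput - (response time + pace time)."""
    return n_users / throughput_per_s - (response_s + pace_s)


calls_per_hour = 8_300_000
throughput = calls_per_hour / 3600  # requests per second, before rounding

t = think_time(80_000, throughput, response_s=6, pace_s=1)
print(round(throughput), round(t, 1))  # 2306 27.7
```

With the unrounded throughput the think time comes out at roughly 27.7s per iteration.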

Peak Traffic Hours

We use multiple app server clusters (we will be focusing on application servers). Each cluster may serve multiple customers. Some clusters are dedicated, usually to a very large customer.

Datadog provides an application-server-wide dashboard (please check with your SRE team member). Identify the cluster which has the most instances.

Look at the cluster-level Datadog app server dashboard and identify the peak hours of user traffic coming into the system, as well as of users leaving the system.

In the above example:

At around 4:40am there are 1200 users on the system, which rapidly grows to 40k users by around 8:00am.

The peak customer activity is between 8am and 1pm.

At around 1pm there are still 40k users on the system, and by 3pm this drops to 5k users.

Also look for patterns: weekday, weekend, bank holiday, and business-related seasonal patterns.

From a performance testing and scaling perspective, 4:40am to 8am is the critical time for customer experience. Your systems should scale up and be ready to service the incoming rush of customers.
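To turn that ramp-up window into a number that a load test or an autoscaling rule can target, one option is to compute the average arrival rate over the window. The sketch below uses the figures from the example above; the helper name is hypothetical, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def ramp_rate_per_min(start, start_users, end, end_users):
    """Average number of users added per minute between two HH:MM timestamps."""
    fmt = "%H:%M"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes = elapsed.total_seconds() / 60
    return (end_users - start_users) / minutes


rate = ramp_rate_per_min("04:40", 1200, "08:00", 40_000)
print(round(rate))  # 194
```

This is an average; the real ramp is unlikely to be linear, so the steepest part of the curve on the dashboard is worth measuring separately.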

User Action

Understanding what user actions are happening on the system is critical in creating synthetic load, measuring system performance, optimizing cost vs usage.

NOTE: underutilized resources burn cash

I used splunk to dig into user actions.

index="myapp_iad" sourcetype="_json" cluster=cluster9974
| eval url=mvindex(split(http_request, "?"), 0)
| eval url1=replace(url, "/\d+", "")
| eval conca=http_method.";".url1
| stats count, p95(db_time), sum(microseconds), sum(user_cpu), p95(microseconds), p95(user_cpu), median(microseconds), median(user_cpu) by conca

For myapp, use the cluster we identified in the earlier section. I'm splitting the http_request uri on "?" and removing any numeric id segments, then concatenating HTTP_METHOD with the cleaned HTTP_REQUEST uri and grouping by the result to find:

count: the total number of requests; use this to identify the most common endpoints being used as the users arrive

total user_cpu: our custom apache records user_cpu for each endpoint. I'm summing it here to identify which endpoints are cpu hungry

total microseconds: the time to service the request, first byte to last byte. I'm summing it here to identify which endpoints took the most time

p95, medians: the 95th percentile and median values
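For readers less familiar with SPL, the url clean-up and grouping done by the query above can be sketched in Python. This is only an illustration of the same idea, not a replacement for the Splunk search; the sample request paths are made up.

```python
import re
from collections import Counter


def normalize(http_method, http_request):
    """Drop the query string and numeric id segments, then prefix the method."""
    path = http_request.split("?", 1)[0]  # keep only the part before "?"
    path = re.sub(r"/\d+", "", path)      # "/courses/123/users" -> "/courses/users"
    return f"{http_method};{path}"


# Hypothetical log entries: (http_method, http_request)
requests = [
    ("GET", "/courses/123/users?page=2"),
    ("GET", "/courses/456/users"),
    ("POST", "/courses/123/assignments"),
]

counts = Counter(normalize(method, uri) for method, uri in requests)
print(counts.most_common())
```

Stripping the ids is what lets requests for different courses collapse into a single endpoint, so the counts reflect endpoint usage rather than individual records.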

Alternative approach: to see user actions during peak hours, group by http_method, controller and action

index="myapp_iad" sourcetype="_json" cluster=cluster9974
| eval conca=http_method.";".controller.";".action
| stats count, p95(db_time), p95(microseconds), p95(user_cpu), median(microseconds), median(user_cpu) by conca

Visualizing the top 80% of user actions during peak hours
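One way to pick out the actions that make up the top 80% of requests, once the counts have been exported from Splunk, is a simple cumulative cut. The sketch below assumes the counts are already in a dict keyed by the method;controller;action string; the sample values are invented.

```python
def top_share(counts, share=0.8):
    """Return the most frequent actions that together cover `share` of all requests."""
    total = sum(counts.values())
    picked, running = [], 0
    for action, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        picked.append(action)
        running += n
    return picked


# Hypothetical request counts per method;controller;action
sample = {
    "GET;courses;show": 500,
    "GET;users;index": 350,
    "POST;submissions;create": 100,
    "GET;files;show": 50,
}
print(top_share(sample))
```

The actions returned here are the ones worth scripting first in the load test, since they account for most of the traffic.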

Data Size and Growth rate

We use postgres. The infrastructure has read/write, read-only and backup replicas. Each of these is in a different AZ for high availability.

-- reltuples is the planner's row estimate, not an exact count
SELECT pgNamespace.nspname AS Schema,
       pgClass.relname AS tableName,
       pgClass.reltuples::bigint AS rowCount
FROM pg_class pgClass
LEFT JOIN pg_namespace pgNamespace ON (pgNamespace.oid = pgClass.relnamespace)
WHERE pgNamespace.nspname NOT IN ('pg_catalog', 'information_schema')
  AND pgClass.relkind = 'r'
ORDER BY pgClass.reltuples DESC;
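The row counts from a query like this can feed the seeding ratio described earlier (1.2 million courses at 4x usage means 4.8 million seeded courses). The sketch below applies one growth factor to every table; the helper name is illustrative, and the enrollments table and its count are hypothetical.

```python
import math


def seed_targets(prod_rows, growth):
    """Scale each production row count by the expected growth factor."""
    return {table: math.ceil(rows * growth) for table, rows in prod_rows.items()}


prod = {
    "courses": 1_200_000,      # figure from the seeding example
    "enrollments": 9_500_000,  # hypothetical related table
}
print(seed_targets(prod, 4))
```

Applying a single factor to all related tables keeps the ratios between them roughly the same as in production, which is the point of seeding more than just the courses table.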

Identify endpoints that can be spun off as a service

these can be scaled independently and give more granularity when scaling up the core system

they can also be moved to a more performant stack (e.g. rails to java)