Synthetic Load/Performance Testing
“If you can’t measure it, you can’t manage it.” Peter Drucker
Database capacity
Increasing load to a database in this manner can reveal the effects load has on your resources, and hopefully expose the point at which your load will begin to affect replication lag.
Caching — disks are the slowest pieces of your infrastructure, which makes accessing them expensive in terms of time. Most large-scale sites alleviate the need for making these expensive operations by caching data in various locations.
Summary
How to create close to as possible real-world production load
Is this a existing customer and they are going to see increase usage by e.g. 4x?
Is this a new customer using unlike any other current users e.g. more concurrent user, much more subaccount than current customers?
Is there a known rampup for usage? Time range when users go from low number to peak e.g. 1k to 89k users between 5AM-8AM
Is there a know steady state concurrent user level? e.g. 89k users on the system from 8AM-2PM
System architecture. How does Web servers, db servers, redis scale? Are they behind a auto scale group? ASG rules? Any know scaling issues noticed in the past? Any document concerns
Ask how does a bottleneck on downstream system affect user facing web application? e.g. if there db lock/avg query times go up in database server how does that effect user facing web app? increase error rates?
Ask/Find during rampUp or Peak Load what are different load/traffic on the system? List of top api endpoints and background jobs? Endpoints/Controller handing most load e.g. (Most used enndpoint*avg_time of each) order by desc.
Understand the database schema. Look at the rowcount of schema in production. Use ratio to seed appropriate data. e.g. if courses table has 1.2million record and we expect customer to experience 4x usage. We should seed 4.8million courses. Not just course all related tables. Is there a import feature that the system allows? e.g. sis import. Use api to seed data so that it is closer to real use case. Using the activerecord is another way(understanding of table/model relation is critical if you decide to go this path). Once you have identify the endpoints to seed the data use tool like jmeter to seed data at scale.
Before starting the load test, ensure that perf test environment is scale and size to as close to production as possible. atleast 2 application servers under the asg, prod sized db, redis rings
Understand latency breakdown for rails controller, postgres, redis, rails cache e.g. using a flamegraph using APM from Datadog
Any know critical integrations to be aware of? e.g. kinesis stream, service for checking feature flags. Does a bottleneck in downstream system have impact of core application performance e.g. user facing web application
Part of performance engineer, knowing about
peak traffic hours
user actions
data size and growth rate
key metrics
We use datadog, splunk and cloudhealth
Think Time
blue bar represents transaction response time + think time(whitespace between individual blue bar)
gaps represents pacing delay between 2 iterations(whitespace before start of next blue bar)
increasing think time and pacing can help achieve max user concurrency and production behavior
Little’s Law:
The long-term average number of customers in a stable system N is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or expressed algebraically: N = λW.
For performance testing,
N=Throughtput*(Response time+Think time + Pace Time)
pace time is delay of arrival of next interation
think time is time for user interation
response time is system/api level response time (add multiple api times to simulate a transaction e.g. ordering an item on ecommerce app or navigation to home page+login+performing action)
If we can look at splunk/datadadog or an apm tool to figure out the total concurrent users on the system, avg response time for the transaction,arrival rates(pace time), we can calculate think time
Think time = N/Throughput -(response time +pace time)
e.g.
Business case: A user logs in and edit a blog which takes around 6secs. The total concurrent users at steady state is 80k for 1 hour. The total edit blog call is 8.3million an hour. The delay between iteration is 1s
N=80k concurrent users
throughput in s=8.3m/3600=2300/s
Think time=80k/2.3k-(6+1)=34.7–7=24.7s
Peak Traffic Hours
We use multiple app server clusters(we will focusing on application servers). Each cluster maybe server muliple customers. some cluster have dedicated usually a very large customer
Datadog provides an application server wide dashboard(please check with your SRE team member). Identify the cluster which has most amount of instances
Look at the cluster level datadog app server dashboard and identify the peak hours of user traffic into the system as well as user leaving the system
In this above example
At around 4:40am there are 1200 users on the system which rapidly grows to 40k users around 8:00am
The peak customer activity is between 8am to 1pm
by around 1pm there are 40k users on the system and 3pm drops down to 5k user
Also look if there is pattern,
weekday, weekend, bank holiday pattern, business related seasonal
From a performance testing and scaling perspective, 4:40am to 8am is the critical customer experience time. Your systems should scale up/be ready to service incoming rush of customers.
User Action
Understanding what user actions are happening on the system is critical in creating synthetic load, measuring system performance, optimizing cost vs usage.
NOTE: underutilized resource burns cash
I used splunk to dig into user actions.
index="myapp_iad" sourcetype="_json" cluster=cluster9974 | eval url=mvindex(split(http_request, "?"), 0)| eval url1 = replace(url, "/\d+", "")| eval conca=http_method.";".url1 |stats count, p95(db_time), sum(microseconds), sum(user_cpu), p95(microseconds), p95(user_cpu), median(microseconds), median(user_cpu) by conca
For myapp, use the cluster we had identified in our earlier section. i’m spliting the http_request uri and replacing any id data. Using concatenate to included HTTP_METHOD and evaluated HTTP_REQUEST uri and group them to find
count: count of total requests, use this to identify most common endpoint being used as the users arrive
total user_cpu: our custom apache records user_cpu for each endpoint. I’m counting the total here to identify the which endpoints are cpu hungry
total microseconds: time to service the request, first byte to last byte. I’m counting the total here to identify the which endpoints took the most time.
p95,medians: 95th percentile and medians
Alternative approach: i want to see user action during peak hours
http_method, controller and action
index="myapp_iad" sourcetype="_json" cluster=cluster9974 | eval conca=http_method.";".controller.";".action |stats count, p95(db_time), p95(microseconds), p95(user_cpu), median(microseconds), median(user_cpu) by conca
Data Size and Growth rate
We use postgres. The infrastructure has read/write, read-only and backup replicas. Each of the above are in different AZ’s for high availability
SELECTpgNamespace.nspname as Schema,pgClass.relname AS tableName,pgClass.reltuples::bigint AS rowCountFROMpg_class pgClassLEFT JOINpg_namespace pgNamespace ON (pgNamespace.oid = pgClass.relnamespace)WHEREpgNamespace.nspname NOT IN ('pg_catalog', 'information_schema') ANDpgClass.relkind='r' andorder by pgClass.reltuples desc;
Identify endpoints that can spun-off as service
can be independently scaled and more granularity in scaling up core system
move to more performant stack(e.g. rails to java)