User-Uptime

Natarajan Santhosh
3 min read · Dec 7, 2020


One person’s uptime is another person’s downtime (‘Is it down for you? It’s down for me…’).

By considering uptime (and downtime) from the perspective of individual users, we can combine the strengths of time-based¹ and count-based² metrics to give a meaningful and proportional metric: user-uptime.

User-uptime is a novel availability metric that combines the advantages of per-user aggregation with those of a time-based availability measure. See Google’s paper for more information. [Google’s Meaningful Availability Paper]
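To make the aggregation concrete, here is a minimal sketch of our own (an illustration, not the paper’s exact algorithm): per-user up/down durations are accumulated from each user’s own event timeline and then combined into a single ratio. The per_user_durations input and its values are hypothetical.

# Minimal sketch: combine per-user uptime/downtime durations (in seconds)
# into a single user-uptime percentage. The input hash is hypothetical;
# in practice the durations come from each user's event timeline.
per_user_durations = {
  'user-a' => { up: 3_500.0, down: 100.0 },
  'user-b' => { up: 3_600.0, down: 0.0 }
}

total_up = per_user_durations.values.sum { |d| d[:up] }
total_down = per_user_durations.values.sum { |d| d[:down] }

user_uptime = total_up * 100 / (total_up + total_down)
puts "user-uptime => #{user_uptime.round(4)}" # => 98.6111

Because every user contributes in proportion to their own active time, a single hyperactive user can no longer dominate the metric the way they can with simple success-rate counting.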

For vs Against

Challenges

  • User-uptime is measurable either at the ELB or from application logs, each with its own pros and cons. See the slides for the comparison [Meaningful Availability Slides, slide 15].
  • Dashboards and alerting need to be defined from scratch; no vendor provides examples or a turnkey solution. We don’t yet know which tool we’ll use, but candidates include Prometheus and Grafana. The data can also be sent to Datadog (a vendor), which already gives us a built-in alerting mechanism.
  • Sampling: sampling can reduce the complexity of this implementation, but we need to be aware of the sampling error it introduces (see the sketch after this list).
  • User: we need to align on whether computing the metric per user is necessary, or whether computing it per shard/host on the cluster is sufficient.
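On the sampling point above, one common approach (an assumption on our part, not something we have built) is to hash the user identifier and keep a fixed fraction of users, so the same users are consistently in or out of the sample. SAMPLE_RATE and the user-id format are hypothetical.

require 'zlib'

# Minimal sketch: deterministic 10% user sample via a stable hash of the user id.
SAMPLE_RATE = 0.10

def sampled?(user_id)
  # Zlib.crc32 is stable across runs, so a given user is always in or out.
  (Zlib.crc32(user_id) % 1000) < (SAMPLE_RATE * 1000)
end

puts sampled?('1.1.1.1:111')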

Recommendation

Our recommendation is to implement user-uptime at a smaller scale (e.g. on a single cluster) with real-time measurement. This metric lets us detect outages in a way that is more honest and meaningful to our customers. Specifically, windowed user-uptime³ (see Google’s paper) enables us to differentiate between many short outages and fewer but longer ones.
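To illustrate the windowed idea, here is a simplified sketch of our own (not the paper’s exact algorithm): bucket up/down seconds into fixed intervals, then report the worst availability seen over any contiguous window of a given size. The minute_buckets data is made up.

# Per-minute [up, down] second counts (hypothetical data).
minute_buckets = [[60.0, 0.0], [30.0, 30.0], [0.0, 60.0], [60.0, 0.0], [60.0, 0.0]]

# Worst availability over any contiguous window of `window_size` minutes.
def worst_window_availability(buckets, window_size)
  buckets.each_cons(window_size).map do |window|
    up = window.sum { |u, _| u }
    down = window.sum { |_, d| d }
    up / (up + down)
  end.min
end

puts worst_window_availability(minute_buckets, 1) # => 0.0 (one fully-down minute)
puts worst_window_availability(minute_buckets, 3) # => 0.5 (worst 3-minute stretch)

A long outage drags down every window that overlaps it, while many short blips mainly hurt the smallest windows, which is what lets this view tell the two failure patterns apart.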

ELB PoC [Non-Real Time]

The PoC is implemented in Ruby. For better performance, consider using a compiled language such as Go.

The ELB PoC is currently a non-real-time solution. We identified a time when a cluster was experiencing an availability issue, downloaded a single log file covering that time range from an S3 bucket, parsed the log, and applied the function below to the data.

Computing over a longer duration would require merging ELB logs from multiple load balancers. We did not attempt this, but a rough sketch follows the code below.

https://gist.github.com/san-nat/856e8b630ae15f11adf68fe427af710b

require 'time'

# Meaningful availability: https://www.usenix.org/system/files/nsdi20spring_hauer_prepub.pdf

# Sample log format (timestamp, client ip:port, ELB status code, backend status code):
# 2020-12-07T22:05:00.451833Z 1.1.1.1:111 200 200
# 2020-12-07T22:05:00.915406Z 1.1.1.1:122 304 304
# 2020-12-07T22:05:01.874904Z 1.1.1.1:121 304 304

# To extract and create the above file from an AWS ELB log, I've used
# miller (https://github.com/johnkerl/miller). Run the below on the command line:
# mlr --nidx cut -f 2,4,9,10 515am.log > 515am42.log

uptime_duration = 0
downtime_duration = 0
user_uptime = 0
previous_status_is_success = nil
previous_timestamp = nil

File.open('/Users/foobar/Downloads/millertime/prod/prod3_4c.log', 'r') do |file_handle|
  file_handle.each_line do |line|
    e = line.split(' ')
    if previous_timestamp.nil? # first event: nothing to attribute yet
      previous_timestamp = e[0]
      previous_status_is_success = e[2].to_i <= 499
      next
    end

    duration = Time.parse(e[0]) - Time.parse(previous_timestamp)
    if duration > 1800 # gap longer than 30 minutes: treat the user as inactive
      previous_timestamp = e[0]
      previous_status_is_success = e[2].to_i <= 499
      next
    end

    # Attribute the interval to the state set by the previous event: time after
    # a successful event counts as uptime, time after a failed one as downtime.
    if previous_status_is_success
      uptime_duration += duration
    else
      downtime_duration += duration
    end

    previous_timestamp = e[0]
    previous_status_is_success = e[2].to_i <= 499
  end
end

puts "uptime => #{uptime_duration / 60}"
puts "downtime => #{downtime_duration / 60}"

user_uptime = uptime_duration * 100 / (uptime_duration + downtime_duration) if uptime_duration > 0
puts "user-uptime => #{user_uptime.round(4)}"

Appendix

¹ Time-based metric

² Hyperactive user bias

³ Windowed user-uptime
