Introduction to {{#x|off-CPU}} Time {{#i|Flame Graphs}} ☺{{#author|agentzh@cloudflare.com}}☺ {{#author|Yichun Zhang (agentzh)}} {{#date|2013.08.22}} ---- {{#x|♡}} Classic Flame Graphs {{#x|are}} {{#ci|on}}-CPU time Flame Graphs per se. ---- {{img src="image/on-cpu-sketch2.png" width="600" height="280"}} ---- {{#x|♡}} We are already relying on them to {{#ci|optimize}} our Lua {{#x|WAF}} & Lua {{#x|CDN Brain}} (cfcheck) ---- {{img src="image/lua-cmsgpack-orig2.png" width="730" height="245"}} ---- {{img src="image/prealloc-local.png" width="463" height="123"}} ---- {{#x|♡}} I invented {{#ci|off}}-CPU time Flame Graphs somewhere near {{#x|Lake Tahoe}} 3 months ago. ---- {{img src="image/lake-tahoe.jpg" width="500" height="375"}} ---- {{img src="image/off-cpu-sketch2.png" width="600" height="225"}} ---- {{#x|♡}} I got the {{#ci|inspiration}} from Brendan Gregg's blog post \"{{#x|Off-CPU Performance Analysis}}\" ---- {{http://dtrace.org/blogs/brendan/2011/07/08/off-cpu-performance-analysis/}} ---- {{#x|♡}} Joshua Dankbaar {{#ci|grabbed}} me for an online issue right after the company {{#x|Kitchen Adventure}}. ---- {{#x|♡}} Time to cast a spell over our {{#x|Linux}} boxes with {{#ci|systemtap}}! ---- {{#x|♡}} I quickly wrote a {{#ci|macro-style}} language {{#x|extension}} named {{#x|stap++}} for systemtap with a little bit of Perl. {{https://github.com/agentzh/stapxx}} ---- {{img src="image/packets2.png" width="455" height="416"}} ---- {{#x|♡}} Nginx workers were being badly {{#ci|blocked}} by something on a {{#x|production box}} in Ashburn ---- {{#cm|/* pseudo-code for the nginx event loop */}} {{#kw|for}} (;;) { ret = epoll_wait(...); {{#cm|/* process new events}} {{#cm|and expired timers here... */}} } ---- {{#x|♡}} Let's write a simple {{#x|tool}} to trace the long {{#ci|blocking}} latencies in the Nginx {{#i|event loop}}! 
{{#v|$}} vim epoll-loop-blocking.sxx ---- {{#cm|#!/usr/bin/env stap++}} {{#kw|global}} begin {{#kw|probe}} {{#c|syscall.epoll_wait.return}} { {{#kw|if}} (target() == pid()) { begin = gettimeofday_ms() } } {{#kw|probe}} {{#c|syscall.epoll_wait}} { {{#kw|if}} (target() == pid() && begin > 0) { elapsed = gettimeofday_ms() - begin {{#kw|if}} (elapsed >= {{#v|$^arg_limit}} :{{#kw|default}}(200)) { printf(\"[%d] epoll loop blocked for %dms\n\", gettimeofday_s(), elapsed) } } } ---- {{#v|$}} ./epoll-loop-blocking.sxx -x {{#c|22845}} --arg limit={{#c|200}} Start tracing 22845... [1376595038] epoll loop blocked for 208ms [1376595040] epoll loop blocked for 485ms [1376595044] epoll loop blocked for 336ms [1376595049] epoll loop blocked for 734ms [1376595057] epoll loop blocked for 379ms [1376595061] epoll loop blocked for 227ms [1376595062] epoll loop blocked for 212ms [1376595066] epoll loop blocked for 390ms ---- {{#x|♡}} Is it {{#ci|file IO}} that's blocking here? ---- {{#cm|# add some code to trace file IO latency at the same time...}} {{#kw|global}} vfs_begin {{#kw|global}} vfs_latency {{#kw|probe}} {{#c|syscall.rename, syscall.open, syscall.sendfile*,}} {{#c|vfs.read, vfs.write}} { {{#kw|if}} (target() == pid()) { vfs_begin = gettimeofday_us() } } {{#kw|probe}} {{#c|syscall.rename.return, syscall.open.return,}} {{#c|syscall.sendfile*.return, vfs.read.return, vfs.write.return}} { {{#kw|if}} (target() == pid()) { vfs_latency += gettimeofday_us() - vfs_begin } } ---- {{#v|$}} ./epoll-loop-blocking-vfs.sxx -x {{#c|22845}} --arg limit={{#c|200}} Start tracing 22845... 
[1376596251] epoll loop blocked for 364ms (file IO: 19ms) [1376596266] epoll loop blocked for 288ms (file IO: 0ms) [1376596270] epoll loop blocked for 1002ms (file IO: 0ms) [1376596272] epoll loop blocked for 206ms (file IO: 5ms) [1376596280] epoll loop blocked for 218ms (file IO: 211ms) [1376596283] epoll loop blocked for 396ms (file IO: 9ms) ---- {{#x|♡}} Hmm...seems like {{#x|file IO}} is {{#ci|not}} the major factor here... ---- {{#x|♡}} I suddenly {{#ci|remembered}} my off-CPU time Flame Graph tool created {{#x|3 months ago}}... ---- {{https://github.com/agentzh/nginx-systemtap-toolkit#ngx-sample-bt-off-cpu}} ---- {{#v|$}} ./ngx-sample-bt-off-cpu -t {{#c|10}} -x {{#c|16782}} > a.bt {{#v|$}} stackcollapse-stap.pl a.bt > a.cbt {{#v|$}} flamegraph.pl a.cbt > a.svg ---- {{img src="image/off-cpu-flame-graph-nginx-cache2.png" width="729" height="400"}} ---- {{img src="image/flamegraph-v1-local.png" width="694" height="155"}} ---- {{#x|♡}} Okay, Nginx was mainly waiting on a {{#ci|lock}} in an {{#x|obsolete}} code path which was added to Nginx by one of us (a long time ago?) ---- {{#x|♡}} Let's just {{#ci|remove}} the guilty code path from our production system! ---- {{#x|♡}} Yay! The number of long-running requests (longer than 1 second) is almost halved! {{img src="image/slow-response-afer-fix-1.png" width="235" height="236"}} ---- {{#v|$}} ./epoll-loop-blocking-vfs.sxx -x {{#c|16738}} --arg limit={{#c|200}} Start tracing 16738... [1376626387] epoll loop blocked for 456ms (file IO: 455ms) [1376626388] epoll loop blocked for 207ms (file IO: 206ms) [1376626396] epoll loop blocked for 364ms (file IO: 363ms) [1376626402] epoll loop blocked for 350ms (file IO: 349ms) [1376626414] epoll loop blocked for 309ms (file IO: 309ms) ---- {{img src="image/off-cpu-flame-graph-nginx-cache-v2.png" width="729" height="400"}} ---- {{img src="image/flamegraph-v2-local.png" width="442" height="364"}} ---- {{#x|♡}} Okay, now it is {{#ci|file IO}} that's {{#x|killing}} us! 
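---- {{#x|♡}} The {{#ci|stack collapse}} step in the pipeline above (stackcollapse-stap.pl) can be sketched in a few lines of Python. This is only a sketch under assumptions, not the real script: it assumes each backtrace block is one frame per line, leaf first, terminated by a bare sample-count line, with blank lines between blocks; the frame names in the demo input are merely illustrative, borrowed from the Nginx functions discussed in this talk.

```python
# Sketch of the "stack collapse" step: fold multi-line backtraces
# into one-line "frame1;frame2;... count" records for flamegraph.pl.
# Assumed input format (not the real stackcollapse-stap.pl parser):
# frame lines leaf-first, a bare count line ends a block, blank
# lines separate blocks.
from collections import defaultdict

def collapse(lines):
    folded = defaultdict(int)
    frames = []
    for line in lines:
        line = line.strip()
        if not line:
            frames = []          # blank line: start a new backtrace
            continue
        if line.isdigit():
            # count line terminates the current backtrace block
            if frames:
                # flamegraph.pl expects root-first frame order
                folded[";".join(reversed(frames))] += int(line)
            frames = []
        else:
            frames.append(line)  # another stack frame (leaf first)
    return ["%s %d" % (k, v) for k, v in sorted(folded.items())]

# Illustrative input: two samples of the same open()-related stack.
sample = [
    "sys_open",
    "ngx_open_and_stat_file",
    "ngx_open_cached_file",
    "3",
    "",
    "sys_open",
    "ngx_open_and_stat_file",
    "ngx_open_cached_file",
    "2",
]
print("\n".join(collapse(sample)))
# prints: ngx_open_cached_file;ngx_open_and_stat_file;sys_open 5
```

Identical stacks get their sample counts merged, which is what makes the widths in the resulting Flame Graph proportional to total off-CPU time.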
---- {{#x|♡}} Let's tune Nginx's {{#ci|open_file_cache}} configurations to save the {{#x|open()}} system calls. ---- {{#x|♡}} But...wait...we have {{#ci|not}} even {{#x|enabled}} it yet in production... ---- {{#cm|# 2520 is the nginx worker process's pid}} {{#v|$}} stap++ -x 2520 \ -e {{#c|'probe @pfunc(ngx_open_cached_file){printf("%p\n",$cache);exit()}'}} {{#x|0x0}} ---- {{#x|♡}} It is {{#ci|faster}} and more {{#ci|accurate}} than asking Dane to check {{#x|nginx.conf}}. ---- {{#x|♡}} Let's start by using the {{#ci|sample}} configuration in Nginx's {{#x|official}} documentation. {{#cm|# file nginx.conf}} {{#kw|open_file_cache}} max={{#x|1000}} inactive=20s; ---- {{#x|♡}} Yay! Our online {{#x|metrics}} immediately showed even {{#ci|better}} numbers! ---- {{#x|♡}} What is the cache {{#ci|hit rate}} then? Can we {{#x|improve}} the cache configurations even further? ---- {{#cm|#!/usr/bin/env stap++}} {{#kw|global}} misses, total, in_ctx {{#kw|probe @pfunc}}(ngx_open_cached_file) { {{#kw|if}} (pid() == target()) { in_ctx = 1 total++ } } {{#kw|probe @pfunc}}(ngx_open_cached_file).return { {{#kw|if}} (pid() == target()) { in_ctx = 0 } } {{#kw|probe @pfunc}}(ngx_open_and_stat_file) { {{#kw|if}} (pid() == target() && in_ctx) { misses++ } } {{#kw|probe end}} { printf(\"nginx open file cache miss rate: %d%%\n\", misses * 100 / total) } ---- {{#v|$}} ./ngx-open-file-cache-misses.sxx -x 19642 WARNING: Start tracing process 19642... Hit Ctrl-C to end. ^C nginx open file cache {{#c|miss}} rate: {{#x|91%}} ---- {{#x|♡}} So only {{#ci|9% ~ 10%}} cache {{#x|hit rate}} for open_file_cache in our {{#c|production}} systems. ---- {{#x|♡}} Let's {{#ci|double}} the cache {{#x|size}}! {{#cm|# file nginx.conf}} {{#kw|open_file_cache}} max={{#x|2000}} inactive=180s; ---- {{#v|$}} ./ngx-open-file-cache-misses.sxx -x 7818 WARNING: Start tracing process 7818... Hit Ctrl-C to end. ^C nginx open file cache {{#c|miss}} rate: {{#x|79%}} ---- {{#x|♡}} Yay! 
The cache {{#ci|hit rate}} is also {{#x|doubled}}! {{#i|21%}} now! ---- # cache hit rate curve ---- {{#x|♡}} Lee said, \"try {{#ci|50k}}!\" ---- {{#x|♡}} Even a cache size of {{#x|20k}} did {{#ci|not}} fly. The overall performance was {{#x|dropping}}! ---- {{img src="image/off-cpu-flame-graph-nginx-cache-v4-2.png" width="729" height="400"}} ---- {{img src="image/open-file-cache-locking-local.png" width="546" height="179"}} ---- {{#x|♡}} So Nginx's open_file_cache is hopelessly waiting on shm {{#ci|locks}} when the cache size is {{#x|large}}. ---- {{#x|♡}} So Flame Graphs {{#ci|saved}} us again ☺ ---- {{#x|♡}} When we are focusing on optimizing {{#ci|one metric}}, we might {{#ci|introduce}} a new, bigger {{#x|bottleneck}} by accident. ---- {{#x|♡}} Flame Graphs can {{#x|always}} give us the {{#ci|whole picture}}. ---- {{#x|♡}} {{#x|Optimizations}} are also all about {{#ci|balance}}. {{img src="image/balance2.png" width="350" height="203"}} ---- {{#x|♡}} Nginx's open_file_cache is already a {{#ci|dead end}}. Let's focus on {{#x|file IO}} itself instead. ---- {{#v|$}} ./func-latency-distr.sxx -x 18243 --arg func={{#c|syscall.open}} --arg time={{#c|20}} Start tracing 18243... Please wait for 20 seconds. ---- Distribution of {{#c|sys_open}} latencies (in microseconds) max/avg/min: {{#x|565270}}/2225/5 value |-------------------------------------------------- count 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 731 16 |@@@@@@@@@@@@@@ 211 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 510 64 |@@@@ 65 128 | 2 256 |@@@@@@@@@@ 150 512 |@@@@@@@ 119 1024 |@ 21 2048 | 14 4096 | 9 8192 | 10 16384 | 3 32768 | {{#x|9}} 65536 | {{#x|4}} 131072 | {{#x|3}} 262144 | {{#x|5}} 524288 | {{#x|1}} ---- {{#x|♡}} Knowing how the latency of individual file IO operations is {{#ci|distributed}}, we can trace the {{#x|details}} of those \"slow samples\". ---- {{#v|$}} ./slow-vfs-reads.sxx -x 6954 --arg limit={{#c|100}} Start tracing 6954... Hit Ctrl-C to end. 
[1377049930] latency=481ms dev={{#x|sde1}} bytes_read={{#x|350}} err=0 errstr= [1377049934] latency=497ms dev={{#x|sdc1}} bytes_read={{#x|426}} err=0 errstr= [1377049945] latency=234ms dev={{#x|sdf1}} bytes_read={{#x|519}} err=0 errstr= [1377049947] latency=995ms dev={{#x|sdb1}} bytes_read={{#x|311}} err=0 errstr= [1377049949] latency=208ms dev={{#x|sde1}} bytes_read={{#x|594}} err=0 errstr= [1377049949] latency=430ms dev={{#x|sde1}} bytes_read={{#x|4096}} err=0 errstr= [1377049949] latency=338ms dev={{#x|sdd1}} bytes_read={{#x|402}} err=0 errstr= [1377049950] latency=511ms dev={{#x|sdc1}} bytes_read={{#x|5799}} err=0 errstr= ---- {{#x|♡}} So the slow samples are distributed {{#ci|evenly}} among all the disk drives, and the data volume involved in each call is also quite {{#x|small}}. ---- {{#x|♡}} {{#ci|Kernel}}-level off-CPU Flame Graphs ---- {{#v|$}} ./ngx-sample-bt-off-cpu -p 7635 {{#x|-k}} -t 10 > a.bt ---- {{img src="image/kernel-off-cpu-16m3-2.png" width="729" height="293"}} ---- {{img src="image/kernel-flamegraph-local.png" width="828" height="230"}} ---- {{#x|♡}} I love Flame Graphs because they are one kind of {{#ci|visualization}} that is truly {{#x|actionable}}. ---- {{#x|Credits}} Thanks Brendan Gregg for {{#ci|inventing}} Flame Graphs. Thanks to {{#ci|systemtap}}, which was created after dtrace. Thanks Joshua Dankbaar for {{#ci|walking}} me through our production environment. Thanks Ian Applegate for {{#ci|supporting}} use of systemtap in production. Thanks Dane for {{#ci|pushing}} everyone onto the same page. ---- Systems and systems' laws lay hid in night. God said, \"let dtrace be!\" and all was light. ---- ☺ {{#ci|Any questions}}? ☺