Introduction to {{#x|off-CPU}} Time {{#i|Flame Graphs}} ☺{{#author|agentzh@cloudflare.com}}☺ {{#author|Yichun Zhang (agentzh)}} {{#date|2013.08.22}} ---- {{#x|♡}} Classic Flame Graphs {{#x|are}} {{#ci|on}}-CPU time Flame Graphs per se. ---- {{img src="image/on-cpu-sketch2.png" width="600" height="280"}} ---- {{#x|♡}} We are already relying on them to {{#ci|optimize}} our Lua {{#x|WAF}} & Lua {{#x|CDN Brain}} (cfcheck) ---- {{img src="image/lua-cmsgpack-orig2.png" width="730" height="245"}} ---- {{img src="image/prealloc-local.png" width="463" height="123"}} ---- {{#x|♡}} I invented {{#ci|off}}-CPU time Flame Graphs somewhere near {{#x|Lake Tahoe}} 3 months ago. ---- {{img src="image/lake-tahoe.jpg" width="500" height="375"}} ---- {{img src="image/off-cpu-sketch2.png" width="600" height="225"}} ---- {{#x|♡}} I got the {{#ci|inspiration}} from Brendan Gregg's blog post \"{{#x|Off-CPU Performance Analysis}}\" ---- {{http://dtrace.org/blogs/brendan/2011/07/08/off-cpu-performance-analysis/}} ---- {{#x|♡}} Joshua Dankbaar {{#ci|grabbed}} me for an online issue right after the company {{#x|Kitchen Adventure}}. ---- {{#x|♡}} Time to cast a spell over our {{#x|Linux}} boxes with {{#ci|systemtap}}! ---- {{#x|♡}} I quickly wrote a {{#ci|macro-style}} language {{#x|extension}} named {{#x|stap++}} for systemtap with a little bit of Perl. {{https://github.com/agentzh/stapxx}} ---- {{img src="image/packets2.png" width="455" height="416"}} ---- {{#x|♡}} Nginx workers were being badly {{#ci|blocked}} by something on a {{#x|production box}} in Ashburn ---- {{#cm|/* pseudo-code for the nginx event loop */}} {{#kw|for}} (;;) { ret = epoll_wait(...); {{#cm|/* process new events}} {{#cm|and expired timers here... */}} } ---- {{#x|♡}} Let's write a simple {{#x|tool}} to trace the long {{#ci|blocking}} latencies in the Nginx {{#i|event loop}}! 
{{#v|$}} vim epoll-loop-blocking.sxx ---- {{#cm|#!/usr/bin/env stap++}} {{#kw|global}} begin {{#kw|probe}} {{#c|syscall.epoll_wait.return}} { {{#kw|if}} (target() == pid()) { begin = gettimeofday_ms() } } {{#kw|probe}} {{#c|syscall.epoll_wait}} { {{#kw|if}} (target() == pid() && begin > 0) { elapsed = gettimeofday_ms() - begin {{#kw|if}} (elapsed >= {{#v|$^arg_limit}} :{{#kw|default}}(200)) { printf(\"[%d] epoll loop blocked for %dms\n\", gettimeofday_s(), elapsed) } } } ---- {{#v|$}} ./epoll-loop-blocking.sxx -x {{#c|22845}} --arg limit={{#c|200}} Start tracing 22845... [1376595038] epoll loop blocked for 208ms [1376595040] epoll loop blocked for 485ms [1376595044] epoll loop blocked for 336ms [1376595049] epoll loop blocked for 734ms [1376595057] epoll loop blocked for 379ms [1376595061] epoll loop blocked for 227ms [1376595062] epoll loop blocked for 212ms [1376595066] epoll loop blocked for 390ms ---- {{#x|♡}} Is it {{#ci|file IO}} that's blocking here? ---- {{#cm|# add some code to trace file IO latency at the same time...}} {{#kw|global}} vfs_begin {{#kw|global}} vfs_latency {{#kw|probe}} {{#c|syscall.rename, syscall.open, syscall.sendfile*,}} {{#c|vfs.read, vfs.write}} { {{#kw|if}} (target() == pid()) { vfs_begin = gettimeofday_us() } } {{#kw|probe}} {{#c|syscall.rename.return, syscall.open.return,}} {{#c|syscall.sendfile*.return, vfs.read.return, vfs.write.return}} { {{#kw|if}} (target() == pid()) { vfs_latency += gettimeofday_us() - vfs_begin } } ---- {{#v|$}} ./epoll-loop-blocking-vfs.sxx -x {{#c|22845}} --arg limit={{#c|200}} Start tracing 22845... 
[1376596251] epoll loop blocked for 364ms (file IO: 19ms) [1376596266] epoll loop blocked for 288ms (file IO: 0ms) [1376596270] epoll loop blocked for 1002ms (file IO: 0ms) [1376596272] epoll loop blocked for 206ms (file IO: 5ms) [1376596280] epoll loop blocked for 218ms (file IO: 211ms) [1376596283] epoll loop blocked for 396ms (file IO: 9ms) ---- {{#x|♡}} Hmm...seems like {{#x|file IO}} is {{#ci|not}} the major factor here... ---- {{#x|♡}} I suddenly {{#ci|remembered}} my off-CPU time Flame Graph tool created {{#x|3 months ago}}... ---- {{https://github.com/agentzh/nginx-systemtap-toolkit#ngx-sample-bt-off-cpu}} ---- {{#v|$}} ./ngx-sample-bt-off-cpu -t {{#c|10}} -x {{#c|16782}} > a.bt {{#v|$}} stackcollapse-stap.pl a.bt > a.cbt {{#v|$}} flamegraph.pl a.cbt > a.svg ---- {{img src="image/off-cpu-flame-graph-nginx-cache2.png" width="729" height="400"}} ---- {{img src="image/flamegraph-v1-local.png" width="694" height="155"}} ---- {{#x|♡}} Okay, Nginx was mainly waiting on a {{#ci|lock}} in an {{#x|obsolete}} code path which was added to Nginx by one of us (a long time ago?) ---- {{#x|♡}} Let's just {{#ci|remove}} the guilty code path from our production system! ---- {{#x|♡}} Yay! The number of long-running requests (longer than 1 second) is almost halved! {{img src="image/slow-response-afer-fix-1.png" width="235" height="236"}} ---- {{#v|$}} ./epoll-loop-blocking-vfs.sxx -x {{#c|16738}} --arg limit={{#c|200}} Start tracing 16738... [1376626387] epoll loop blocked for 456ms (file IO: 455ms) [1376626388] epoll loop blocked for 207ms (file IO: 206ms) [1376626396] epoll loop blocked for 364ms (file IO: 363ms) [1376626402] epoll loop blocked for 350ms (file IO: 349ms) [1376626414] epoll loop blocked for 309ms (file IO: 309ms) ---- {{img src="image/off-cpu-flame-graph-nginx-cache-v2.png" width="729" height="400"}} ---- {{img src="image/flamegraph-v2-local.png" width="442" height="364"}} ---- {{#x|♡}} Okay, now it is {{#ci|file IO}} that's {{#x|killing}} us! 
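---- {{#x|♡}} The {{#ci|stack collapse}} step in the pipeline above (stackcollapse-stap.pl) can be sketched in a few lines of Python. This is only a sketch under assumptions, not the real script: it assumes each backtrace block is one frame per line, leaf first, terminated by a bare sample-count line, with blank lines between blocks; the frame names in the demo input are merely illustrative, borrowed from the Nginx functions discussed in this talk.

```python
# Sketch of the "stack collapse" step: fold multi-line backtraces
# into one-line "frame1;frame2;... count" records for flamegraph.pl.
# Assumed input format (not the real stackcollapse-stap.pl parser):
# frame lines leaf-first, a bare count line ends a block, blank
# lines separate blocks.
from collections import defaultdict

def collapse(lines):
    folded = defaultdict(int)
    frames = []
    for line in lines:
        line = line.strip()
        if not line:
            frames = []          # blank line: start a new backtrace
            continue
        if line.isdigit():
            # count line terminates the current backtrace block
            if frames:
                # flamegraph.pl expects root-first frame order
                folded[";".join(reversed(frames))] += int(line)
            frames = []
        else:
            frames.append(line)  # another stack frame (leaf first)
    return ["%s %d" % (k, v) for k, v in sorted(folded.items())]

# Illustrative input: two samples of the same open()-related stack.
sample = [
    "sys_open",
    "ngx_open_and_stat_file",
    "ngx_open_cached_file",
    "3",
    "",
    "sys_open",
    "ngx_open_and_stat_file",
    "ngx_open_cached_file",
    "2",
]
print("\n".join(collapse(sample)))
# prints: ngx_open_cached_file;ngx_open_and_stat_file;sys_open 5
```

Identical stacks get their sample counts merged, which is what makes the widths in the resulting Flame Graph proportional to total off-CPU time.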
---- {{#x|♡}} Let's tune Nginx's {{#ci|open_file_cache}} configurations to save the {{#x|open()}} system calls. ---- {{#x|♡}} But...wait...we have {{#ci|not}} even {{#x|enabled}} it yet in production... ---- {{#cm|# 2520 is the nginx worker process's pid}} {{#v|$}} stap++ -x 2520 \ -e {{#c|'probe @pfunc(ngx_open_cached_file){printf("%p\n",$cache);exit()}'}} {{#x|0x0}} ---- {{#x|♡}} It is {{#ci|faster}} and more {{#ci|accurate}} than asking Dane to check {{#x|nginx.conf}}. ---- {{#x|♡}} Let's start by using the {{#ci|sample}} configuration in Nginx's {{#x|official}} documentation. {{#cm|# file nginx.conf}} {{#kw|open_file_cache}} max={{#x|1000}} inactive=20s; ---- {{#x|♡}} Yay! Our online {{#x|metrics}} immediately showed even {{#ci|better}} numbers! ---- {{#x|♡}} What is the cache {{#ci|hit rate}} then? Can we {{#x|improve}} the cache configurations even further? ---- {{#cm|#!/usr/bin/env stap++}} {{#kw|global}} misses, total, in_ctx {{#kw|probe @pfunc}}(ngx_open_cached_file) { {{#kw|if}} (pid() == target()) { in_ctx = 1 total++ } } {{#kw|probe @pfunc}}(ngx_open_cached_file).return { {{#kw|if}} (pid() == target()) { in_ctx = 0 } } {{#kw|probe @pfunc}}(ngx_open_and_stat_file) { {{#kw|if}} (pid() == target() && in_ctx) { misses++ } } {{#kw|probe end}} { printf(\"nginx open file cache miss rate: %d%%\n\", misses * 100 / total) } ---- {{#v|$}} ./ngx-open-file-cache-misses.sxx -x 19642 WARNING: Start tracing process 19642... Hit Ctrl-C to end. ^C nginx open file cache {{#c|miss}} rate: {{#x|91%}} ---- {{#x|♡}} So only {{#ci|9% ~ 10%}} cache {{#x|hit rate}} for open_file_cache in our {{#c|production}} systems. ---- {{#x|♡}} Let's {{#ci|double}} the cache {{#x|size}}! {{#cm|# file nginx.conf}} {{#kw|open_file_cache}} max={{#x|2000}} inactive=180s; ---- {{#v|$}} ./ngx-open-file-cache-misses.sxx -x 7818 WARNING: Start tracing process 7818... Hit Ctrl-C to end. ^C nginx open file cache {{#c|miss}} rate: {{#x|79%}} ---- {{#x|♡}} Yay! 
The cache {{#ci|hit rate}} is also {{#x|doubled}}! {{#i|21%}} now! ---- # cache hit rate curve ---- {{#x|♡}} Lee said, \"try {{#ci|50k}}!\" ---- {{#x|♡}} Even a cache size of {{#x|20k}} did {{#ci|not}} fly. The overall performance was {{#x|dropping}}! ---- {{img src="image/off-cpu-flame-graph-nginx-cache-v4-2.png" width="729" height="400"}} ---- {{img src="image/open-file-cache-locking-local.png" width="546" height="179"}} ---- {{#x|♡}} So Nginx's open_file_cache is hopelessly waiting on shm {{#ci|locks}} when the cache size is {{#x|large}}. ---- {{#x|♡}} So Flame Graphs {{#ci|saved}} us again ☺ ---- {{#x|♡}} When we are focusing on optimizing {{#ci|one metric}}, we might {{#ci|introduce}} a new, bigger {{#x|bottleneck}} by accident. ---- {{#x|♡}} Flame Graphs can {{#x|always}} give us the {{#ci|whole picture}}. ---- {{#x|♡}} {{#x|Optimizations}} are also all about {{#ci|balance}}. {{img src="image/balance2.png" width="350" height="203"}} ---- {{#x|♡}} Nginx's open_file_cache is already a {{#ci|dead end}}. Let's focus on {{#x|file IO}} itself instead. ---- {{#v|$}} ./func-latency-distr.sxx -x 18243 --arg func={{#c|syscall.open}} --arg time={{#c|20}} Start tracing 18243... Please wait for 20 seconds. ---- Distribution of {{#c|sys_open}} latencies (in microseconds) max/avg/min: {{#x|565270}}/2225/5 value |-------------------------------------------------- count 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 731 16 |@@@@@@@@@@@@@@ 211 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 510 64 |@@@@ 65 128 | 2 256 |@@@@@@@@@@ 150 512 |@@@@@@@ 119 1024 |@ 21 2048 | 14 4096 | 9 8192 | 10 16384 | 3 32768 | {{#x|9}} 65536 | {{#x|4}} 131072 | {{#x|3}} 262144 | {{#x|5}} 524288 | {{#x|1}} ---- {{#x|♡}} Knowing how the latency of individual file IO operations is {{#ci|distributed}}, we can trace the {{#x|details}} of those \"slow samples\". ---- {{#v|$}} ./slow-vfs-reads.sxx -x 6954 --arg limit={{#c|100}} Start tracing 6954... Hit Ctrl-C to end. 
[1377049930] latency=481ms dev={{#x|sde1}} bytes_read={{#x|350}} err=0 errstr= [1377049934] latency=497ms dev={{#x|sdc1}} bytes_read={{#x|426}} err=0 errstr= [1377049945] latency=234ms dev={{#x|sdf1}} bytes_read={{#x|519}} err=0 errstr= [1377049947] latency=995ms dev={{#x|sdb1}} bytes_read={{#x|311}} err=0 errstr= [1377049949] latency=208ms dev={{#x|sde1}} bytes_read={{#x|594}} err=0 errstr= [1377049949] latency=430ms dev={{#x|sde1}} bytes_read={{#x|4096}} err=0 errstr= [1377049949] latency=338ms dev={{#x|sdd1}} bytes_read={{#x|402}} err=0 errstr= [1377049950] latency=511ms dev={{#x|sdc1}} bytes_read={{#x|5799}} err=0 errstr= ---- {{#x|♡}} So the slow samples are distributed {{#ci|evenly}} among all the disk drives, and the data volume involved in each call is also quite {{#x|small}}. ---- {{#x|♡}} {{#ci|Kernel}}-level off-CPU Flame Graphs ---- {{#v|$}} ./ngx-sample-bt-off-cpu -p 7635 {{#x|-k}} -t 10 > a.bt ---- {{img src="image/kernel-off-cpu-16m3-2.png" width="729" height="293"}} ---- {{img src="image/kernel-flamegraph-local.png" width="828" height="230"}} ---- {{#x|♡}} I love Flame Graphs because they are one kind of {{#ci|visualization}} that is truly {{#x|actionable}}. ---- {{#x|Credits}} Thanks Brendan Gregg for {{#ci|inventing}} Flame Graphs. Thanks to {{#ci|systemtap}}, which was created after dtrace. Thanks Joshua Dankbaar for {{#ci|walking}} me through our production environment. Thanks Ian Applegate for {{#ci|supporting}} use of systemtap in production. Thanks Dane for {{#ci|pushing}} everyone onto the same page. ---- Systems and systems' laws lay hid in night. God said, \"let dtrace be!\" and all was light. ---- ☺ {{#ci|Any questions}}? ☺