{{#ci|Streaming}} regex matching and substitution by the {{#x|sregex}} library
☺{{#author|agentzh@gmail.com}}☺
{{#author|Yichun Zhang (agentzh)}}
{{img src="images/cloudflare2.gif" width="141" height="71"}}
{{#date|2013.06.03}}
----
{{#x|♡}} In {{#ci|efficient}} web servers, request bodies and response bodies are processed in {{#x|data chunks}}.
----
{{img src="images/data-chunks2.png" width="845" height="307"}}
----
{{#x|♡}} We usually use a {{#ci|fixed-size}} buffer even when we are processing a much {{#x|larger}} data stream.
----
{{img src="images/fixed-size-buffer.png" width="704" height="256"}}
----
{{#x|♡}} {{#ci|Backtracking}} regex engines suck.
----
{{img src="images/backtrack.png" width="704" height="384"}}
----
{{img src="images/backtrack-states.png" width="640" height="704"}}
----
{{#x|♡}} {{#ci|Thompson}}'s Construction Algorithm comes to the {{#x|rescue}}!
----
{{img src="images/thompson-states.png" width="704" height="768"}}
----
{{#x|♡}} It also supports {{#ci|submatch}} captures!
----
{{img src="images/thompson-submatch.png" width="768" height="768"}}
----
{{#x|♡}} DFAs {{#ci|cannot}} find the {{#i|beginnings}} of submatch captures without matching {{#x|backwards}}.
----
{{img src="images/dfa-submatch2.png" width="728" height="486"}}
----
{{#x|♡}} I {{#x|created}} the sregex library based on Russ Cox's {{#ci|re1}} library.
----
{{img src="images/sregex-github2.png" width="757" height="406"}}
----
{{#x|♡}} sregex is written in {{#ci|pure}} {{#x|C}}.
----
{{#x|♡}} sregex includes {{#ci|two}} engines: {{#x|Thompson}} VM & {{#x|Pike}} VM.
----
^ $ \A \z \b \B . \c [0-9a-z] [^0-9a-z] \d \D \s \S \h \H \v \V \w \W \cK \N ab a|b (a) (?:a) a? a* a+ a?? a*? a+? a{n} a{n,m} a{n,} a{n}? a{n,m}? a{n,}? \t \n \r \f ...
----
{{#x|♡}} sregex passes {{#ci|all}} the related test cases in both the official {{#x|PCRE}} 8.32 and {{#x|Perl}} 5.16.2 {{#i|test suites}}.
----
{{#kw|#include}} <sregex/sregex.h>
...
rc = {{#c|sre_vm_pike_exec}}(vm_ctx, {{#x|pos}}, {{#x|len}}, {{#x|last_buf}}, &pending_matched);
----
{{#x|♡}} The {{#x|Thompson}} VM has a simple {{#ci|Just-in-Time}} (JIT) compiler targeting {{#i|x86_64}}.
----
{{#x|♡}} The regex JIT compiler uses {{#ci|DynASM}}, which powers {{#x|LuaJIT}}'s interpreter.
----
{{#x|♡}} There are still a lot of important {{#ci|optimizations}} to do.
----
{{#x|♡}} My Nginx C module {{#x|ngx_replace_filter}} is the {{#ci|first user}} of sregex.
----
{{img src="images/github-replace-filter2.png" width="760" height="344"}}
----
{{#kw|location}} ~ '\.cpp$' {
    {{#cm|# proxy_pass ... / fastcgi_pass ...}}
    {{#cm|# remove all those ugly C/C++ comments:}}
    {{#kw|replace_filter}} {{#x|'/\*.*?\*/|//[^\n]*'}} {{#c|''}} g;
}
----
{{#cm|# skip C/C++ string literals:}}
{{#kw|replace_filter}} {{#x|"'(?:\\\\[^\n]|[^'\n])*'"}} {{#v|$&}} g;
{{#kw|replace_filter}} {{#x|'"(?:\\\\[^\n]|[^"\n])*"'}} {{#v|$&}} g;
----
{{#kw|replace_filter_max_buffered_size}} 8k;
----
☺ {{#ci|Thank you}}! ☺
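----
Backup slide: a minimal sketch of why a Thompson-style state-set simulation suits chunked input. This is not sregex's API or implementation. The `stream_matcher` type, the `matcher_init` and `matcher_feed` functions, and the hard-wired pattern `ab*c` are all hypothetical names made up for illustration. The only thing carried from one data chunk to the next is the set of live NFA states plus a byte counter, so no earlier chunk has to stay buffered just to detect a match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-built NFA for the pattern ab*c (illustration only,
 * unrelated to sregex's real data structures). */
enum {
    S_START   = 1,  /* waiting for 'a' */
    S_AFTER_A = 2   /* saw 'a' and zero or more 'b's; waiting for 'b' or 'c' */
};

typedef struct {
    unsigned active;  /* bitmask of live NFA states, kept across chunks */
    long     offset;  /* number of stream bytes consumed so far */
} stream_matcher;

static void matcher_init(stream_matcher *m)
{
    assert(m != NULL);
    m->active = 0;
    m->offset = 0;
}

/* Feed one chunk. Returns the absolute stream offset just past the first
 * match, or -1 if no match has completed yet. On a match the matcher stops
 * immediately and leaves the rest of the chunk unconsumed. */
static long matcher_feed(stream_matcher *m, const char *buf, size_t len)
{
    size_t i;

    assert(m != NULL && buf != NULL);

    for (i = 0; i < len; i++) {
        /* Unanchored search: a new match attempt may start at every byte. */
        unsigned cur  = m->active | S_START;
        unsigned next = 0;
        char     c    = buf[i];

        m->offset++;

        if ((cur & S_START) && c == 'a') {
            next |= S_AFTER_A;
        }

        if (cur & S_AFTER_A) {
            if (c == 'b') {
                next |= S_AFTER_A;      /* stay in the b* loop */
            } else if (c == 'c') {
                m->active = 0;          /* match complete; reset the state set */
                return m->offset;
            }
        }

        m->active = next;               /* all states advance in lockstep */
    }

    return -1;                          /* need more data */
}
```

Feeding the chunks `"xxab"`, `"bb"` and `"cyy"` in turn to one matcher yields -1, -1 and then 7: the match `abbbc` spans all three chunks, yet each call only looks at the bytes it was handed. Reporting where the match begins would additionally require capture bookkeeping per state, which this sketch leaves out.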