A Firefox cluster driven by JavaScript, Perl, and PL/PgSQL
----
A Firefox X<cluster> driven by I<JavaScript>, I<Perl>, & I<PL/PgSQL>

☺{{#author|agentzh@yahoo.cn}}☺
{{#author|章亦春 (agentzh)}}
{{#date|2009.2}}
----
"How about using X in a crawler CI?"
"Man, you're C!"
----
CM<✓> We're running X<24> headless Firefox processes on X<8> production machines (Linux), and their load is around X<3.0>.

CM<✓> We get CI<100,000> web pages crawled and analyzed by our Firefox cluster X.
----
{{img src="images/cluster-arch.png"}}
----
{{img src="images/firefox-guts.png"}}
----
TAG<☆> We use Firefox X to I Firefox's Gecko CI rather than talk to it from outside.
----
CM
KW<var> V = V.getElementById(C<'my-browser'>);
KW<var> V = KW<new> BrowserListener(browser);
V.register();
KW<var> V = KW<new> OpenResty.Client(
    { server: C<'http://api.openresty.org'>,
      user: C<'listhunter.Firefox'> }
);
V.callback = doTasks;
V.get(C<'/=/view/FirefoxGetTasks/count/200'>);
----
KW<function> doTasks(V, V) {
    KW<if> (V == KW<null>) V = 0;
    KW<var> V = V[V];
    KW<if> (V == KW<null>) KW<return>;
    V.loadPage(
        KW<function> (V, V) {
            KW<if> (V) {
                analyze(V.contentDocument);
            }
            doTasks(V, V + 1);
        },
        X<3> CM
    );
}
----
CM<☺> We did CI patch Firefox with only two small X:
➥ Redirect X outputs to CI
➥ Ignore X type CI
----
TAG<☆> The X CI the web page content via the X with cache so that X can load stuff from the CI directly.
----
{{img src="images/prefetcher-guts.png"}}
----
{{img src="images/proxy-guts.png"}}{{img src="images/proxy-guts2.png"}}
----
CM<☺> I added an CI config directive to X so that it CI everything about RFC.
{{img src="images/feather.gif"}}
----
CM<☺> I implemented a X module so that we can have CI cache storage for X
{{img src="images/feather.gif"}}
----
V
mod_disk_cache + SATA disk    200 ~ 300 QPS
mod_disk_cache + tmpfs        400 ~ 500 QPS
C                             X<2200+> QPS
----
{{img src="images/resty-guts.png"}}
----
CM<☺> OpenResty is a X wrapper for PostgreSQL.

It is trivial to CI PL/PgSQL functions/stored procedures to the outside world via I without losing security.
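----
The task loop on the earlier doTasks slide pulls a batch of tasks from an OpenResty view and walks through it one page at a time. A minimal sketch of that pattern follows. It is not the deck's actual code: `makeFakeLoader`, `runTasks`, the `url` field on each task, and the callback arguments are hypothetical names chosen for illustration, and the stub calls back synchronously where the real browser listener would call back once Gecko finishes loading. The meaning of the trailing `3` in the original call is not recoverable from the slide, so it is passed through as an opaque option.

```javascript
// Hypothetical stand-in for the browser listener. It "loads" a page and
// reports success or failure right away; the real extension would call
// back asynchronously after the page has finished loading.
function makeFakeLoader(failingUrls) {
  return {
    loadPage: function (url, callback, option) {
      // `option` mirrors the extra argument on the slide; unused here.
      var ok = failingUrls.indexOf(url) === -1;
      callback(ok, url);
    }
  };
}

// Process tasks strictly one after another: the next load starts only
// from inside the previous load's callback, so a single browser is never
// asked to load two pages at the same time.
function runTasks(loader, tasks, analyze, done, i) {
  if (i == null) i = 0;          // first call: start at the first task
  var task = tasks[i];
  if (task == null) {            // ran off the end of the batch
    done();
    return;
  }
  loader.loadPage(task.url, function (ok, url) {
    if (ok) analyze(url);        // failed loads are skipped, not retried
    runTasks(loader, tasks, analyze, done, i + 1);
  }, 3);
}

// Usage: analyze every page that loads, then report completion.
var analyzed = [];
runTasks(
  makeFakeLoader(['http://example.com/broken']),
  [{ url: 'http://example.com/a' },
   { url: 'http://example.com/broken' },
   { url: 'http://example.com/b' }],
  function (url) { analyzed.push(url); },
  function () { /* the batch is finished; fetch the next one here */ }
);
```

Chaining through the callback keeps each browser process busy with exactly one page, which is one plausible reason the slide's version recurses instead of looping.
----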
{{img src="images/pg.png"}}
----
{{img src="images/resty-cluster.png"}}
----
V
➥ Is the web page a X or a X?
➥ Extract links in the "CI CI" in list pages.
----
{{img src="images/listhunter-main.png"}}
----
{{img src="images/listhunter-main-list.png"}}
----
V
➥ Extract user X from C web pages
----
{{img src="images/commenthunter-main.png"}}
----
CI from our surfer girls (with 100 random Chinese commercial sites):
+ Precision ratio: X<97.6%>
+ Recall ratio: X<91.2%>
----
V<☺> X-based filters to rule out CI lists
{{img src="images/vision.png"}}
----
V.offsetWidth X<*> V.offsetHeight CM
V.offsetWidth X V.offsetHeight CM
CM
V.offsetLeft X<+> absolute x coordinate of V.offsetParent
CM
V.offsetTop X<+> absolute y coordinate of V.offsetParent
----
{{img src="images/commenthunter-vert.png"}}
----
{{img src="images/commenthunter-hlist.png"}}
----
{{img src="images/commenthunter-linkarea.png"}}
----
V<☺> I is CI but CI for the last filter
----
{{img src="images/commenthunter-rank.png"}}
----
V<♡> X's Test::Simple CI for I
{{img src="images/love-letter.jpg"}}
----
{{img src="images/listhunter-tests.png"}}
----
{{img src="images/listhunter-tests2.png"}}
----
{{img src="images/listhunter-tests-text.png"}}
----
Test.GuiMode = false;
Test.X(2 * V.length);
KW<for> (KW<var> V<i> = 0; V<i> < V.length; V<i>++) {
    Test.X(V<i> >= 0, C<'i is always non-negative'>);
    Test.X(V<i> * 2, V<i> + V<i>, C<'i x 2 = i + i'>);
}
Test.X();
----
V: X & X code CI
{{img src="images/js-icon.jpg"}} {{img src="images/perl_camel.jpg"}}
----
V<$ find js -name '*.js' | xargs wc -l>
    27 js/cli-prefs.js
   332 js/main.js
     3 js/test-data.js
   374 js/haiway-miner.js
    26 js/box.js
    32 js/util.js
     7 js/env.js
    62 js/benchmark-timer.js
    18 js/samples.js
   160 js/test.js
   329 js/filters.js
   151 js/browser-listener.js
   137 js/test-more.js
  X<1658 total>
----
V<$ find lib -name '*.pm' | xargs wc -l>
    39 lib/CommentHunter/View/Test.pm
   106 lib/CommentHunter/View/Main.pm
    34 lib/CommentHunter/View/Overlay.pm
    52 lib/CommentHunter/App.pm
   X<231 total>
----
CI by my X framework
{{img src="images/moosecamel.png"}}
----
A CI extension in X
{{img src="images/helloworld.jpg"}}
----
CM<# File lib/HelloWorld/App.pm>
KW<package> HelloWorld::App;

KW<our> V<$$VERSION>;
BEGIN { V<$$VERSION> = '0.01' }

KW<use> XUL::App::Schema;
KW<use> XUL::App schema {
    X< >> X X
    xpifile 'helloworld.xpi' =>
        name is 'HelloWorld',
        id is 'helloworld@agentz.agentz-office', CM<# FIXME>
        version is V<$$VERSION>,
        targets {
            Firefox => ['2.0' => '3.0a5'], CM<# FIXME>
        },
        creator is 'The HelloWorld development team',
        V<...>
----
V: "We have this CI syntax!"
V: "Hey, we do X ;)"
----
CM<# File lib/HelloWorld/View/HelloWin.pm>
KW<package> HelloWorld::View::HelloWin;

KW<use base> 'XUL::App::View::Base';
KW<use> Template::Declare::Tags 'XUL';

template main => KW<sub> {
    show 'header'; CM<# from XUL::App::View::Base>
    window {
        attr {
            id => "helloworld-hellowin",
            xmlns => V<$::XUL_NAME_SPACE>,
            title => _('Hello World ') . V<$$HelloWorld::App::VERSION>,
            V<...>
        }
        X