A Firefox cluster driven by JavaScript, Perl, and PL/PgSQL
----
A Firefox {{#x|cluster}} driven by {{#i|JavaScript}}, {{#i|Perl}}, & {{#i|PL/PgSQL}}
☺{{#author|agentzh@yahoo.cn}}☺
{{#author|章亦春 (agentzh)}}
{{#date|2009.2}}
----
"How about using {{#x|Firefox}} in a crawler {{#ci|cluster}}?"
"Man, you're {{#c|crazy}}!"
----
{{#cm|✓}} We're running {{#x|24}} headless Firefox processes on {{#x|8}} production machines (Linux) and their load is around {{#x|3.0}}.
{{#cm|✓}} We get {{#ci|100,000}} web pages crawled and analyzed by our Firefox cluster {{#x|every hour}}.
----
{{img src="#" width="0" height="0"}}
{{img src="images/cluster-arch.png" width="678" height="474"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/firefox-guts.png" width="816" height="653"}}
----
{{#tag|☆}} We use Firefox {{#x|extensions}} to {{#i|control}} Firefox's Gecko {{#ci|from inside}} rather than talk to it from outside.
----
{{#cm|/* crawler.js */}}
{{#kw|var}} {{#v|browser}} = {{#v|document}}.getElementById({{#c|'my-browser'}});
{{#kw|var}} {{#v|browserListener}} = {{#kw|new}} BrowserListener(browser);
{{#v|browserListener}}.register();
{{#kw|var}} {{#v|openresty}} = {{#kw|new}} OpenResty.Client(
    { server: {{#c|'http://api.openresty.org'}}, user: {{#c|'listhunter.Firefox'}} }
);
{{#v|openresty}}.callback = doTasks;
{{#v|openresty}}.get({{#c|'/=/view/FirefoxGetTasks/count/200'}});
----
{{#kw|function}} doTasks({{#v|tasks}}, {{#v|ind}}) {
    {{#kw|if}} ({{#v|ind}} == {{#kw|null}}) {{#v|ind}} = 0;
    {{#kw|var}} {{#v|task}} = {{#v|tasks}}[{{#v|ind}}];
    {{#kw|if}} ({{#v|task}} == {{#kw|null}}) {{#kw|return}};
    {{#v|browserListener}}.loadPage(
        {{#kw|function}} ({{#v|url}}, {{#v|done}}) {
            {{#kw|if}} ({{#v|done}}) {
                analyze({{#v|browser}}.contentDocument);
            }
            doTasks({{#v|tasks}}, {{#v|ind}} + 1);
        },
        {{#x|3}} {{#cm|/* timeout in sec */}}
    );
}
----
{{#cm|☺}} We did {{#ci|NOT}} patch Firefox, with only two small {{#x|exceptions}}:
+ ➥ Redirect {{#x|Error Console}} output to {{#ci|stderr}}
+ 
➥ Ignore {{#x|CSS MIME}} type {{#ci|mismatch}}
----
{{#tag|☆}} The {{#x|prefetchers}} {{#ci|prefetch}} the web page content via the {{#x|HTTP proxy}} with cache so that {{#x|Firefox}} can load resources from the {{#ci|cache}} directly.
----
{{img src="#" width="0" height="0"}}
{{img src="images/prefetcher-guts.png" width="805" height="573"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/proxy-guts.png" width="433" height="339"}}{{img src="images/proxy-guts2.png" width="384" height="339"}}
----
{{#cm|☺}} I added an {{#ci|OverrideExpire}} config directive to {{#x|mod_cache}} so that it {{#ci|forgets}} everything about the RFC.
{{img src="#" width="0" height="0"}}
{{img src="images/feather.gif" width="248" height="70"}}
----
{{#cm|☺}} I implemented a {{#x|mod_libmemcached_cache}} module so that we can have {{#ci|distributed}} cache storage for {{#x|mod_cache}}.
{{img src="#" width="0" height="0"}}
{{img src="images/feather.gif" width="248" height="70"}}
----
{{#v|Sample benchmark with 59 URLs, concurrency 200}}
mod_disk_cache + SATA disk    200 ~ 300 QPS
mod_disk_cache + tmpfs        400 ~ 500 QPS
{{#c|mod_libmemcached_cache}}        {{#x|2200+}} QPS
----
{{img src="#" width="0" height="0"}}
{{img src="images/resty-guts.png" width="1072" height="440"}}
----
{{#cm|☺}} OpenResty is a {{#x|REST}} wrapper for PostgreSQL. It is trivial to {{#ci|expose}} PL/PgSQL functions/stored procedures to the outside world via {{#i|web services}} without losing security.
{{img src="#" width="0" height="0"}}
{{img src="images/pg.png" width="200" height="200"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/resty-cluster.png" width="904" height="501"}}
----
{{#v|List Hunter}}
➥ Is the web page a {{#x|list page}} or a {{#x|content page}}?
➥ Extract links from the "{{#ci|main}} {{#ci|list}}" in list pages.
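A toy sketch of the two List Hunter questions above. This is not the algorithm behind these slides: the heuristic (pick the container whose links carry the most anchor text, then compare that text against the whole page), the function names, the `container` field, and the thresholds are all assumptions made for illustration. Links are plain objects, so the sketch runs without a browser.

```javascript
// Hypothetical heuristic, NOT the List Hunter algorithm from the slides.
// Each link is a plain object { href, text, container }, where `container`
// is an assumed identifier for the link's nearest block-level ancestor.

// Sum the anchor text length of a group of links.
function anchorChars(links) {
    return links.reduce(function (sum, link) {
        return sum + link.text.length;
    }, 0);
}

// Return the links of the container with the most anchor text in total,
// or an empty array when there are no links at all.
function findMainList(links) {
    var groups = Object.create(null);
    links.forEach(function (link) {
        if (!groups[link.container]) groups[link.container] = [];
        groups[link.container].push(link);
    });
    var best = [];
    var bestScore = -1;
    Object.keys(groups).forEach(function (id) {
        var score = anchorChars(groups[id]);
        if (score > bestScore) {
            bestScore = score;
            best = groups[id];
        }
    });
    return best;
}

// Guess 'list' when the main list has at least `minLinks` links and its
// anchor text makes up at least `minRatio` of the page text; otherwise
// guess 'content'. Both thresholds are assumed, untuned parameters.
function classifyPage(links, pageTextLength, minLinks, minRatio) {
    var main = findMainList(links);
    var ratio = pageTextLength > 0 ? anchorChars(main) / pageTextLength : 0;
    return (main.length >= minLinks && ratio >= minRatio) ? 'list' : 'content';
}

// Made-up sample data: a short navigation bar and a longer headline list.
var sampleLinks = [
    { href: '/',      text: 'Home',  container: 'nav' },
    { href: '/about', text: 'About', container: 'nav' },
    { href: '/help',  text: 'Help',  container: 'nav' },
    { href: '/a/1', text: 'Firefox cluster goes live', container: 'main' },
    { href: '/a/2', text: 'Perl meets JavaScript',     container: 'main' },
    { href: '/a/3', text: 'Caching with memcached',    container: 'main' },
    { href: '/a/4', text: 'Testing XUL extensions',    container: 'main' }
];

console.log(findMainList(sampleLinks).length);        // 4
console.log(classifyPage(sampleLinks, 200, 3, 0.3));  // list
console.log(classifyPage(sampleLinks, 5000, 3, 0.3)); // content
```

Inside a real extension the link objects would have to be collected from `browser.contentDocument` first; that step is left out here on purpose.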
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-main.png" width="1027" height="702"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-main-list.png" width="1027" height="702"}}
----
{{#v|Comment Hunter}}
➥ Extract user {{#x|comments}} from {{#c|arbitrary}} web pages
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-main.png" width="901" height="602"}}
----
{{#ci|Test results}} from our surfer girls (with 100 random Chinese commercial sites):
+ Precision ratio: {{#x|97.6%}}
+ Recall ratio: {{#x|91.2%}}
----
{{#v|☺}} {{#x|Vision}}-based filters to rule out {{#ci|non-comment}} lists
{{img src="#" width="0" height="0"}}
{{img src="images/vision.png" width="395" height="292"}}
----
{{#v|element}}.offsetWidth {{#x|*}} {{#v|element}}.offsetHeight {{#cm|// node area}}
{{#v|element}}.offsetWidth {{#x|/}} {{#v|element}}.offsetHeight {{#cm|// node shape}}
{{#cm|// x coordinate of element's upper-left corner}}
{{#v|element}}.offsetLeft {{#x|+}} absolute x coordinate of {{#v|element}}.offsetParent
{{#cm|// y coordinate of element's upper-left corner}}
{{#v|element}}.offsetTop {{#x|+}} absolute y coordinate of {{#v|element}}.offsetParent
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-vert.png" width="667" height="581"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-hlist.png" width="759" height="558"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-linkarea.png" width="759" height="625"}}
----
{{#v|☺}} {{#i|Ranking testing}} is {{#ci|expensive}} but {{#ci|necessary}} for the last filter
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-rank.png" width="808" height="625"}}
----
{{#v|♡}} {{#x|Perl}}'s Test::Simple {{#ci|love}} for {{#i|extension JavaScript}}
{{img src="#" width="0" height="0"}}
{{img src="images/love-letter.jpg" width="360" height="315"}}
----
{{img src="#" width="0" height="0"}}
{{img 
src="images/listhunter-tests.png" width="808" height="625"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-tests2.png" width="816" height="750"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-tests-text.png" width="753" height="639"}}
----
Test.GuiMode = false;
Test.{{#x|plan}}(2 * {{#v|list}}.length);
{{#kw|for}} ({{#kw|var}} {{#v|i}} = 0; {{#v|i}} < {{#v|list}}.length; {{#v|i}}++) {
    Test.{{#x|ok}}({{#v|i}} >= 0, {{#c|'i is always non-negative'}});
    Test.{{#x|is}}({{#v|i}} * 2, {{#v|i}} + {{#v|i}}, {{#c|'i x 2 = i + i'}});
}
Test.{{#x|summary}}();
----
{{#v|Comment Hunter}}: {{#x|JavaScript}} & {{#x|Perl}} code {{#ci|only}}
{{img src="#" width="0" height="0"}}
{{img src="images/js-icon.jpg" width="125" height="125"}} {{img src="images/perl_camel.jpg" width="250" height="270"}}
----
{{#v|$ find js -name '*.js' | xargs wc -l}}
   27 js/cli-prefs.js
  332 js/main.js
    3 js/test-data.js
  374 js/haiway-miner.js
   26 js/box.js
   32 js/util.js
    7 js/env.js
   62 js/benchmark-timer.js
   18 js/samples.js
  160 js/test.js
  329 js/filters.js
  151 js/browser-listener.js
  137 js/test-more.js
 {{#x|1658 total}}
----
{{#v|$ find lib -name '*.pm' | xargs wc -l}}
   39 lib/CommentHunter/View/Test.pm
  106 lib/CommentHunter/View/Main.pm
   34 lib/CommentHunter/View/Overlay.pm
   52 lib/CommentHunter/App.pm
  {{#x|231 total}}
----
{{#ci|Powered}} by my {{#x|XUL::App}} framework
{{img src="#" width="0" height="0"}}
{{img src="images/moosecamel.png" width="400" height="194"}}
----
A {{#ci|Hello World}} extension in {{#x|XUL::App}}
{{img src="#" width="0" height="0"}}
{{img src="images/helloworld.jpg" width="500" height="375"}}
----
{{#cm|# File lib/HelloWorld/App.pm}}
{{#kw|package}} HelloWorld::App;
{{#kw|our}} {{#v|$$VERSION}}; BEGIN { {{#v|$$VERSION}} = '0.01' }
{{#kw|use}} XUL::App::Schema;
{{#kw|use}} XUL::App schema {
    {{#x|xulfile 'hellowin.xul' => }}
        {{#x|generated from 'HelloWorld::View::HelloWin',}}
        {{#x|includes qw( jquery.js hellowin.js );}}
    xpifile 
'helloworld.xpi' =>
        name is 'HelloWorld',
        id is 'helloworld@agentz.agentz-office', {{#cm|# FIXME}}
        version is {{#v|$$VERSION}},
        targets {
            Firefox => ['2.0' => '3.0a5'], {{#cm|# FIXME}}
        },
        creator is 'The HelloWorld development team',
        {{#v|...}}
----
{{#v|Ruby}}: "We have this {{#ci|gorgeous}} syntax!"
{{#v|Perl}}: "Hey, we do {{#x|as well}} ;)"
----
{{#cm|# File lib/HelloWorld/View/HelloWin.pm}}
{{#kw|package}} HelloWorld::View::HelloWin;
{{#kw|use base}} 'XUL::App::View::Base';
{{#kw|use}} Template::Declare::Tags 'XUL';
template main => {{#kw|sub}} {
    show 'header'; {{#cm|# from XUL::App::View::Base}}
    window {
        attr {
            id => "helloworld-hellowin",
            xmlns => {{#v|$::XUL_NAME_SPACE}},
            title => _('Hello World ') . {{#v|$$HelloWorld::App::VERSION}},
            {{#v|...}}
        }
        {{#x|label { _("Hello, world!") } }}
    }
    {{#v|...}}
----
{{#v|$}} {{#ci|xulapp}} {{#x|bundle}} .
Writing file hellowin.xul
Writing bundle file {{#x|./helloworld.xpi}}
{{#v|$}}
----
Our {{#kw|helloworld.xpi}} bundle {{#tag|➥}}
{{#cm|✓}} contains {{#ci|0}} Perl
{{#cm|✓}} has {{#ci|0}} dependencies (except Firefox itself)
{{#cm|✓}} runs happily {{#ci|everywhere}} (Win32, Linux, Mac, etc.)
----
The {{#ci|future}}
{{#cm|✓}} {{#ci|Open-source}} everything we have :)
{{#cm|✓}} More hunters, more fun: {{#i|Table Hunter}}, {{#i|Title Hunter}}, {{#i|Ranking Hunter}}, {{#i|Ads Hunter}}, {{#i|Summary Hunter}}, ...
{{#cm|✓}} Automatic C/C++ {{#x|XPCOM}} wrapper generator for XUL::App.
{{#cm|✓}} Bring Firefox extension love to Apple's {{#ci|WebKit}} (a WebKit crawler cluster?)
----
☺ {{#ci|Any questions}}? ☺
----