A Firefox cluster driven by JavaScript, Perl, and PL/PgSQL
----
A Firefox {{#x|cluster}} driven by
{{#i|JavaScript}}, {{#i|Perl}}, & {{#i|PL/PgSQL}}
☺{{#author|agentzh@yahoo.cn}}☺
{{#author|章亦春 (agentzh)}}
{{#date|2009.2}}
----
\"How about using {{#x|Firefox}} in a crawler {{#ci|cluster}}?\"
\"Man, you're {{#c|crazy}}!\"
----
{{#cm|✓}} We're running {{#x|24}} headless Firefox processes
on {{#x|8}} production machines (Linux) and their
load is around {{#x|3.0}}.
{{#cm|✓}} We get {{#ci|100,000}} web pages crawled and analyzed
by our Firefox cluster {{#x|every hour}}.
----
{{img src="#" width="0" height="0"}}
{{img src="images/cluster-arch.png" width="678" height="474"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/firefox-guts.png" width="816" height="653"}}
----
{{#tag|☆}} We use Firefox {{#x|extensions}} to {{#i|control}} Firefox's Gecko
{{#ci|from inside}} rather than talk to it from outside.
----
{{#cm|/* crawler.js */}}
{{#kw|var}} {{#v|browser}} = {{#v|document}}.getElementById({{#c|'my-browser'}});
{{#kw|var}} {{#v|browserListener}} = {{#kw|new}} BrowserListener(browser);
{{#v|browserListener}}.register();
{{#kw|var}} {{#v|openresty}} = {{#kw|new}} OpenResty.Client(
{
server: {{#c|'http://api.openresty.org'}},
user: {{#c|'listhunter.Firefox'}}
}
);
{{#v|openresty}}.callback = doTasks;
{{#v|openresty}}.get({{#c|'/=/view/FirefoxGetTasks/count/200'}});
----
{{#kw|function}} doTasks({{#v|tasks}}, {{#v|ind}}) {
{{#kw|if}} ({{#v|ind}} == {{#kw|null}}) {{#v|ind}} = 0;
{{#kw|var}} {{#v|task}} = {{#v|tasks}}[{{#v|ind}}];
{{#kw|if}} ({{#v|task}} == {{#kw|null}}) {{#kw|return}};
{{#v|browserListener}}.loadPage(
{{#kw|function}} ({{#v|url}}, {{#v|done}}) {
{{#kw|if}} ({{#v|done}}) {
analyze({{#v|browser}}.contentDocument);
}
doTasks({{#v|tasks}}, {{#v|ind}} + 1);
},
{{#x|3}} {{#cm|/* timeout in sec */}}
);
}
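----
The callback chain above can be tried outside Firefox with a stub. This is a minimal sketch, not the real crawler code: FakeListener is a hypothetical stand-in for BrowserListener, and the task.url field and the URL-first loadPage signature are assumptions made for illustration.

```javascript
// Hypothetical stand-in for the real BrowserListener: it decides
// synchronously whether a page 'loaded' before the timeout.
function FakeListener(slowUrls) {
    this.slowUrls = slowUrls; // URLs that should hit the timeout
}
FakeListener.prototype.loadPage = function (url, callback, timeoutSec) {
    var done = this.slowUrls.indexOf(url) === -1;
    callback(url, done);
};

// Visit tasks one at a time; each callback schedules the next task.
function runTasks(listener, tasks, analyze, ind) {
    if (ind == null) ind = 0;
    var task = tasks[ind];
    if (task == null) return; // past the end of the batch
    listener.loadPage(
        task.url, // assumed field name
        function (url, done) {
            if (done) analyze(url); // skip pages that timed out
            runTasks(listener, tasks, analyze, ind + 1);
        },
        3 /* timeout in sec */
    );
}

var analyzed = [];
var listener = new FakeListener(['http://example.com/slow']);
runTasks(listener, [
    { url: 'http://example.com/a' },
    { url: 'http://example.com/slow' },
    { url: 'http://example.com/b' }
], function (url) { analyzed.push(url); });
console.log(analyzed.join(' '));
```

A page that times out is skipped, but the chain still moves on to the next task, so one slow site cannot stall the whole batch.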
----
{{#cm|☺}} We did {{#ci|NOT}} patch Firefox
with only two small {{#x|exceptions}}:
+ ➥ Redirect {{#x|Error Console}} outputs to {{#ci|stderr}}
+ ➥ Ignore {{#x|CSS MIME}} type {{#ci|mismatch}}
----
{{#tag|☆}} The {{#x|prefetchers}} {{#ci|prefetch}} the web page content
via the {{#x|HTTP proxy}} with cache so that {{#x|Firefox}} can
load everything from the {{#ci|cache}} directly.
----
{{img src="#" width="0" height="0"}}
{{img src="images/prefetcher-guts.png" width="805" height="573"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/proxy-guts.png" width="433" height="339"}}{{img src="images/proxy-guts2.png" width="384" height="339"}}
----
{{#cm|☺}} I added an {{#ci|OverrideExpire}} config directive to {{#x|mod_cache}}
so that it {{#ci|forgets}} everything about the RFC.
{{img src="#" width="0" height="0"}}
{{img src="images/feather.gif" width="248" height="70"}}
----
{{#cm|☺}} I implemented a {{#x|mod_libmemcached_cache}} module
so that we can have {{#ci|distributed}} cache storage for {{#x|mod_cache}}
{{img src="#" width="0" height="0"}}
{{img src="images/feather.gif" width="248" height="70"}}
----
{{#v|Sample benchmark with 59 URLs, concurrency 200}}
mod_disk_cache + SATA disk 200 ~ 300 QPS
mod_disk_cache + tmpfs 400 ~ 500 QPS
{{#c|mod_libmemcached_cache}} {{#x|2200+}} QPS
----
{{img src="#" width="0" height="0"}}
{{img src="images/resty-guts.png" width="1072" height="440"}}
----
{{#cm|☺}} OpenResty is a {{#x|REST}} wrapper for PostgreSQL.
It is trivial to {{#ci|expose}} PL/PgSQL functions/stored procedures
to the outside world via {{#i|web services}} without losing security.
{{img src="#" width="0" height="0"}}
{{img src="images/pg.png" width="200" height="200"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/resty-cluster.png" width="904" height="501"}}
----
{{#v|List Hunter}}
➥ Is the web page a {{#x|list page}} or a {{#x|content page}}?
➥ Extract links in the \"{{#ci|main}} {{#ci|list}}\" in list pages.
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-main.png" width="1027" height="702"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-main-list.png" width="1027" height="702"}}
----
{{#v|Comment Hunter}}
➥ Extract user {{#x|comments}} from
{{#c|arbitrary}} web pages
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-main.png" width="901" height="602"}}
----
{{#ci|Test results}} from our surfer girls
(with 100 random Chinese commercial sites):
+ Precision ratio: {{#x|97.6%}}
+ Recall ratio: {{#x|91.2%}}
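----
For reference, the two ratios can be computed as follows. This is a minimal sketch with made-up comment ids, not the evaluation script behind the numbers above; precisionRecall is a hypothetical helper and assumes each id appears at most once per list.

```javascript
// precision = correct extractions / everything extracted
// recall    = correct extractions / everything that should be found
function precisionRecall(extracted, expected) {
    var wanted = {};
    expected.forEach(function (id) { wanted[id] = true; });
    var hits = extracted.filter(function (id) {
        return Object.prototype.hasOwnProperty.call(wanted, id);
    }).length;
    return {
        precision: extracted.length ? hits / extracted.length : 0,
        recall: expected.length ? hits / expected.length : 0
    };
}

// Made-up sample: 4 nodes extracted, 3 of them are real comments,
// and the page actually holds 5 comments.
var score = precisionRecall(
    ['c1', 'c2', 'c3', 'ad'],
    ['c1', 'c2', 'c3', 'c4', 'c5']
);
console.log(score.precision, score.recall); // 0.75 0.6
```

High precision means few non-comments slip through; high recall means few real comments are missed.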
----
{{#v|☺}} {{#x|Vision}}-based filters to rule out
{{#ci|non-comment}} lists
{{img src="#" width="0" height="0"}}
{{img src="images/vision.png" width="395" height="292"}}
----
{{#v|element}}.offsetWidth {{#x|*}} {{#v|element}}.offsetHeight {{#cm|// node area}}
{{#v|element}}.offsetWidth {{#x|/}} {{#v|element}}.offsetHeight {{#cm|// node shape}}
{{#cm|// x coordinate of element's upper-left corner}}
{{#v|element}}.offsetLeft {{#x|+}} absolute x coordinate of {{#v|element}}.offsetParent
{{#cm|// y coordinate of element's upper-left corner}}
{{#v|element}}.offsetTop {{#x|+}} absolute y coordinate of {{#v|element}}.offsetParent
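----
The recursive definition above unrolls into a loop over the offsetParent chain. This is a minimal sketch, not the filter code itself: the helper names are hypothetical, and scroll offsets and border widths are ignored. Plain objects stand in for DOM nodes so it runs outside a browser.

```javascript
// Sum offsetLeft/offsetTop up the offsetParent chain to get
// the absolute position of the element's upper-left corner.
function absolutePosition(element) {
    var x = 0, y = 0;
    for (var node = element; node != null; node = node.offsetParent) {
        x += node.offsetLeft;
        y += node.offsetTop;
    }
    return { x: x, y: y };
}

function nodeArea(element) {
    return element.offsetWidth * element.offsetHeight;
}

// Width-to-height ratio; a zero-height node is reported as 0 here,
// which is an arbitrary choice for this sketch.
function nodeShape(element) {
    if (element.offsetHeight === 0) return 0;
    return element.offsetWidth / element.offsetHeight;
}

// Plain objects standing in for DOM nodes.
var body = { offsetLeft: 0, offsetTop: 0, offsetParent: null };
var container = { offsetLeft: 10, offsetTop: 20, offsetParent: body };
var item = {
    offsetLeft: 5, offsetTop: 7,
    offsetWidth: 200, offsetHeight: 50,
    offsetParent: container
};
var pos = absolutePosition(item);
console.log(pos.x, pos.y, nodeArea(item), nodeShape(item)); // 15 27 10000 4
```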
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-vert.png" width="667" height="581"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-hlist.png" width="759" height="558"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-linkarea.png" width="759" height="625"}}
----
{{#v|☺}} {{#i|Ranking testing}} is {{#ci|expensive}}
but {{#ci|necessary}} for the last filter
----
{{img src="#" width="0" height="0"}}
{{img src="images/commenthunter-rank.png" width="808" height="625"}}
----
{{#v|♡}} {{#x|Perl}}'s Test::Simple {{#ci|love}} for
{{#i|extension JavaScript}}
{{img src="#" width="0" height="0"}}
{{img src="images/love-letter.jpg" width="360" height="315"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-tests.png" width="808" height="625"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-tests2.png" width="816" height="750"}}
----
{{img src="#" width="0" height="0"}}
{{img src="images/listhunter-tests-text.png" width="753" height="639"}}
----
Test.GuiMode = false;
Test.{{#x|plan}}(2 * {{#v|list}}.length);
{{#kw|for}} ({{#kw|var}} {{#v|i}} = 0; {{#v|i}} < {{#v|list}}.length; {{#v|i}}++) {
Test.{{#x|ok}}({{#v|i}} >= 0, {{#c|'i is always non-negative'}});
Test.{{#x|is}}({{#v|i}} * 2, {{#v|i}} + {{#v|i}}, {{#c|'i x 2 = i + i'}});
}
Test.{{#x|summary}}();
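----
An object with this plan/ok/is/summary shape takes only a few lines. The following is a minimal sketch that prints TAP-style lines, not the contents of js/test.js: MiniTest and its write hook are hypothetical names.

```javascript
// Tiny Test::Simple-like helper that emits TAP-style lines.
function MiniTest(write) {
    this.write = write || function (line) { console.log(line); };
    this.planned = 0;
    this.run = 0;
    this.failed = 0;
}
MiniTest.prototype.plan = function (n) {
    this.planned = n;
    this.write('1..' + n);
};
MiniTest.prototype.ok = function (cond, name) {
    this.run++;
    if (!cond) this.failed++;
    this.write((cond ? 'ok ' : 'not ok ') + this.run + ' - ' + name);
    return !!cond;
};
MiniTest.prototype.is = function (got, expected, name) {
    var pass = this.ok(got === expected, name);
    if (!pass) {
        this.write('#      got: ' + got);
        this.write('# expected: ' + expected);
    }
    return pass;
};
// Passes only if nothing failed and the plan was honoured.
MiniTest.prototype.summary = function () {
    var good = this.failed === 0 && this.run === this.planned;
    this.write('# ' + this.run + ' of ' + this.planned + ' planned tests run, ' +
        this.failed + ' failed');
    return good;
};

var list = ['a', 'b'];
var t = new MiniTest();
t.plan(2 * list.length);
for (var i = 0; i < list.length; i++) {
    t.ok(i >= 0, 'i is always non-negative');
    t.is(i * 2, i + i, 'i x 2 = i + i');
}
t.summary();
```

Because the output is plain text lines, a harness on the Perl side can read them from the process output.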
----
{{#v|Comment Hunter}}: {{#x|JavaScript}} & {{#x|Perl}} code {{#ci|only}}
{{img src="#" width="0" height="0"}}
{{img src="images/js-icon.jpg" width="125" height="125"}} {{img src="images/perl_camel.jpg" width="250" height="270"}}
----
{{#v|$ find js -name '*.js' | xargs wc -l}}
27 js/cli-prefs.js
332 js/main.js
3 js/test-data.js
374 js/haiway-miner.js
26 js/box.js
32 js/util.js
7 js/env.js
62 js/benchmark-timer.js
18 js/samples.js
160 js/test.js
329 js/filters.js
151 js/browser-listener.js
137 js/test-more.js
{{#x|1658 total}}
----
{{#v|$ find lib -name '*.pm' | xargs wc -l}}
39 lib/CommentHunter/View/Test.pm
106 lib/CommentHunter/View/Main.pm
34 lib/CommentHunter/View/Overlay.pm
52 lib/CommentHunter/App.pm
{{#x|231 total}}
----
{{#ci|Powered}} by my {{#x|XUL::App}} framework
{{img src="#" width="0" height="0"}}
{{img src="images/moosecamel.png" width="400" height="194"}}
----
A {{#ci|Hello World}} extension in {{#x|XUL::App}}
{{img src="#" width="0" height="0"}}
{{img src="images/helloworld.jpg" width="500" height="375"}}
----
{{#cm|# File lib/HelloWorld/App.pm}}
{{#kw|package}} HelloWorld::App;
{{#kw|our}} {{#v|$$VERSION}}; BEGIN { {{#v|$$VERSION}} = '0.01' }
{{#kw|use}} XUL::App::Schema;
{{#kw|use}} XUL::App schema {
{{#x|xulfile 'hellowin.xul' => }}
{{#x|generated from 'HelloWorld::View::HelloWin',}}
{{#x|includes qw( jquery.js hellowin.js );}}
xpifile 'helloworld.xpi' =>
name is 'HelloWorld',
id is 'helloworld@agentz.agentz-office', {{#cm|# FIXME}}
version is {{#v|$$VERSION}},
targets {
Firefox => ['2.0' => '3.0a5'], {{#cm|# FIXME}}
},
creator is 'The HelloWorld development team',
{{#v|...}}
----
{{#v|Ruby}}: \"We have this {{#ci|gorgeous}} syntax!\"
{{#v|Perl}}: \"Hey, we do {{#x|as well}} ;)\"
----
{{#cm|# File lib/HelloWorld/View/HelloWin.pm}}
{{#kw|package}} HelloWorld::View::HelloWin;
{{#kw|use base}} 'XUL::App::View::Base';
{{#kw|use}} Template::Declare::Tags 'XUL';
template main => {{#kw|sub}} {
show 'header'; {{#cm|# from XUL::App::View::Base}}
window {
attr {
id => \"helloworld-hellowin\",
xmlns => {{#v|$::XUL_NAME_SPACE}},
title => _('Hello World ') .
{{#v|$$HelloWorld::App::VERSION}},
{{#v|...}}
}
{{#x|label { _("Hello, world!") } }}
}
{{#v|...}}
----
{{#v|$}} {{#ci|xulapp}} {{#x|bundle}} .
Writing file hellowin.xul
Writing bundle file {{#x|./helloworld.xpi}}
{{#v|$}}
----
Our {{#kw|helloworld.xpi}} bundle {{#tag|➥}}
{{#cm|✓}} contains {{#ci|0}} Perl
{{#cm|✓}} has {{#ci|0}} dependencies
(except Firefox itself)
{{#cm|✓}} runs happily {{#ci|everywhere}}
(Win32, Linux, Mac, etc.)
----
The {{#ci|future}}
{{#cm|✓}} {{#ci|Open-source}} everything we have :)
{{#cm|✓}} More hunters, more fun: {{#i|Table Hunter}}, {{#i|Title Hunter}},
{{#i|Ranking Hunter}}, {{#i|Ads Hunter}}, {{#i|Summary Hunter}}, ...
{{#cm|✓}} Automatic C/C++ {{#x|XPCOM}} wrapper generator for XUL::App.
{{#cm|✓}} Bring Firefox extension love to Apple's {{#ci|WebKit}}
(A WebKit crawler cluster?)
----
☺ {{#ci|Any questions}}? ☺
----