Parsing Amazon with Hpricot
_why made a really sweet HTML parser called Hpricot. Combined with open-uri, it lets you fetch and parse a remote document in just a few lines. Here's how to do it:

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'

    puts "Grabbing Page..."
    html = open("http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155")
    puts "Parsing..."
    doc = Hpricot.parse(html)
    # Find the product image cell, then print the attributes of every <img> inside it
    (doc.search("//table//td[@id='prodImageCell']")/:img).each do |link|
      p link.attributes
    end

Running it prints the attributes of the product image:

    {"src"=>"http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg", "border"=>"0", "id"=>"prodImage", "height"=>"240", "alt"=>"Cobblers", "width"=>"240"}

The same thing as a one-liner:

    ruby -rrubygems -ropen-uri -e "require 'hpricot';(Hpricot.parse(open('http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155')).search(\"//table//td[@id='prodImageCell']\")/:img).each {|link| p link.attributes }"

Amazing stuff, really. The parser is amazingly fast. All the time is spent fetching the [...]
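If you want to poke at the XPath query without hitting Amazon, here is a minimal sketch that runs the same selection against an inline snippet. A few loud assumptions: it uses REXML, which ships with Ruby, as a stand-in rather than Hpricot, so it only works on well-formed markup (real-world pages often are not, which is a big part of why you would reach for Hpricot). The SAMPLE_HTML markup and the image_attributes helper are made up for illustration and are not part of either library.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for the product page markup.
SAMPLE_HTML = <<-HTML
<html><body><table><tr>
  <td id="prodImageCell">
    <a href="#"><img src="cover.jpg" id="prodImage" alt="Cobblers" width="240" height="240" border="0" /></a>
  </td>
  <td id="otherCell"><img src="spacer.gif" alt="" /></td>
</tr></table></body></html>
HTML

# Hypothetical helper: returns one attribute hash per <img> found
# inside the table cell whose id is prodImageCell.
def image_attributes(html)
  doc = REXML::Document.new(html)
  results = []
  # Same table/td selection as in the post, with //img appended to reach the images.
  REXML::XPath.each(doc, "//table//td[@id='prodImageCell']//img") do |img|
    attrs = {}
    img.attributes.each { |name, value| attrs[name] = value }
    results << attrs
  end
  results
end

image_attributes(SAMPLE_HTML).each { |attrs| p attrs }
```

Only the image inside prodImageCell is picked up; the spacer image in the other cell is ignored, which is the point of the id filter in the XPath.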

