Parsing Amazon with Hpricot
July 6th, 2006
_why made a really sweet HTML parser called Hpricot. It makes it easy to parse a remote document fetched with Open-URI. Here’s how to do it:
require 'rubygems'
require 'hpricot'
require 'open-uri'

puts "Grabbing Page..."
html = open("http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155")

puts "Parsing..."
doc = Hpricot.parse(html)

# Find the product image cell, then print the attributes of every <img> inside it
(doc.search("//table//td[@id='prodImageCell']")/:img).each do |link|
  p link.attributes
end
Which prints the attributes of the product image:
{"src"=>"http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg", "border"=>"0", "id"=>"prodImage", "height"=>"240", "alt"=>"Cobblers", "width"=>"240"}
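Once you have the attributes, pulling out the bits you actually care about is plain Ruby. Here is a minimal sketch, assuming the attributes behave like the Hash printed above. The `image_info` helper and the `SAMPLE_ATTRS` constant are names made up for this example, not part of Hpricot:

```ruby
require 'uri'

# Attribute hash copied from the output printed above.
SAMPLE_ATTRS = {
  "src"    => "http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg",
  "border" => "0",
  "id"     => "prodImage",
  "height" => "240",
  "alt"    => "Cobblers",
  "width"  => "240"
}

# Hypothetical helper (not part of Hpricot): picks the useful fields out of
# an attribute hash shaped like the one above.
def image_info(attrs)
  src = attrs["src"]
  {
    :url      => src,
    :filename => File.basename(URI.parse(src).path), # last path segment of the image URL
    :title    => attrs["alt"],
    :width    => Integer(attrs["width"]),            # attribute values arrive as strings
    :height   => Integer(attrs["height"])
  }
end

info = image_info(SAMPLE_ATTRS)
puts "#{info[:title]}: #{info[:filename]} (#{info[:width]}x#{info[:height]})"
```

Inside the loop you would probably call something like `image_info(link.attributes)` instead of using the sample constant, provided the attributes really do come back as a Hash in your version of Hpricot.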
The same script also fits in a shell one-liner:
ruby -rrubygems -ropen-uri -e "require 'hpricot';(Hpricot.parse(open('http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155')).search(\"//table//td[@id='prodImageCell']\")/:img).each {|link| p link.attributes }"
Amazing stuff, really. The parser is remarkably fast: nearly all the time is spent fetching the page, not parsing it!
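If you want to check the fetch-versus-parse split yourself, one way is to time each step separately with Ruby's standard Benchmark module. This is only a rough sketch: `timed` is a made-up helper, and the two blocks below are stand-ins so the snippet runs without network access or the gem installed.

```ruby
require 'benchmark'

# Hypothetical helper (not part of Hpricot or open-uri): runs the block and
# returns its result together with the elapsed wall-clock seconds.
def timed
  result = nil
  seconds = Benchmark.realtime { result = yield }
  [result, seconds]
end

# Stand-in work so the sketch runs anywhere. In the real script the blocks
# would be the fetch step (open on the product URL) and Hpricot.parse(html).
html, fetch_time = timed { sleep 0.05; "<html><body></body></html>" }
size, parse_time = timed { html.length }

puts "fetch: %.3fs, parse: %.3fs" % [fetch_time, parse_time]
```

Swap the stand-in blocks for the real fetch and parse calls and the two numbers should show where the time goes on your own connection.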
Also, “Sunset, Sunrise” by Razor Ramon is awesome.
