Parsing Amazon with Hpricot
_why made a really sweet HTML parser called Hpricot. Paired with open-uri, it makes fetching and parsing a remote document easy. Here’s how to do it:
require 'rubygems'
require 'hpricot'
require 'open-uri'
puts "Grabbing Page..."
html = open("http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155")
puts "Parsing..."
doc = Hpricot.parse(html)
(doc.search("//table//td[@id='prodImageCell']")/:img).each do |link|
  p link.attributes
end
Which prints the attributes of the product image:
{"src"=>"http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg", "border"=>"0", "id"=>"prodImage", "height"=>"240", "alt"=>"Cobblers", "width"=>"240"}
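Since the attributes print as an ordinary hash, picking out the useful bits is plain Ruby. Here is a rough sketch, assuming `attributes` really does behave like a Hash with string keys (that is what the printed output suggests, but I haven't checked every Hpricot version). The `image_summary` helper is a made-up name for illustration, not part of Hpricot.

```ruby
# The attributes hash printed in the post, copied verbatim.
attrs = {
  "src"    => "http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg",
  "border" => "0",
  "id"     => "prodImage",
  "height" => "240",
  "alt"    => "Cobblers",
  "width"  => "240"
}

# Hypothetical helper: keeps the fields we care about and converts
# the pixel dimensions from strings to integers.
def image_summary(attrs)
  {
    :url    => attrs["src"],
    :title  => attrs["alt"],
    :width  => attrs["width"].to_i,
    :height => attrs["height"].to_i
  }
end

summary = image_summary(attrs)
puts "#{summary[:title]}: #{summary[:url]} (#{summary[:width]}x#{summary[:height]})"
```

In the real script you would call something like `image_summary(link.attributes)` inside the `each` block instead of `p link.attributes`.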
Or, if you prefer, the same thing as a one-liner from the shell:
ruby -rrubygems -ropen-uri -e "require 'hpricot';(Hpricot.parse(open('http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155')).search(\"//table//td[@id='prodImageCell']\")/:img).each {|link| p link.attributes }"
Amazing stuff, really. The parser is so fast that nearly all of the time is spent fetching the page, not parsing it!
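If you want to check the fetch-versus-parse split yourself, one way is to time the two steps separately with Ruby's standard `Benchmark` module. This is only a sketch: `time_fetch_and_parse` is a hypothetical helper, and the two lambdas below are stand-ins so the snippet runs without a network connection or the gem installed.

```ruby
require 'benchmark'

# Times two steps separately. `fetch` and `parse` are any callables;
# `parse` receives whatever `fetch` returned.
def time_fetch_and_parse(fetch, parse)
  body = nil
  doc  = nil
  fetch_secs = Benchmark.realtime { body = fetch.call }
  parse_secs = Benchmark.realtime { doc = parse.call(body) }
  { :fetch => fetch_secs, :parse => parse_secs, :doc => doc }
end

# Stand-ins (assumptions, not the real thing): the fake fetch sleeps to
# imitate network latency, and the fake parse is a crude regex scan.
# With the script from this post you would pass something like:
#   fetch = lambda { open(url) }
#   parse = lambda { |html| Hpricot.parse(html) }
fake_fetch = lambda { sleep 0.05; "<html><body><img src='x.jpg'></body></html>" }
fake_parse = lambda { |html| html.scan(/<img[^>]*>/) }

timings = time_fetch_and_parse(fake_fetch, fake_parse)
printf("fetch: %.3fs  parse: %.3fs\n", timings[:fetch], timings[:parse])
```

The numbers from the stand-ins mean nothing, of course. Swap in the real fetch and parse lambdas to see how the split looks for an actual page.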
Also, “Sunset, Sunrise” by Razor Ramon is awesome.


Jaime Iniesta
July 6, 2006 at 1:07 AM
Yes, Hpricot is great. I’ve tried it for a while locally and would like to use it on my web apps, but it’s hard to set up on Dreamhost as the gem is not installed there. Any clues?
Hank
July 6, 2006 at 1:07 AM
You can [set up a gem directory in your home directory.](http://rubygems.org/read/chapter/3#page83)