Parsing Amazon with Hpricot

July 6th, 2006

_why made a really sweet HTML parser called Hpricot. Combined with Open-URI, it makes parsing a remote document easy. Here’s how to grab the product image tag from an Amazon page:


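require 'rubygems'
require 'hpricot'   # require_gem is deprecated in newer RubyGems; a plain require works
require 'open-uri'

url = "http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155"

puts "Grabbing Page..."
html = open(url)    # open-uri lets open fetch a URL (Ruby 3.0+ needs URI.open instead)
puts "Parsing..."
doc = Hpricot.parse(html)

# Find every <img> inside the product image cell and print its attributes
(doc.search("//table//td[@id='prodImageCell']")/:img).each do |img|
  p img.attributes
end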

Running it prints the image tag's attributes:

{"src"=>"http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg", "border"=>"0", "id"=>"prodImage", "height"=>"240", "alt"=>"Cobblers", "width"=>"240"}
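
Since `attributes` comes back as a plain Hash of strings (as the output above shows), picking out the bits you care about is ordinary Ruby. Here is a minimal sketch that works on that exact hash. Note that `image_summary` is a made-up helper name for illustration, not something Hpricot provides.

```ruby
# The attributes hash from the output above, copied verbatim.
attrs = {
  "src"    => "http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg",
  "border" => "0",
  "id"     => "prodImage",
  "height" => "240",
  "alt"    => "Cobblers",
  "width"  => "240"
}

# Hypothetical helper (not part of Hpricot): summarize one image's attributes.
# Every value arrives as a string, so the dimensions are converted explicitly.
def image_summary(attrs)
  {
    :url    => attrs["src"],
    :title  => attrs["alt"],
    :width  => attrs["width"].to_i,
    :height => attrs["height"].to_i
  }
end

summary = image_summary(attrs)
puts "#{summary[:title]}: #{summary[:url]} (#{summary[:width]}x#{summary[:height]})"
```

In the real script you would call the helper on each element's attributes inside the `each` block instead of on a hard-coded hash.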

The whole script also fits in a one-liner:

ruby -rrubygems -ropen-uri -e "require 'hpricot';(Hpricot.parse(open('http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155')).search(\"//table//td[@id='prodImageCell']\")/:img).each {|link| p link.attributes }"

Amazing stuff, really. The parser is so fast that nearly all the time goes to fetching the page, not parsing it!
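
If you want to check the fetch-versus-parse split yourself, one way is to time the two stages separately. This is only a rough sketch: `time_stages` is a hypothetical helper, and the two lambdas are stand-ins (a `sleep` and a regex scan) so the example runs without a network connection or the gem installed.

```ruby
# Hypothetical helper: time a fetch stage and a parse stage separately.
# Both arguments are callables; parse receives whatever fetch returns.
def time_stages(fetch, parse)
  clock = lambda { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  t0 = clock.call
  raw = fetch.call
  t1 = clock.call
  result = parse.call(raw)
  t2 = clock.call

  { :fetch => t1 - t0, :parse => t2 - t1, :result => result }
end

# Stand-ins (assumptions, not the real thing). With the script above
# you would pass something like:
#   lambda { open(url) }  and  lambda { |io| Hpricot.parse(io) }
slow_fetch = lambda { sleep 0.05; "<td id='prodImageCell'><img alt='Cobblers'></td>" }
fast_parse = lambda { |html| html.scan(/<img[^>]*>/) }

timings = time_stages(slow_fetch, fast_parse)
printf("fetch: %.3fs  parse: %.3fs\n", timings[:fetch], timings[:parse])
```

The helper uses a monotonic clock rather than `Time.now` so the numbers are not thrown off if the system clock is adjusted mid-run.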

Also, “Sunset, Sunrise” by Razor Ramon is awesome.



  1. July 6th, 2006 at 01:07 | #1

    Yes, Hpricot is great. I’ve tried it for a while locally and would like to use it on my web apps, but it’s hard to set up on Dreamhost as the gem is not installed there. Any clues?

  2. Hank
    July 6th, 2006 at 01:07 | #2

    You can [set up a gem directory in your home directory.](http://rubygems.org/read/chapter/3#page83)
