
wget mirroring with external references

I was having trouble mirroring a website whose images were all hosted on a different domain, in this case random subdomains of cloudfront.net. I tried adding *.cloudfront.net to the -D parameter, but that didn’t work. It turns out the wildcard isn’t needed: wget already includes all subdomains of each domain in the list, so the bare domain is enough:

wget -mkpEK -D www.allshepherdrescue.org,cloudfront.net -H -t 3 \
     --restrict-file-names=windows http://www.allshepherdrescue.org/

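In short: -m turns on mirror mode, -k rewrites links so they work when browsed locally, -p pulls in page requisites such as images and stylesheets, -E appends .html to HTML pages that lack the extension, and -K keeps an untouched .orig copy of every file whose links were converted. -H lets wget span hosts, -D limits that to the listed domains, -t 3 caps retries at three per file, and --restrict-file-names=windows escapes characters such as the ? in query-string URLs so the saved names work as static files. The manpage details all of this.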


Ruby to generate RSS feeds for sites that don’t offer them

There’s a site with an equipment exchange I wanted to keep track of. However, the exchange seems to run on a custom PHP page rather than vBulletin, so none of the site’s usual RSS feeds cover it. So I decided to write a scraper/feed generator that fetches the latest version every 5 minutes and produces an RSS feed I can read in Google Reader. The posting volume is low enough that this won’t be annoying in my daily feeds.

I usually use Ruby for this because it offers Hpricot, a very nice and fast scraper with an XPath interface. This time, I resolved to find something that does RSS generation better, and I stumbled upon RubyRSS (the rss library), which happens to ship with the standard Ruby distribution!
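
To give a rough idea of the feed-generation half, here is a minimal sketch using RSS::Maker from the rss library. It is not the script from this post: the scraping step is left out, and the listing data, the example.com URLs, and the names SAMPLE_LISTINGS and build_feed are all made up for illustration.

```ruby
require 'rss'

# Hypothetical listings standing in for whatever the scraper extracts.
SAMPLE_LISTINGS = [
  { title: "Used road bike", link: "http://example.com/exchange/1",
    posted: Time.utc(2010, 1, 2, 12, 0, 0) },
  { title: "Climbing harness", link: "http://example.com/exchange/2",
    posted: Time.utc(2010, 1, 3, 9, 30, 0) }
]

# Build an RSS 2.0 document (as a String) from an array of listing hashes.
def build_feed(listings)
  feed = RSS::Maker.make("2.0") do |maker|
    # RSS 2.0 requires a title, link, and description on the channel.
    maker.channel.title = "Equipment exchange"
    maker.channel.link = "http://example.com/exchange"
    maker.channel.description = "Latest listings scraped from the exchange page"

    # Sort items by date, newest first.
    maker.items.do_sort = true

    listings.each do |listing|
      maker.items.new_item do |item|
        item.title = listing[:title]
        item.link = listing[:link]
        item.date = listing[:posted]
      end
    end
  end
  feed.to_s
end

puts build_feed(SAMPLE_LISTINGS)
```

Run from cron every few minutes with the output redirected to a file under the web root, something like this would give a feed reader a URL to subscribe to.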