Mass Downloading 2600’s “Off The Hook” Radio Show

With the recent purchase of my 160 GB iPod classic, I’ve started getting more serious about my podcasts. The “problem” with them is that some feeds don’t let you subscribe very far back, and picking through an archive and downloading episodes one by one can be a chore. To take care of this, I’ve started writing little scripts to parse the web sites so that I can easily grab the files with wget.

For example, OTH’s radio archive doesn’t let you simply point wget at it (the links are hidden in pull-down menus), so I used everyone’s favorite web scraper tool, Lynx, to grab the page source, grep’d out the lines with URLs from the pull-down menus, sed’d those URL fragments into real URLs, and then redirected the result into a file that wget can read.

lynx -source http://www.2600.com/offthehook/archive_ra.html | grep /offthehook | sed 's_">.*__g' | sed 's_ __g' | sed 's_<optionvalue=".._http://www.2600.com_g' | sed 's_<option selected value=".._http://www.2600.com_g' | sed 's_<optionselectedvalue=".._http://www.2600.com_g' | sed 's_\t__g' > OTH

wget -r -l1 -t1 -nd -N -A.mp3 -erobots=off -i OTH
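
To see what the sed chain is doing without hitting the live site, here is a minimal sketch of the same extraction step run against a made-up fragment. It assumes each menu entry looks like <option value="../offthehook/..."> (inferred from the sed expressions above, not checked against the real page); the function name extract_urls and the sample paths are hypothetical.

```shell
#!/bin/sh
# Sketch: turn the archive page's <option> entries into absolute URLs.
# Assumption: every entry has the form <option ... value="../offthehook/...">label
extract_urls() {
    # keep only the menu lines, cut everything from the closing quote onward,
    # then swap the leading markup and the ".." for the site root
    grep '/offthehook' |
    sed -e 's_">.*__' \
        -e 's_^.*value="\.\._http://www.2600.com_'
}

# Hypothetical sample input standing in for the output of `lynx -source ...`
sample='<select>
<option selected value="../offthehook/2008/0108.html">January 2008
	<option value="../offthehook/2007/1207.html">December 2007
</select>'

printf '%s\n' "$sample" | extract_urls
```

Anchoring the second expression on value= handles the “selected” entries and any leading tabs in one pass, so the separate space- and tab-stripping steps are not needed in this version. The output of extract_urls can be redirected into OTH and handed to the wget line above unchanged.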

(Before I learned to use wget properly, I wasted time using regex to wget files from their site. That turned out to be a chore because the site names these files inconsistently, e.g. yyyymmdd, ddmmyy, or yymmdd[a,b,c] if it was a three-part show.)

I suppose my regex monkey work isn’t all wasted. I now have this big jumbled mess of file names that needs a coherent naming scheme, and regex seems to be the only way to go for that. (That’s a separate post once I iron out that script.)
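
As a rough first pass at that cleanup (a sketch, not the finished script), the names can at least be sorted by the shape of their date stamp. Only the three shapes mentioned above are handled; the function name classify_name and the sample file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: sort archive file names by the shape of their date stamp.
# Assumption: names are a date stamp plus ".mp3"; the samples below are invented.
classify_name() {
    base=${1%.mp3}
    case $base in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            # yyyymmdd is unambiguous, so rewrite it as yyyy-mm-dd
            printf '%s\n' "$base" | sed 's_^\(....\)\(..\)\(..\)$_\1-\2-\3.mp3_' ;;
        [0-9][0-9][0-9][0-9][0-9][0-9] | [0-9][0-9][0-9][0-9][0-9][0-9][a-c])
            # ddmmyy and yymmdd look alike; flag these for a manual pass
            printf 'ambiguous: %s\n' "$1" ;;
        *)
            printf 'unknown: %s\n' "$1" ;;
    esac
}

for f in 20080109.mp3 090108.mp3 080109b.mp3 notes.txt; do
    classify_name "$f"
done
```

The six-digit names are deliberately left alone here: a pattern by itself cannot tell ddmmyy from yymmdd, so something else (the year page the file came from, say) would have to break the tie.
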
Update: Here is a bash script by Peter Manis solving this same problem.

