I use the Perl module WWW::Mechanize as my go-to solution for anything I need to automate over the web. My dad, a web-scraping pro, recommended it as a good starting point when I was just getting started, and I’ve used it for pretty much everything since. There are a few specialized situations where it doesn’t work well, such as pages that depend heavily on JavaScript (which Mechanize does not execute), but for most of the sites I need to work with it does great. I used it this past week to help my sister automate the process of signing up campers for her library system’s summer reading program, and thought it would make a good intro for anyone who is interested in getting started with Mechanize.
The goal of the project was to take a spreadsheet with camper information and fill out a registration form for each one. The form looks something like this:
    <form method="post" action="splash.asp?id=1" onSubmit="return validateSignUp(this);">
    <input type="hidden" name="form_sr" value="1">
    <input type="hidden" name="form_postback" value="yes">
    <table width="100%" cellpadding="5" cellspacing="2">
      <tr class="bggrey"><td colspan="2" class="font12bold centered">Create a Teen Summer Reading account and get started! <span class="font12boldred">Fields in red are required.</a></td></tr>
      <tr class="bggrey01">
        <td width="30%" class="font12boldred" align="right">First Name:</td>
        <td width="70%"><input type="text" name="form_fname" size="40" value="" onBlur="populateTheOtherInputField(this);"; /></td>
      </tr>
      <tr class="bggrey02">
        <td class="font12boldred" align="right">Last Name:</td>
        <td><input type="text" name="form_lname" size="40" value="" onBlur="populateTheOtherInputField(this);"; /></td>
      </tr>
      <tr class="bggrey01">
        <td class="font12boldred" align="right">Phone Number:</td>
        <td><input type="text" name="form_phone" size="40" value="" /></td>
      </tr>
      <tr class="bggrey02">
        <td class="font12bold" align="right">Email:<br /><span class="font10">For easy password recovery.</span></td>
        <td><input type="text" name="form_email" size="40" value="" /></td>
      </tr>
      <!-- A bunch of code related to various drop-down boxes-->
      <!-- A captcha which doesn't appear if you are on the library network -->
      <input type="submit" value="Create Account" />
Setting up a Mechanize browser object is easy, as shown below. I also use Spreadsheet::ParseExcel to read the input spreadsheet.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use Spreadsheet::ParseExcel;    # module name has a lowercase "s" in "Spreadsheet"

    my ($filename) = @ARGV;
    die "Usage: $0 spreadsheet.xls\n" unless defined $filename;

    # autocheck => 0 means failed requests don't die; we check success() ourselves
    my $mech = WWW::Mechanize->new(autocheck => 0, timeout => 5);

    # Currently this hard codes in the age group
    my $base = "http://www.cmlibrary.org/programs/summer_reading/2014/splash.asp?id=1";

    my $parser    = Spreadsheet::ParseExcel->new();
    my $workbook  = $parser->parse($filename) or die $parser->error();
    my $worksheet = $workbook->worksheet('Sheet1') or die "No worksheet found";

    # Starts non-empty so the signup loop runs at least once
    my $username = " ";
To do the signup for each camper, we loop over the rows of the spreadsheet and pull out the values we need with “get_cell”. Then we load the signup page with Mechanize and select the form (the one I needed happened to be the 3rd one on the page). We use “field” to fill in text inputs and “select” for drop-down menus (there were a bunch; I’ve only shown one here). Lastly, we click the “Create Account” button and we’re all set:
    # Start at row index 3 (rows are zero-based); stop at the first empty username
    for (my $row = 3; $username ne ""; $row++) {
        # A missing cell means we are past the last filled row, so stop cleanly
        my $cell = $worksheet->get_cell($row, 6) or last;
        $username = $cell->value();
        $cell = $worksheet->get_cell($row, 7);
        my $password = $cell->value();
        $cell = $worksheet->get_cell($row, 0);
        my $firstname = $cell->value();
        $cell = $worksheet->get_cell($row, 1);
        my $lastname = $cell->value();
        $cell = $worksheet->get_cell($row, 2);
        my $phonenumber = $cell->value();

        if ($username) {
            print "Registering: ", $username, "\n";
            $mech->get($base);
            if ($mech->success()) {
                # The signup form is the 3rd form on the page
                $mech->form_number(3);
                # Fill fields in order of appearance
                $mech->field("form_fname", $firstname, 1);
                $mech->field("form_lname", $lastname, 1);
                $mech->field("form_phone", $phonenumber, 1);
                $mech->select("form_library", 99);    # Default library for now
                $mech->field("form_username", $username, 1);
                $mech->field("form_password", $password, 1);
                $mech->field("form_password2", $password, 1);
                $mech->field("form_memo", "This account was successfully automatically generated", 1);
                $mech->click_button(value => "Create Account");
            }
            else {
                warn "Could not load $base: ", $mech->status(), "\n";
            }
        }
    }
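Picking the form by position works until the site adds or removes a form. Here is a rough sketch of finding the form by the fields it contains instead. It uses HTML::Form (the class Mechanize uses for its form objects) on a made-up two-form page so it runs without touching the network; the page content, the example.com base URL, and the `find_form_number` helper are all assumptions for illustration. Inside a Mechanize script, `$mech->form_with_fields("form_fname", "form_lname")` should do the same job in one call.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in page: a search form, then a signup-style form.
my $html = <<'HTML';
<form action="search.asp"><input type="text" name="q"></form>
<form method="post" action="splash.asp?id=1">
  <input type="text" name="form_fname">
  <input type="text" name="form_lname">
  <input type="submit" value="Create Account">
</form>
HTML

# In list context, parse() returns one HTML::Form object per <form>.
my @forms = HTML::Form->parse($html, "http://example.com/");

# Hypothetical helper: 1-based position of the first form that
# contains every named field, or nothing if no form matches.
sub find_form_number {
    my ($forms, @names) = @_;
    for my $i (0 .. $#$forms) {
        my $form    = $forms->[$i];
        my $missing = grep { !defined $form->find_input($_) } @names;
        return $i + 1 unless $missing;
    }
    return;
}

my $number = find_form_number(\@forms, "form_fname", "form_lname");
print "Signup form is number $number\n";
```

The number it prints is the value you would otherwise hard code into `form_number`.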
While I was filling out forms this time, I’m usually pulling data off a web site. This is actually easier than filling out forms: just load the page with “$mech->get($url)” and then access the page source with “my $page = $mech->content” or similar. Normally I read the page in line by line and use regexes to pull out whatever I’m looking for. (Just remember, you can match known HTML easily with a regex, but don’t try to parse it =).
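A minimal sketch of that line-by-line approach is below. To keep it runnable offline, `$page` is a hard-coded stand-in for what `$mech->content` would return; the table markup, the book titles, and the `@books` structure are invented for the example and do not come from any real site.

```perl
use strict;
use warnings;

# Stand-in for "my $page = $mech->content;" (hypothetical markup).
my $page = <<'HTML';
<table>
<tr><td class="title">Hatchet</td><td class="status">Checked out</td></tr>
<tr><td class="title">Holes</td><td class="status">Available</td></tr>
</table>
HTML

my @books;
for my $line (split /\n/, $page) {
    # Match the known shape of one table row; skip every other line.
    next unless $line =~ m{<td class="title">([^<]+)</td><td class="status">([^<]+)</td>};
    push @books, { title => $1, status => $2 };
}

printf "%-10s %s\n", $_->{title}, $_->{status} for @books;
```

This only works because the rows have a fixed, known layout with one record per line. If the markup is nested or varies from row to row, a real HTML parser is the safer choice.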