PERL
How to use Mojolicious for web scraping
Mark Gardner reveals how you can retrieve and parse HTML and XML from websites with a few lines of Perl and the Mojolicious framework.
Part One
OUR EXPERT
Mark Gardner is a software developer and blogger with over 25 years of IT experience. You can reach him at www.phoenixtrap.com and @markjgardner.
So much of the modern web is driven by services and front-end interfaces talking to APIs that it’s easy to lose sight of the fact that everything is ultimately presented in a soup of HTML markup. In the absence of a well-structured interface or format, sometimes the code you’re writing needs to pick apart the ingredients of that soup and parse out meaningful data. Perl’s Mojolicious web framework includes a set of components that make this task easier.
Although most Linux distros come with a version of Perl, it helps to have your own installation separate from the system so you’re not tied to a possibly older version that’s required to support operating system tools and other packages. This separate installation can live in your $HOME directory (or wherever you specify) with its own modules that neither require sudo to install nor interfere with those handled by the package manager.
The most popular tool for managing separate Perl installations is called Perlbrew. Installation instructions are at https://perlbrew.pl/Installation.html. You can install it with either of the following shell commands, depending on what you already have installed: