Scan and scrape websites using Python

David Bolton shows how to safely scrape the Linux Format archives in just 70 lines of Python without incurring our wrath!

OUR EXPERT

David Bolton once accidentally boosted the traffic for his firm’s website by 25% in one day by running a web scraper on it. Luckily, they never found out!

Ever since the web took off back in the mid-’90s, programmers have been writing software to extract data from web pages. It was quite a bit more difficult in those days, because much of the web’s HTML was handwritten and inconsistent, and graphics were used heavily for layout and styling because CSS didn’t come along for a few years.

The browsers of the time (Netscape and Internet Explorer) were quite forgiving of mistakes, so you could find closing tags wrongly nested or even missing entirely. A lot of the HTML had to be skipped over because it carried font information, images and other presentational clutter rather than content. Nowadays, the HTML is a lot cleaner.

Web scrapings

A scraper is a program that pretends to be a web browser. When it runs, it fetches one or more HTML pages from a website and processes the pages to extract the desired information. This isn’t always easy, however, for the following reasons:

1. Accessing the data may be tricky – does it require logging in or handling cookies, or does it use POST instead of GET for parameters? (See boxout, below, for more on GET and POST.)

2. It’s someone else’s server, so you need to be gentle accessing it. This means you should definitely not run 20 threads all accessing the same server at the same time.
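To make those two points concrete, here is a minimal sketch using only Python’s standard library. It is not the 70-line script this tutorial builds; every name in it (USER_AGENT, build_request, extract_links, scrape_politely) and the two-second delay are assumptions made for illustration. It shows how GET puts parameters in the URL while POST puts them in the request body, how links can be pulled out of a page, and how pages can be fetched one at a time with a pause in between rather than from 20 threads at once.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical identifier; a real scraper should say who is running it.
USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"


def build_request(url, params=None, method="GET"):
    """Build a GET request (parameters in the URL) or a POST request (parameters in the body)."""
    headers = {"User-Agent": USER_AGENT}
    encoded = urlencode(params or {})
    if method == "POST":
        return Request(url, data=encoded.encode("utf-8"), headers=headers, method="POST")
    if encoded:
        url = f"{url}?{encoded}"
    return Request(url, headers=headers, method="GET")


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the link targets found in a string of HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download one page over the network and return it as text."""
    with urlopen(build_request(url), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_politely(urls, fetch_page=fetch, delay=2.0, sleep=time.sleep):
    """Fetch pages one at a time, pausing between requests to go easy on the server."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # wait before every request except the first
        results[url] = extract_links(fetch_page(url))
    return results
```

Calling scrape_politely() with a list of page URLs returns a dictionary mapping each URL to the links found on that page. The fetch_page and sleep parameters exist so the loop can be tried out with canned HTML before it is pointed at a real server.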

This article is from Linux Format, December 2023.
