OUR EXPERT
David Bolton once accidentally boosted the traffic for his firm’s website by 25% in one day by running a web scraper on it. Luckily, they never found out!
Ever since the web made an appearance back in the mid-’90s, programmers have been writing software to extract data from web pages. It was quite a bit more difficult in those days, because much of the web was handwritten and inconsistent, and pages leaned heavily on graphics because CSS didn’t come along for a few more years.
The browsers of the time (Netscape and Internet Explorer) were quite forgiving of mistakes, so you could find closing tags wrongly nested or even missing entirely. A scraper also had to skip over a lot of HTML that held nothing but font information, images and other presentational clutter. Nowadays, the HTML is a lot cleaner.
Web scrapings
A scraper is a program that pretends to be a web browser. When it runs, it fetches one or more HTML pages from a website and processes the pages to extract the desired information. This isn’t always easy, however, for the following reasons:
1. Accessing the data may be tricky – does it require logging in or handling cookies, or does it use POST instead of GET for parameters? (See boxout, below, for more on GET and POST.)
2. It’s someone else’s server, so you need to be gentle accessing it. This means you should definitely not run 20 threads all accessing the same server at the same time.
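To make those two points concrete, here is a minimal Python sketch using only the standard library. It is an illustration rather than the scraper this article builds: the names build_request, fetch_politely and USER_AGENT, the example.com URL and the two-second delay are all assumptions chosen for the example. It shows how the same parameters travel in the URL for GET but in the request body for POST, and how to fetch pages one at a time with a pause in between instead of hammering the server from many threads.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical identifier; a real scraper should say who is running it.
USER_AGENT = "example-scraper/0.1"


def build_request(url, params=None, use_post=False):
    """Build a request that sends params by GET (default) or POST.

    Assumes url does not already contain a query string.
    """
    encoded = urllib.parse.urlencode(params or {})
    headers = {"User-Agent": USER_AGENT}
    if use_post:
        # POST: the parameters travel in the request body.
        return urllib.request.Request(
            url, data=encoded.encode("ascii"), headers=headers
        )
    # GET: the parameters are appended to the URL as a query string.
    if encoded:
        url = url + "?" + encoded
    return urllib.request.Request(url, headers=headers)


def fetch(request):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_politely(requests, delay_seconds=2.0, fetch_page=fetch,
                   sleep=time.sleep):
    """Fetch pages strictly one after another, pausing between requests."""
    pages = []
    for index, request in enumerate(requests):
        if index > 0:
            sleep(delay_seconds)  # give the server a breather
        pages.append(fetch_page(request))
    return pages
```

Calling fetch_politely([build_request("https://example.com/search", {"q": "linux"})]) would download a single page. The fetch_page and sleep parameters exist only so the loop can be tried out without touching a real server. A site that needs a login or cookies would need more than this, for instance a cookie-aware opener.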