Overview
I developed a PHP application that lets the client enter the address of a webpage and kick off an automated system that gathers actionable data to improve search engine rankings and pull in more clients.
Client Background
This client, a small web design shop on a tight budget, wanted a tool to scrape webpages and websites for useful information it could use to bring in more traffic and clients.
Project Overview
The project was to create a script (or set of scripts) using PHP to scrape webpages and/or full websites, gather specific information, and report the gathered information to the client for further analysis.
The client wanted a webpage with a form where they could enter the URL of a page or site to be crawled. The scripts also needed to run the scraping in the background, so the client wouldn't have to leave a page open in a browser.
The solution was two separate scripts: a front-end form, and a back-end worker to gather and process the data.
Services Provided
To keep the project simple, I used a plain text file to manage the queue of pages and sites to scrape. The front-end form was straightforward, except for ensuring that all writes to the queue file were atomic: the form and the worker could not touch the queue file at the same time, or jobs could be lost or corrupted.
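The locking itself boils down to PHP's flock(). The sketch below shows the idea for the form side only; the queue file name and the "url" form field are hypothetical stand-ins, not the client's actual code.

    <?php
    // Minimal sketch of the atomic append. Assumes a queue file named
    // queue.txt and a form field named "url" (both hypothetical).
    function enqueue_url(string $queueFile, string $url): bool
    {
        $fp = fopen($queueFile, 'a');
        if ($fp === false) {
            return false;
        }
        // Block until we hold the exclusive lock, so a concurrent
        // worker read/rewrite can never interleave with this append.
        if (!flock($fp, LOCK_EX)) {
            fclose($fp);
            return false;
        }
        fwrite($fp, $url . PHP_EOL);
        fflush($fp);           // flush to disk before releasing the lock
        flock($fp, LOCK_UN);
        fclose($fp);
        return true;
    }

    if (!empty($_POST['url']) && filter_var($_POST['url'], FILTER_VALIDATE_URL)) {
        enqueue_url('queue.txt', $_POST['url']);
    }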
The worker was more involved, both because scraping a larger site can take a long time and because it runs in the background. I decided early on that the script should recover from any crash automatically, since the client would likely only interact with the system through the web form. Running the worker as a cron job satisfied both requirements: it runs in the background, and a crashed run is simply replaced by the next scheduled one. To keep multiple instances of the worker from fighting over the queue (or multiplying until system resources ran out), each worker checks at startup whether another instance is already running, and only begins processing data if it is the only one. The worker also had to pull entries from the queue file atomically to prevent corruption, duplication, or loss of jobs. A sketch of both mechanisms follows.
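Here is a minimal sketch of how the two worker mechanisms can fit together, again with hypothetical file names rather than the client's actual code. A non-blocking flock() on a dedicated lock file serves as the single-instance check (the OS releases the lock if the process dies, so the next cron run recovers on its own), and the queue is read and rewritten under the same exclusive lock the form uses.

    <?php
    // worker.php - run from cron, e.g. every minute (path hypothetical):
    //   * * * * * /usr/bin/php /path/to/worker.php

    // Single-instance guard: if another worker holds the lock, exit.
    // A crashed worker's lock is released automatically by the OS.
    $lock = fopen('worker.lock', 'c');
    if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
        exit(0); // another instance is already running
    }

    // Atomically pop the first queued URL: read the file, rewrite it
    // without the consumed entry, all under an exclusive lock so the
    // form's appends can never interleave with the rewrite.
    function dequeue_url(string $queueFile): ?string
    {
        $fp = fopen($queueFile, 'c+');
        if ($fp === false) {
            return null;
        }
        if (!flock($fp, LOCK_EX)) {
            fclose($fp);
            return null;
        }
        $lines = [];
        while (($line = fgets($fp)) !== false) {
            $line = trim($line);
            if ($line !== '') {
                $lines[] = $line;
            }
        }
        $job = array_shift($lines); // null when the queue is empty
        ftruncate($fp, 0);
        rewind($fp);
        fwrite($fp, implode(PHP_EOL, $lines) . ($lines ? PHP_EOL : ''));
        fflush($fp);
        flock($fp, LOCK_UN);
        fclose($fp);
        return $job;
    }

    while (($url = dequeue_url('queue.txt')) !== null) {
        // ... fetch $url and extract the data the client needs ...
    }

Taking and releasing the queue lock once per job, rather than holding it for the whole run, lets the form keep enqueuing new URLs while a long scrape is in progress.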
I provided the client with the scripts and documentation, and helped troubleshoot issues caused by restrictions imposed by their web hosting provider.
Impact
As a result of this project, the client now has an efficient, automated way to gather actionable information they can use to improve their search engine ranking and attract more visitors and clients, both through the higher ranking itself and through other actions taken based on the gathered data.

