
The Road Data Scraper is a comprehensive Python tool designed to extract and process data from the WebTRIS Traffic Flow API. It is a complete rewrite of the ONS Road Data Pipeline originally written in R. You can refer to the documentation of the ONS Road Data Pipeline here and to the WebTRIS Traffic Flow API here.
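Under the hood, the scraper pulls its data over HTTP from the WebTRIS API. As a minimal illustration, here is a request against the public /sites endpoint, which lists the available road traffic sensor sites (the package's own endpoints, batching, and paging are not shown here):

```python
# Minimal illustration of querying the public WebTRIS v1.0 API that
# this package scrapes. The /sites endpoint and the top-level "sites"
# key reflect the public API documentation.
import requests

resp = requests.get("https://webtris.highwaysengland.co.uk/api/v1.0/sites")
resp.raise_for_status()
sites = resp.json()["sites"]

print(f"{len(sites)} sensor sites available")
print(sites[0])  # one site record: id, name, coordinates, status, ...
```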
To get started with the Road Data Scraper, ensure Python 3.9 is installed on your machine. If you're using Anaconda or Miniconda, you can create a virtual environment with Python 3.9 using: conda create --name py39 python=3.9
git clone https://github.com/dombean/road_data_scraper.git
cd road_data_scraper/
pip install -e .
cd src/road_data_scraper/
python main.py (or python3 main.py)
The Road Data Scraper project has the following structure:
├── setup.cfg
├── setup.py
├── pyproject.toml
├── api_main.py
├── Dockerfile
├── src
│   └── road_data_scraper
│       ├── config.ini
│       ├── main.py
│       ├── steps
│       │   ├── download.py
│       │   ├── file_handler.py
│       │   └── metadata.py
│       └── report
│           ├── report.py
│           └── road_data_report_template.ipynb
├── tests
├── requirements.txt
├── requirements_dev.txt
├── tox.ini
└── README.md
The project directory contains the following components:
config.ini: Configuration file for the Road Data Scraper pipeline.
main.py: Main script to run the Road Data Scraper pipeline.
setup.cfg, setup.py & pyproject.toml: Configuration files for the Python package.
api_main.py: Main script for running the Road Data Scraper as a FastAPI application.
Dockerfile: Dockerfile for building a Docker image of the Road Data Scraper.
src: Directory containing the source code of the Road Data Scraper.
  road_data_scraper: Package directory.
    steps: Module directory containing the main modules for data scraping.
      download.py: Module for scraping data from the WebTRIS Highways England API.
      file_handler.py: Module for handling files and directories in the data scraping process.
      metadata.py: Module for generating metadata for the road traffic sensor data scraping pipeline.
    report: Module directory for generating HTML reports.
      report.py: Module for generating HTML reports based on a template Jupyter notebook.
      road_data_report_template.ipynb: Template Jupyter notebook for generating the HTML report.
requirements.txt: File listing the required Python packages for the project.
requirements_dev.txt: File listing additional development-specific requirements for the project.
tox.ini: Configuration file for running tests using the Tox testing tool.
tests: Directory containing test files for the project.
README.md: Documentation file providing an overview and instructions for using the Road Data Scraper.
The main functionality of the Road Data Scraper resides in the src/road_data_scraper/steps directory, where the core modules for data scraping, file handling, and metadata generation are located. The road_data_report_template.ipynb file, which serves as the template for generating HTML reports, sits inside the src/road_data_scraper/report directory.
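report.py renders that template notebook into a standalone HTML report. A minimal sketch of the general pattern, assuming a papermill plus nbconvert workflow (an assumption; the actual module may use different libraries, and the data_path parameter below is illustrative):

```python
# Hypothetical sketch of notebook-templated report generation, in the
# spirit of report.py. The papermill/nbconvert combination and the
# "data_path" parameter are assumptions, not the module's real code.
import nbformat
import papermill as pm
from nbconvert import HTMLExporter

# Execute the template notebook with parameters injected.
pm.execute_notebook(
    "road_data_report_template.ipynb",
    "report_output.ipynb",
    parameters={"data_path": "outputs/road_data.csv"},  # illustrative
)

# Convert the executed notebook to a standalone HTML report.
nb = nbformat.read("report_output.ipynb", as_version=4)
html, _ = HTMLExporter().from_notebook_node(nb)
with open("road_data_report.html", "w") as f:
    f.write(html)
```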
The Dockerfile, located in the root directory, is used to build a Docker image of the Road Data Scraper, allowing for easy deployment and containerization of the application.
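api_main.py is what that image serves: it exposes the pipeline as a FastAPI application. A minimal sketch of that shape (the route name, parameters, and wiring below are assumptions, not the actual contents of api_main.py):

```python
# Hypothetical sketch of serving the pipeline as a FastAPI app, in the
# spirit of api_main.py. The route and parameters are assumptions.
from fastapi import FastAPI

app = FastAPI(title="Road Data Scraper")

@app.get("/scrape")
def scrape(start_date: str, end_date: str) -> dict:
    # The real app would hand these through to the pipeline entry point.
    return {"status": "started", "start_date": start_date, "end_date": end_date}

# FastAPI apps are typically served with uvicorn, e.g.:
#   uvicorn api_main:app --host 0.0.0.0 --port 80
```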
There are several configurable options in the config.ini file. To save output data to a Google Cloud bucket, adjust the GCP-related settings in that file.
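A hypothetical sketch of the file's shape; the section and key names below are illustrative, not the real file's keys, so check the downloaded config.ini for the actual options:

```ini
; Illustrative sketch only -- the real config.ini in the repository
; defines the actual section and key names.
[user_settings]
start_date = 01012021        ; scrape window start
end_date = 31122021          ; scrape window end
generate_report = True       ; build the HTML report from the template notebook

[gcp]
; Settings you would adjust to save output data to a Google Cloud bucket
gcp_storage = True
gcp_bucket_name = my-road-data-bucket
gcp_credentials = /absolute/path/to/credentials.json
```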
Follow the steps below to set up the Road Data Scraper on a Google Cloud VM instance:
Log in to Google Cloud Platform and click Compute Engine in the left sidebar.
Then, in the left sidebar, click Marketplace, search for Ubuntu 20.04 LTS (Focal), and click LAUNCH.
Name the instance appropriately; select COMPUTE-OPTIMISED (note: leave the defaults -- 4 vCPU, 16 GB memory); under Firewall, tick Allow HTTPS traffic; and finally CREATE the VM instance.
SSH into the VM instance.
Run the following commands: sudo apt-get update && sudo apt-get dist-upgrade -y && sudo apt-get install python3-pip -y && sudo apt-get install wget -y
Pip install the road_data_scraper package using the command: pip install road_data_scraper
Upload your GCP JSON credentials file.
Download the config.ini file using the command: wget https://raw.githubusercontent.com/dombean/road_data_scraper/main/src/road_data_scraper/config.ini
Download the runner.py file using the command: wget https://raw.githubusercontent.com/dombean/road_data_scraper/main/runner.py
Open runner.py and set the absolute path to the config.ini file (a sketch of this edit follows the list).
Change the config.ini parameters accordingly; see the README section: Adjusting the Config File (config.ini).
Run the Road Data Scraper Pipeline using the command: python3 runner.py
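For reference, editing runner.py amounts to pointing it at your config.ini. A hypothetical sketch only; the real runner.py, including the name of the pipeline entry point it calls, may differ:

```python
# Hypothetical sketch of runner.py. The import and run() entry point
# are assumptions; the CONFIG_PATH edit is the step described above.
from road_data_scraper.main import run  # assumed entry point

CONFIG_PATH = "/home/<your-user>/config.ini"  # set the absolute path here

if __name__ == "__main__":
    run(CONFIG_PATH)
```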
Ensure Docker and the Google Cloud SDK are installed locally. You will also need to authenticate with Google Cloud and configure Docker to use your credentials:
gcloud auth login
gcloud config set project <project-name>
gcloud auth configure-docker
Clone the repository, build the image, and run it locally to check that it works:
git clone https://github.com/dombean/road_data_scraper.git
cd road_data_scraper/
docker build -t road-data-scraper -f Dockerfile .
docker run -it --env PORT=80 -p 80:80 road-data-scraper
Tag the image for Google Container Registry, push it, and deploy it to Cloud Run:
docker tag road-data-scraper eu.gcr.io/<project-name>/road-data-scraper
docker push eu.gcr.io/<project-name>/road-data-scraper
gcloud run deploy road-data-scraper --image eu.gcr.io/<project-name>/road-data-scraper --platform managed --region europe-west2 --timeout "3600" --cpu "4" --memory "16Gi" --max-instances "3"
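Once deployed, gcloud prints the service URL. Assuming the app exposes a trigger route like the hypothetical /scrape sketch above (an assumption, not the app's confirmed API), a run could then be kicked off with a plain HTTP request:

```python
# Illustrative only: the /scrape route and its parameters come from the
# hypothetical FastAPI sketch above, not from the real api_main.py.
import requests

SERVICE_URL = "https://road-data-scraper-<hash>-ew.a.run.app"  # printed by gcloud run deploy

resp = requests.get(
    f"{SERVICE_URL}/scrape",
    params={"start_date": "01012021", "end_date": "31122021"},
    timeout=3600,  # matches the 3600 s timeout the service is deployed with
)
print(resp.json())
```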