
The Road Data Scraper is a comprehensive Python tool designed to extract and process data from the WebTRIS Traffic Flow API. It is a complete rewrite of the ONS Road Data Pipeline originally written in R. You can refer to the documentation of the ONS Road Data Pipeline here and to the WebTRIS Traffic Flow API here.
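Under the hood, the scraper pulls its data over HTTP from the WebTRIS API. As a minimal illustration, here is a request against the public /sites endpoint, which lists the available road traffic sensor sites (the package's own endpoints, batching, and paging are not shown here):

```python
# Minimal illustration of querying the public WebTRIS v1.0 API that
# this package scrapes. The /sites endpoint and the top-level "sites"
# key reflect the public API documentation.
import requests

resp = requests.get("https://webtris.highwaysengland.co.uk/api/v1.0/sites")
resp.raise_for_status()
sites = resp.json()["sites"]

print(f"{len(sites)} sensor sites available")
print(sites[0])  # one site record: id, name, coordinates, status, ...
```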
To get started with the Road Data Scraper, ensure Python 3.9 is installed on your machine. If you're using Anaconda or Miniconda, you can create a virtual environment with Python 3.9 using: conda create --name py39 python=3.9
git clone https://github.com/dombean/road_data_scraper.git
cd road_data_scraper/
pip install -e .
cd src/road_data_scraper/
python main.py (or python3 main.py)
The Road Data Scraper project has the following structure:
├── setup.cfg
├── setup.py
├── pyproject.toml
├── api_main.py
├── Dockerfile
├── src
│   └── road_data_scraper
│       ├── config.ini
│       ├── main.py
│       ├── steps
│       │   ├── download.py
│       │   ├── file_handler.py
│       │   └── metadata.py
│       └── report
│           ├── report.py
│           └── road_data_report_template.ipynb
├── tests
├── requirements.txt
├── requirements_dev.txt
├── tox.ini
└── README.md
The project directory contains the following components:
config.ini: Configuration file for the Road Data Scraper pipeline.
main.py: Main script to run the Road Data Scraper pipeline.
setup.cfg, setup.py & pyproject.toml: Configuration files for the Python package.
api_main.py: Main script for running the Road Data Scraper as a FastAPI application.
Dockerfile: Dockerfile for building a Docker image of the Road Data Scraper.
src: Directory containing the source code of the Road Data Scraper.
  road_data_scraper: Package directory.
    steps: Module directory containing the main modules for data scraping.
      download.py: Module for scraping data from the WebTRIS Highways England API.
      file_handler.py: Module for handling files and directories in the data scraping process.
      metadata.py: Module for generating metadata for the road traffic sensor data scraping pipeline.
    report: Module directory for generating HTML reports.
      report.py: Module for generating HTML reports based on a template Jupyter notebook.
      road_data_report_template.ipynb: Template Jupyter notebook for generating the HTML report.
requirements.txt: File listing the required Python packages for the project.
requirements_dev.txt: File listing additional development-specific requirements for the project.
tox.ini: Configuration file for running tests using the Tox testing tool.
tests: Directory containing test files for the project.
README.md: Documentation file providing an overview and instructions for using the Road Data Scraper.
The main functionality of the Road Data Scraper resides in the src/road_data_scraper/steps directory, where the core modules for data scraping, file handling, and metadata generation are located. The road_data_report_template.ipynb file, which serves as the template for generating HTML reports, sits inside the src/road_data_scraper/report directory.
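report.py renders that template notebook into a standalone HTML report. A minimal sketch of the general pattern, assuming a papermill plus nbconvert workflow (an assumption; the actual module may use different libraries, and the data_path parameter below is illustrative):

```python
# Hypothetical sketch of notebook-templated report generation, in the
# spirit of report.py. The papermill/nbconvert combination and the
# "data_path" parameter are assumptions, not the module's real code.
import nbformat
import papermill as pm
from nbconvert import HTMLExporter

# Execute the template notebook with parameters injected.
pm.execute_notebook(
    "road_data_report_template.ipynb",
    "report_output.ipynb",
    parameters={"data_path": "outputs/road_data.csv"},  # illustrative
)

# Convert the executed notebook to a standalone HTML report.
nb = nbformat.read("report_output.ipynb", as_version=4)
html, _ = HTMLExporter().from_notebook_node(nb)
with open("road_data_report.html", "w") as f:
    f.write(html)
```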
The Dockerfile, located in the root directory, is used to build a Docker image of the Road Data Scraper, allowing for easy deployment and containerization of the application.
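api_main.py is what that image serves: it exposes the pipeline as a FastAPI application. A minimal sketch of that shape (the route name, parameters, and wiring below are assumptions, not the actual contents of api_main.py):

```python
# Hypothetical sketch of serving the pipeline as a FastAPI app, in the
# spirit of api_main.py. The route and parameters are assumptions.
from fastapi import FastAPI

app = FastAPI(title="Road Data Scraper")

@app.get("/scrape")
def scrape(start_date: str, end_date: str) -> dict:
    # The real app would hand these through to the pipeline entry point.
    return {"status": "started", "start_date": start_date, "end_date": end_date}

# FastAPI apps are typically served with uvicorn, e.g.:
#   uvicorn api_main:app --host 0.0.0.0 --port 80
```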
There are several configurable options in the config.ini file. To save output data to a Google Cloud bucket, adjust the GCP-related settings in that file.
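A hypothetical sketch of the file's shape; the section and key names below are illustrative, not the real file's keys, so check the downloaded config.ini for the actual options:

```ini
; Illustrative sketch only -- the real config.ini in the repository
; defines the actual section and key names.
[user_settings]
start_date = 01012021        ; scrape window start
end_date = 31122021          ; scrape window end
generate_report = True       ; build the HTML report from the template notebook

[gcp]
; Settings you would adjust to save output data to a Google Cloud bucket
gcp_storage = True
gcp_bucket_name = my-road-data-bucket
gcp_credentials = /absolute/path/to/credentials.json
```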
Follow the steps below to set up the Road Data Scraper on a Google Cloud VM instance:
Log in to Google Cloud Platform and click Compute Engine in the left sidebar.
Then, in the left sidebar, click Marketplace, search for Ubuntu 20.04 LTS (Focal), and click LAUNCH.
Name the instance appropriately; select COMPUTE-OPTIMISED (note: leave the defaults -- 4 vCPU, 16 GB memory); under Firewall, tick Allow HTTPS traffic; and finally CREATE the VM instance.
SSH into the VM instance.
Run the following commands: sudo apt-get update && sudo apt-get dist-upgrade -y && sudo apt-get install python3-pip -y && sudo apt-get install wget -y
Pip install the road_data_scraper package using the command: pip install road_data_scraper
Upload your GCP JSON credentials file.
Download the config.ini file using the command: wget https://raw.githubusercontent.com/dombean/road_data_scraper/main/src/road_data_scraper/config.ini
Download the runner.py file using the command: wget https://raw.githubusercontent.com/dombean/road_data_scraper/main/runner.py
Open runner.py and set the absolute path to the config.ini file (a sketch of this edit follows the list).
Change the config.ini parameters accordingly; see the README section: Adjusting the Config File (config.ini).
Run the Road Data Scraper Pipeline using the command: python3 runner.py
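For reference, editing runner.py amounts to pointing it at your config.ini. A hypothetical sketch only; the real runner.py, including the name of the pipeline entry point it calls, may differ:

```python
# Hypothetical sketch of runner.py. The import and run() entry point
# are assumptions; the CONFIG_PATH edit is the step described above.
from road_data_scraper.main import run  # assumed entry point

CONFIG_PATH = "/home/<your-user>/config.ini"  # set the absolute path here

if __name__ == "__main__":
    run(CONFIG_PATH)
```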
Ensure Docker and the Google Cloud SDK are installed locally. You will also need to authenticate with Google Cloud and configure Docker to use your credentials:
gcloud auth login
gcloud config set project <project-name>
gcloud auth configure-docker
Clone the repository, build the image, and run it locally to check that it works:
git clone https://github.com/dombean/road_data_scraper.git
cd road_data_scraper/
docker build -t road-data-scraper -f Dockerfile .
docker run -it --env PORT=80 -p 80:80 road-data-scraper
Tag the image for Google Container Registry, push it, and deploy it to Cloud Run:
docker tag road-data-scraper eu.gcr.io/<project-name>/road-data-scraper
docker push eu.gcr.io/<project-name>/road-data-scraper
gcloud run deploy road-data-scraper --image eu.gcr.io/<project-name>/road-data-scraper --platform managed --region europe-west2 --timeout "3600" --cpu "4" --memory "16Gi" --max-instances "3"
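Once deployed, gcloud prints the service URL. Assuming the app exposes a trigger route like the hypothetical /scrape sketch above (an assumption, not the app's confirmed API), a run could then be kicked off with a plain HTTP request:

```python
# Illustrative only: the /scrape route and its parameters come from the
# hypothetical FastAPI sketch above, not from the real api_main.py.
import requests

SERVICE_URL = "https://road-data-scraper-<hash>-ew.a.run.app"  # printed by gcloud run deploy

resp = requests.get(
    f"{SERVICE_URL}/scrape",
    params={"start_date": "01012021", "end_date": "31122021"},
    timeout=3600,  # matches the 3600 s timeout the service is deployed with
)
print(resp.json())
```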