TechCrunch Scraper
TechCrunch Scraper is a Node.js package that scrapes articles from TechCrunch by category or tag. It targets Ubuntu and other Debian-based distributions where sudo is available, and uses Puppeteer to navigate and scrape content in a headless Chromium environment.
Features
- Scrape by Category: Automatically retrieve all articles under a specified category.
- Scrape by Tag: Collect articles that are tagged with a specific keyword.
- Headless Browser Support: Runs Chromium in headless mode to scrape dynamic content.
- Optimized for Ubuntu: The installation instructions are written for Ubuntu, but the package is compatible with other Linux distributions.
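The category and tag features above can be sketched roughly as follows. This is a minimal illustration, not the package's actual API: the function names `buildListingUrl` and `scrapeListing` are hypothetical, the `/category/<slug>/` and `/tag/<slug>/` URL layout is an assumption about the TechCrunch site, and the headline selector is a guess that may need adjusting. It reads only the first listing page rather than every article.

```javascript
// Hypothetical helper names; the package's real API is not shown in this README.
const BASE_URL = "https://techcrunch.com";

// Build a listing URL for a category or tag slug (assumed URL layout).
function buildListingUrl(kind, slug) {
  if (kind !== "category" && kind !== "tag") {
    throw new Error(`Unsupported listing kind: ${kind}`);
  }
  return `${BASE_URL}/${kind}/${encodeURIComponent(slug)}/`;
}

// Collect article titles and links from one listing page.
async function scrapeListing(kind, slug) {
  // Loaded lazily so the URL helper works without Puppeteer installed.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(buildListingUrl(kind, slug), { waitUntil: "domcontentloaded" });
    // Selector is an assumption: headline links inside h2/h3 elements.
    return await page.$$eval("h2 a, h3 a", (links) =>
      links.map((a) => ({ title: a.textContent.trim(), url: a.href }))
    );
  } finally {
    // Always release the browser, even if navigation or scraping fails.
    await browser.close();
  }
}
```

Calling `scrapeListing("tag", "ai")` would resolve to an array of `{ title, url }` objects if the assumed selector matches the page markup.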
Prerequisites
Before installing TechCrunch Scraper, make sure your system has the following installed:
- Node.js (Version 14 or later recommended)
- Puppeteer
- System libraries required by Puppeteer and headless Chromium
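As a quick way to confirm the Node.js prerequisite, a small check like the one below could be run before installation. It is only a sketch: the names `MIN_NODE_MAJOR`, `nodeMajor`, and `meetsNodeRequirement` are illustrative and not part of the package.

```javascript
// Minimum major version taken from the prerequisites list.
const MIN_NODE_MAJOR = 14;

// Parse the major version from a string such as "18.17.1" or "v18.17.1".
function nodeMajor(version) {
  return Number.parseInt(version.replace(/^v/, "").split(".")[0], 10);
}

// Defaults to the version of the Node.js process running this script.
function meetsNodeRequirement(version = process.versions.node) {
  return nodeMajor(version) >= MIN_NODE_MAJOR;
}

if (!meetsNodeRequirement()) {
  console.error(
    `Node.js ${MIN_NODE_MAJOR} or later is recommended; found ${process.versions.node}`
  );
}
```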
Installation
Follow these steps to set up the TechCrunch Scraper package:
Step 1: Install System Dependencies
Open a terminal and run the following commands. They install the required system libraries, download browser binaries, start an Xvfb virtual display on :99, and point DISPLAY at it:
sudo apt-get update
sudo apt-get install -y libgbm-dev xvfb chromium-browser libvpx7 libevent-2.1-7 libharfbuzz-icu0 libgstgl-1.0-0 libgstcodecparsers-1.0-0 libwebpdemux2 libenchant-2-2 libsecret-1-0 libmanette-0.2-0 libflite1 libx264-155 libgles2-mesa
npx playwright install
Xvfb :99 -screen 0 1920x1080x24 &
export DISPLAY=:99
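Because the DISPLAY export applies only to the current shell session, it can help to confirm from Node.js that the variable is visible before launching the browser. The helper below is a hedged sketch; `hasDisplay` is a hypothetical name, and the pattern only checks that the value looks like an X display such as `:99`, not that Xvfb is actually running.

```javascript
// Returns true when DISPLAY looks like an X display, e.g. ":99" or "localhost:10.0".
// It does not verify that an X server is listening on that display.
function hasDisplay(env = process.env) {
  return typeof env.DISPLAY === "string" && /:\d+(\.\d+)?$/.test(env.DISPLAY);
}

if (!hasDisplay()) {
  console.warn("DISPLAY is not set; start Xvfb and export DISPLAY in this shell first.");
}
```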