
A tool that generates content files from website routes in multiple formats (text, JSON, markdown)
A Node.js web application that 'scoops' content from a series of routes for a specified website or API. ScoopIt fetches content and saves it in multiple formats (text, JSON, and markdown) with comprehensive testing and logging capabilities. It supports both HTML web pages and JSON API responses.
ScoopIt was designed to simplify and enhance AI development workflows, particularly those involving Large Language Models (LLMs). By providing clean, structured data extraction from websites, ScoopIt solves several key challenges in AI application development:
One of the most powerful applications for ScoopIt is in building domain-specific chatbots and assistants that need expertise in particular subjects. For example:
Case Study: Legal Compliance Assistant
A legal tech company needed to build an assistant that could answer questions about regulatory compliance across multiple jurisdictions. Using ScoopIt, they:
This approach resulted in an assistant that could accurately answer complex compliance questions with 87% higher accuracy than using generic web content, while providing proper citations to source material.
Case Study: Healthcare Research Tool
A medical research organization created a tool to help doctors stay current with the latest research:
By using ScoopIt to clean and structure the data before sending it to the LLM, the system achieved significantly higher accuracy in medical domain knowledge.
ScoopIt intelligently extracts and formats content from websites and APIs, ensuring that LLMs receive clean, structured data without irrelevant UI elements, navigation menus, footers, or other noise that might confuse the model or waste token context windows.
By providing content in multiple formats (text, JSON, markdown), ScoopIt gives you options for how to best present information to an LLM:
Instead of manually downloading individual pages, ScoopIt can process an entire site or specific sections based on your routes configuration, making it efficient to gather comprehensive information for training or context. This batch processing capability can save days of manual work in large-scale AI projects.
By extracting only relevant content and removing boilerplate HTML, ScoopIt helps you maximize the value of limited context windows in LLMs, focusing on the information that matters rather than wasting tokens on page structure.
This approach enables more accurate, relevant responses from your AI applications by providing clean, structured data tailored to your specific domain.
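To make the idea of removing boilerplate concrete, here is a minimal sketch of stripping navigation, footer, and markup from a page before handing the text to an LLM. It is an illustration only: `stripBoilerplate` is a hypothetical helper built on regular expressions, not ScoopIt's actual extractor, and a real page would be better served by a proper HTML parser.

```javascript
// Illustrative only: a regex-based cleaner, not ScoopIt's actual extraction logic.
function stripBoilerplate(html) {
  return html
    // Drop whole elements that rarely carry page content.
    .replace(/<(script|style|nav|header|footer)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Remove the remaining tags but keep the text between them.
    .replace(/<[^>]+>/g, ' ')
    // Collapse the whitespace left behind by the removed markup.
    .replace(/\s+/g, ' ')
    .trim();
}

const sampleHtml = `
  <html><body>
    <nav><a href="/">Home</a></nav>
    <main><h1>Pricing</h1><p>Plans start at $10.</p></main>
    <footer>Copyright</footer>
  </body></html>`;

const cleaned = stripBoilerplate(sampleHtml);
console.log(cleaned); // Pricing Plans start at $10.
console.log(`${sampleHtml.length} characters in, ${cleaned.length} characters out`);
```

The size difference between input and output is a rough proxy for the context-window space saved by sending only the cleaned text.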
ScoopIt can be used programmatically in your Node.js applications:
const scoopit = require('scoopit');

// Process a single web page
async function processSinglePage() {
  try {
    const result = await scoopit.processSinglePage('https://example.com/page', 'json');
    console.log(`Page processed: ${result.url}`);
    console.log(`Content length: ${result.data.textContent.length} characters`);
  } catch (error) {
    console.error('Error processing page:', error);
  }
}

// Process multiple routes from a website
async function processMultipleRoutes() {
  const baseUrl = 'https://example.com';
  const routes = ['/about', '/contact', '/products'];
  try {
    const results = await scoopit.processRoutes(baseUrl, routes, 'all');
    console.log(`Processed ${results.length} routes successfully`);
  } catch (error) {
    console.error('Error processing routes:', error);
  }
}

// Extract content without saving files
async function extractContentOnly() {
  try {
    const content = await scoopit.fetchContent('https://example.com');
    if (content) {
      // You can use other utilities from the library to process content
      // without saving files
      const { extractContent, convertToMarkdown } = require('scoopit/utils/contentProcessor');
      const { textContent } = extractContent(content);
      console.log('Extracted text content:', textContent);
    }
  } catch (error) {
    console.error('Error extracting content:', error);
  }
}
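When preparing the `baseUrl` and `routes` arguments, it can help to preview the full URLs they resolve to. The sketch below uses Node's built-in `URL` class; `buildRouteUrls` is a hypothetical helper, and whether ScoopIt joins the base URL and routes the same way internally is an assumption.

```javascript
// Hypothetical helper: resolves each route against the base URL with the WHATWG URL API.
function buildRouteUrls(baseUrl, routes) {
  return routes.map((route) => new URL(route, baseUrl).href);
}

const urls = buildRouteUrls('https://example.com', ['/about', '/contact', '/products']);
console.log(urls);

// A route with a leading slash replaces any path already present on the base URL,
// while a route without one is resolved relative to it.
const nested = buildRouteUrls('https://example.com/docs/', ['/api', 'api']);
console.log(nested);
```

If the base URL carries a path prefix, checking the resolved URLs this way before a long run avoids fetching the wrong pages.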
You can run ScoopIt directly without installation using npx:
# Run in interactive mode
npx scoopit
# Process a specific URL
npx scoopit https://example.com
# Process a URL with a specific format
npx scoopit https://example.com json
# Process routes from a routes.json file
npx scoopit routes.json
# Process routes with a custom base URL
npx scoopit routes.json all https://example.com
# Specify a custom routes file path
npx scoopit -routePath ./path/to/custom-routes.json
You can also install ScoopIt using your preferred package manager:
# Local installation
npm install scoopit
# Global installation
npm install -g scoopit
# Local installation
yarn add scoopit
# Global installation
yarn global add scoopit
# Local installation
pnpm add scoopit
# Global installation
pnpm add -g scoopit
Global installation makes the scoopit command available system-wide.
To get the latest development version directly from GitHub:
# Using npm
npm install -g github:yourusername/scoopit
# Using yarn
yarn global add github:yourusername/scoopit
# Using pnpm
pnpm add -g github:yourusername/scoopit
If you want to contribute, modify the code, or run from source:
git clone https://github.com/yourusername/scoopit.git
cd scoopit
# Using npm
npm install
# Using yarn
yarn install
# Using pnpm
pnpm install
npm run samples
# or
yarn samples
# or
pnpm run samples
# Run the CLI directly
node cli.js
# Run with default options
npm start
# or
yarn start
# or
pnpm start
# Run in development mode (auto-restart on file changes)
npm run dev
# or
yarn dev
# or
pnpm run dev
# Using npm
npm install -g .
# Using yarn
yarn global add file:$PWD
# Using pnpm
pnpm link --global
After global installation, you can use the scoopit command from anywhere.
ScoopIt provides various command-line options to control its behavior:
| Option | Description | Example |
|---|---|---|
| [url] | Process a specific URL | scoopit https://example.com |
| [format] | Output format (text, json, markdown, all) | scoopit https://example.com json |
| [file.json] | JSON file containing routes to process | scoopit routes.json |
| -routePath | Path to a custom routes file | scoopit -routePath ./custom-routes.json |
| [baseUrl] | Base URL for routes (with routes.json) | scoopit routes.json all https://example.com |
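The positional arguments in the table can be told apart by their shape alone. The following sketch shows one way a wrapper script might classify them; `classifyArgs` and its return shape are hypothetical, and this is not ScoopIt's own argument parser.

```javascript
// Formats listed in the options table above.
const FORMATS = ['text', 'json', 'markdown', 'all'];

// Hypothetical classifier for the documented argument shapes.
function classifyArgs(argv) {
  const parsed = { url: null, format: null, routesFile: null, routePath: null, baseUrl: null };
  for (let i = 0; i < argv.length; i += 1) {
    const arg = argv[i];
    if (arg === '-routePath') {
      // The flag consumes the next argument as the routes file path.
      parsed.routePath = argv[i + 1] ?? null;
      i += 1;
    } else if (/^https?:\/\//.test(arg)) {
      // A URL that follows a routes file is treated as the base URL for those routes.
      if (parsed.routesFile || parsed.routePath) parsed.baseUrl = arg;
      else parsed.url = arg;
    } else if (FORMATS.includes(arg)) {
      parsed.format = arg;
    } else if (arg.endsWith('.json')) {
      parsed.routesFile = arg;
    }
  }
  return parsed;
}

console.log(classifyArgs(['https://example.com', 'json']));
console.log(classifyArgs(['routes.json', 'all', 'https://example.com']));
```

The URL check runs before the `.json` check so that an API endpoint such as `https://example.com/data.json` is not mistaken for a routes file.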
ScoopIt's behavior can also be adjusted through environment variables:
| Environment Variable | Description | Values |
|---|---|---|
| LOG_LEVEL | Controls logging verbosity | error, warn, info (default), debug |
| SCOOPIT_VERBOSE | Enable verbose output for tests | true, false |
| NODE_ENV | Application environment | test, development, production |
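As a rough illustration of how a level setting like `LOG_LEVEL` typically behaves, the sketch below resolves the variable with `info` as the default and decides whether a message at a given level should be emitted. The helper names are hypothetical, and the fallback for unrecognized values is an assumption rather than documented ScoopIt behavior.

```javascript
// Levels from the table above, ordered from least to most verbose.
const LOG_LEVELS = ['error', 'warn', 'info', 'debug'];

// Hypothetical resolver: falls back to 'info' when the variable is unset or unrecognized.
function resolveLogLevel(env = process.env) {
  const requested = (env.LOG_LEVEL || '').toLowerCase();
  return LOG_LEVELS.includes(requested) ? requested : 'info';
}

// A message is emitted when its level is no more verbose than the active level.
function shouldLog(messageLevel, activeLevel) {
  return LOG_LEVELS.indexOf(messageLevel) <= LOG_LEVELS.indexOf(activeLevel);
}

const active = resolveLogLevel({ LOG_LEVEL: 'warn' });
console.log(active, shouldLog('error', active), shouldLog('debug', active));
```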
Run ScoopIt without installation using npx:
# Interactive mode
npx scoopit
# With URL and format
npx scoopit https://example.com json
# With custom routes file
npx scoopit -routePath ./custom-routes.json
# Run a specific version
npx scoopit@1.0.0
# Run the latest beta version
npx scoopit@beta
The interactive CLI provides the easiest way to use the application:
npm run cli
# or if installed globally
scoopit
# or without installation
npx scoopit
Run the application with default settings:
npm start
# or
npx scoopit
You can specify a custom base URL, routes, and format:
# Using npm script
npm start -- https://example.com json
# Using global installation
scoopit https://example.com json
# Using npx
npx scoopit https://example.com json
# With a routes file
scoopit routes.json all https://example.com
# With a custom routes file path
scoopit -routePath ./custom-routes.json all https://example.com
ScoopIt includes several ASCII art banners for ICJIA (Illinois Criminal Justice Information Authority) that are displayed when the application starts:
# List all available banners with their IDs
npm run banners
# Or with yarn
yarn banners
# Or with pnpm
pnpm run banners
You can specify which banner to display using the --banner flag:
# Use the 'block' style banner
scoopit --banner block
# With npx
npx scoopit --banner shadow
# Combined with other arguments
scoopit --banner thin https://example.com json
If you prefer not to see the ICJIA banner, use the --no-icjia flag:
scoopit --no-icjia
To add your own ASCII art banners, edit the src/ui/console-banner.js file and add a new entry to the banners array:
{
  id: 'my-custom',
  name: 'My Custom Banner',
  art: `
+-+-+-+-+-+
|I|C|J|I|A|
+-+-+-+-+-+
`,
  color: 'green',
  tags: ['custom', 'small']
}
ScoopIt uses a routes.json file to determine which routes to process. The application handles routes configuration with the following logic:
- It looks for routes.json in the project root
- If routes.json is present but empty, it defaults to a single route ('/')
- The --routePath flag can specify a custom path for routes.json
- If no routes.json is found, it falls back to default routes defined in the code

Example routes.json file:
[
"/",
"/about",
"/products",
"/contact"
]
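The lookup rules for the routes file can be sketched as a small function. This is an approximation under stated assumptions: `loadRoutes` and `FALLBACK_ROUTES` are hypothetical names, the real built-in default routes are not listed here, and "empty" is taken to mean either a blank file or an empty array.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical stand-in: the actual default routes defined in ScoopIt's code are not documented here.
const FALLBACK_ROUTES = ['/'];

function loadRoutes(customPath, projectRoot = process.cwd()) {
  // A custom path takes precedence over routes.json in the project root.
  const file = customPath || path.join(projectRoot, 'routes.json');

  // No routes file found: fall back to the built-in defaults.
  if (!fs.existsSync(file)) return FALLBACK_ROUTES;

  const raw = fs.readFileSync(file, 'utf8').trim();
  const routes = raw === '' ? [] : JSON.parse(raw);
  if (!Array.isArray(routes)) {
    throw new TypeError(`${file} must contain a JSON array of routes`);
  }

  // File present but empty: process only the root route.
  return routes.length === 0 ? ['/'] : routes;
}

console.log(loadRoutes());
```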
For development with auto-restart on file changes:
npm run dev
All generated files are saved in the output directory in the current working directory, organized as follows:
- output/json/ - JSON files containing the full URL, route, and content in both text and markdown formats
- output/text/ - Plain text content files
- output/markdown/ - Markdown content files

ScoopIt includes a comprehensive testing suite that ensures the application works correctly across all formats and configurations.
To run the complete test suite with default verbosity:
npm test
# Run only unit tests
npm run test:unit
# Run only integration tests
npm run test:integration
# Run only validation tests
npm run test:validation
# Run tests with detailed output
npm run test:verbose
# Run tests with minimal output
npm run test:quiet
The test suite includes static sample files for testing in the test/samples directory, categorized by format:
- test/samples/text/ - Plain text samples
- test/samples/json/ - JSON data samples
- test/samples/markdown/ - Markdown content samples

To update or regenerate test samples:
npm run samples
The application includes a logging system that provides detailed insight into each run. Log files are written to the logs directory:
- combined.log - All logs
- error.log - Error logs only

Log levels can be configured by setting the LOG_LEVEL environment variable to one of:
- error
- warn
- info (default)
- debug

Example:
LOG_LEVEL=debug npm start
Contributions are welcome! Please feel free to submit a Pull Request.
ScoopIt includes a comprehensive testing system with multiple options for running tests, from full test suites to individual component tests.
To run the full test suite with all test types:
# Run all tests with standard output
npm run test:all
# Run all tests with verbose output
npm run test:all-verbose
You can run specific types of tests:
# Run only unit tests
npm run test:unit
# Run only integration tests
npm run test:integration
# Run only output validation tests
npm run test:validation
# Run live tests against a real website
npm run test:live
# Run live tests with a specific site
npm run test:live-site
ScoopIt provides an enhanced test runner with improved output formatting, progress indicators, and comprehensive statistics:
# Run enhanced test runner with all tests
node scripts/enhancedTestRunner.js
# Run with specific test types
node scripts/enhancedTestRunner.js --unit-only
node scripts/enhancedTestRunner.js --integration-only
node scripts/enhancedTestRunner.js --validation-only
# Run with a specific test site
node scripts/enhancedTestRunner.js --test-site=https://example.com
# Skip content validation (only check file existence)
node scripts/enhancedTestRunner.js --skip-validation
For quick verification of output files without running full tests, use the file existence checker:
# Check if output files exist and have content
node scripts/fileExistenceChecker.js
This tool will check the output directory for files in all formats (text, JSON, markdown) and verify that files exist and have content, without validating the specific content.
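The kind of check described above can be approximated in a few lines of Node.js. The sketch assumes the `output/text`, `output/json`, and `output/markdown` layout; `checkOutput` is a hypothetical helper and not the code in `scripts/fileExistenceChecker.js`.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical check: each format directory must hold at least one file, and none may be empty.
function checkOutput(outputDir, formats = ['text', 'json', 'markdown']) {
  return formats.map((format) => {
    const dir = path.join(outputDir, format);
    const files = fs.existsSync(dir)
      ? fs.readdirSync(dir, { withFileTypes: true })
          .filter((entry) => entry.isFile())
          .map((entry) => entry.name)
      : [];
    // Only file sizes are inspected; the content itself is not validated.
    const emptyFiles = files.filter((name) => fs.statSync(path.join(dir, name)).size === 0);
    return {
      format,
      fileCount: files.length,
      emptyFiles,
      ok: files.length > 0 && emptyFiles.length === 0,
    };
  });
}

console.table(checkOutput('./output'));
```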
ScoopIt also includes an interactive test runner that allows you to choose which tests to run:
# Run the interactive test selector
npm run test:select
# Run specific tests directly
npm run test:select -- unit
npm run test:select -- integration
npm run test:select -- live
# Run multiple test types
npm run test:select -- unit integration
The test runner will display a menu of available tests, execute your selection, and provide detailed results with statistics for each test.
You can also run ScoopIt using Docker.
Docker offers several advantages for running ScoopIt:
Environment Isolation: Docker containers include all necessary dependencies without affecting your system. This eliminates "works on my machine" problems and potential conflicts with other Node.js versions or packages.
Consistent Execution: The containerized environment ensures ScoopIt runs the same way regardless of the host operating system (Windows, macOS, Linux).
No Node.js Requirement: You don't need to install Node.js on your host system, making deployment easier on servers or machines where you don't want to manage Node.js installations.
Simplified CI/CD Integration: Docker containers are easy to integrate into continuous integration and deployment pipelines, allowing automated content extraction as part of your workflows.
Resource Control: Docker allows you to limit CPU and memory usage, which is useful when running ScoopIt on shared servers or in production environments.
Easy Distribution: You can share your configured ScoopIt container with team members who can run it without worrying about installation or configuration steps.
Multiple Version Support: You can run different versions of ScoopIt in different containers without conflicts.
Scheduled Tasks: When combined with container orchestration tools, you can easily schedule ScoopIt to run content extraction jobs at regular intervals.
The repository includes a Dockerfile ready for use:
docker build -t scoopit .
# Run in interactive mode
docker run -it --rm -v "$(pwd)/output:/app/output" scoopit
# Run with specific arguments
docker run --rm -v "$(pwd)/output:/app/output" scoopit https://example.com json
# Use a custom routes file
docker run --rm \
-v "$(pwd)/output:/app/output" \
-v "$(pwd)/my-routes.json:/app/routes.json" \
scoopit
The output will be available in the output directory on your host machine.
The repository also includes a docker-compose.yml file for easier deployment:
docker-compose up
To pass custom arguments, edit the docker-compose.yml file:
# Uncomment and modify this line in docker-compose.yml
command: ["https://example.com", "json"]
volumes:
- ./output:/app/output
- ./my-custom-routes.json:/app/routes.json
If you want to create your own Docker setup, use this Dockerfile as a template:
FROM node:18-alpine
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm install --production
# Copy project files
COPY . .
# Make CLI executable
RUN chmod +x ./cli.js
# Create output directory
RUN mkdir -p /app/output
# Set entrypoint to the CLI
ENTRYPOINT ["./cli.js"]
Here's a practical example of how to use Docker for automated content extraction:
Create a routes.json file with the routes you want to scrape:
[
"/",
"/about",
"/products",
"/blog"
]
Create a shell script (run-extraction.sh):
#!/bin/bash
# Work from the script's own directory so the job behaves the same under cron
cd "$(dirname "$0")" || exit 1
# Set timestamp for this run
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Use an absolute path, since not every Docker version accepts relative bind-mount paths
OUTPUT_DIR="$(pwd)/content_archives/$TIMESTAMP"
# Create the output directory
mkdir -p "$OUTPUT_DIR"
# Run the container with the current timestamp in the output path
docker run --rm \
-v "$(pwd)/routes.json:/app/routes.json" \
-v "$OUTPUT_DIR:/app/output" \
scoopit https://example.com all
echo "Content extraction completed at $TIMESTAMP"
echo "Files saved to $OUTPUT_DIR"
Make the script executable:
chmod +x run-extraction.sh
Schedule it with cron (crontab -e):
# Run content extraction every Sunday at 2am
0 2 * * 0 /path/to/run-extraction.sh >> /path/to/extraction.log 2>&1
This setup automatically extracts content from your specified routes every week, organizing the output in timestamped directories for easy archiving and version comparison.
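Because the directory names follow the `YYYYMMDD_HHMMSS` pattern, they sort chronologically as plain strings, which makes picking runs to compare straightforward. The sketch below reproduces the naming scheme in Node.js and selects the most recent archives from a list of names; both helpers are hypothetical and not part of ScoopIt.

```javascript
// Hypothetical helper mirroring `date +%Y%m%d_%H%M%S` (local time).
function archiveTimestamp(date = new Date()) {
  const pad = (n) => String(n).padStart(2, '0');
  const day = `${date.getFullYear()}${pad(date.getMonth() + 1)}${pad(date.getDate())}`;
  const time = `${pad(date.getHours())}${pad(date.getMinutes())}${pad(date.getSeconds())}`;
  return `${day}_${time}`;
}

// Hypothetical helper: returns the newest archive names, oldest first.
function latestArchives(names, count = 2) {
  return names
    .filter((name) => /^\d{8}_\d{6}$/.test(name)) // ignore unrelated directory entries
    .sort() // lexicographic order equals chronological order for this format
    .slice(-count);
}

console.log(archiveTimestamp());
console.log(latestArchives(['20240107_020000', 'notes', '20231231_020000', '20240114_020000']));
```

The two names returned by `latestArchives` can then be passed to a diff tool to see what changed between weekly runs.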
This project is licensed under the MIT License - see the LICENSE file for details.