
nextjs-hydration-parser
A Python library for extracting and parsing Next.js hydration data from HTML content
A specialized Python library for extracting and parsing Next.js 13+ hydration data from raw HTML pages. When scraping Next.js applications, the server-side rendered HTML contains complex hydration data chunks embedded in self.__next_f.push() calls that need to be properly assembled and parsed to access the underlying application data.
Next.js 13+ applications with App Router use a sophisticated hydration system that splits data across multiple script chunks in the raw HTML. When you scrape these pages (before JavaScript execution), you get fragments like:
<script>self.__next_f.push([1,"partial data chunk 1"])</script>
<script>self.__next_f.push([1,"continuation of data"])</script>
<script>self.__next_f.push([2,"{\"products\":[{\"id\":1,\"name\":\"Product\"}]}"])</script>
This data is:
- split across multiple self.__next_f.push() calls that must be reassembled in order
- escaped as JavaScript string literals
- stored in a mix of encodings (plain JSON, JavaScript object literals, and identifier-prefixed payloads)

This library solves these challenges by intelligently combining chunks, handling multiple encoding formats, and extracting the meaningful application data.
Perfect for scraping pipelines that need the structured data embedded in self.__next_f.push() calls. Install from PyPI:
pip install nextjs-hydration-parser
Dependencies:
- chompjs (for JavaScript object parsing)
- requests (for the scraping examples)

The library is lightweight with minimal dependencies, designed for integration into existing scraping pipelines.
from nextjs_hydration_parser import NextJSHydrationDataExtractor
import requests
# Create an extractor instance
extractor = NextJSHydrationDataExtractor()
# Scrape a Next.js page (before JavaScript execution)
response = requests.get('https://example-nextjs-ecommerce.com/products')
html_content = response.text
# Extract and parse the hydration data
chunks = extractor.parse(html_content)
# Process the results to find meaningful data
for chunk in chunks:
    print(f"Chunk ID: {chunk['chunk_id']}")
    for item in chunk['extracted_data']:
        if item['type'] == 'colon_separated':
            # Often contains API response data
            print(f"API Data: {item['data']}")
        elif 'products' in str(item['data']):
            # Found product data
            print(f"Products: {item['data']}")
# Extract product data from a Next.js e-commerce site
extractor = NextJSHydrationDataExtractor()
with open('product_page.html', 'r') as f:
    html_content = f.read()
chunks = extractor.parse(html_content)
# Find product information
products = extractor.find_data_by_pattern(chunks, 'product')
for product_data in products:
    if isinstance(product_data['value'], dict):
        product = product_data['value']
        print(f"Product: {product.get('name', 'Unknown')}")
        print(f"Price: ${product.get('price', 'N/A')}")
        print(f"Stock: {product.get('inventory', 'Unknown')}")
For large pages with hundreds of chunks, use lightweight mode when you know what data you're looking for. This can be 10-15x faster than full parsing!
from nextjs_hydration_parser import NextJSHydrationDataExtractor
extractor = NextJSHydrationDataExtractor()
with open('large_nextjs_page.html', 'r') as f:
    html_content = f.read()
# Method 1: Full parsing (slow - 12+ seconds)
chunks = extractor.parse(html_content)
results = extractor.find_data_by_pattern(chunks, 'products')
# Method 2: Lightweight mode (fast - <1 second!) ⚡
results = extractor.parse_and_find(html_content, ['products'])
# Both methods return the same data, but lightweight is 14x faster!
print(f"Found {len(results)} matching items")
for result in results:
    print(f"Key: {result['key']}")
    print(f"Data: {result['value']}")
import time
# Full parsing
start = time.time()
full_chunks = extractor.parse(html_content)
full_results = extractor.find_data_by_pattern(full_chunks, 'product')
print(f"Full parsing: {time.time() - start:.2f}s") # ~12.4s
# Lightweight mode
start = time.time()
light_results = extractor.parse_and_find(html_content, ['product'])
print(f"Lightweight: {time.time() - start:.2f}s") # ~0.9s (14x faster!)
# Search for multiple patterns at once
patterns = ['products', 'listings', 'inventory', 'prices']
results = extractor.parse_and_find(html_content, patterns)
# Or use manual lightweight parsing for more control
chunks = extractor.parse(
    html_content,
    lightweight=True,
    target_patterns=['listingsConnection', 'productData']
)
# Only chunks containing your patterns are fully parsed
for chunk in chunks:
    if not chunk.get('_skipped'):
        print(f"Chunk {chunk['chunk_id']} contains target data")
        # Process extracted_data as usual
# Extract product listings from a large e-commerce category page
extractor = NextJSHydrationDataExtractor()
with open('ecommerce_category.html', 'r') as f:
    html = f.read()
# Fast extraction using lightweight mode
results = extractor.parse_and_find(html, ['products', 'catalog', 'items'])
for result in results:
    if result['key'] in ['products', 'catalog']:
        data = result['value']
        # Access product listings
        if isinstance(data, list):
            print(f"Found {len(data)} products")
            for product in data[:5]:  # Show first 5
                print(f"- {product.get('name', 'N/A')}: ${product.get('price', 'N/A')}")
import requests
from nextjs_hydration_parser import NextJSHydrationDataExtractor
def scrape_nextjs_data(url):
    """Scrape and extract data from a Next.js application."""
    # Get raw HTML (before JavaScript execution)
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; DataExtractor/1.0)'}
    response = requests.get(url, headers=headers)

    # Parse hydration data
    extractor = NextJSHydrationDataExtractor()
    chunks = extractor.parse(response.text)

    # Extract meaningful data
    extracted_data = {}
    for chunk in chunks:
        if chunk['chunk_id'] == 'error':
            continue  # Skip malformed chunks
        for item in chunk['extracted_data']:
            data = item['data']
            # Look for common data patterns
            if isinstance(data, dict):
                # API responses often contain these keys
                for key in ['products', 'users', 'posts', 'data', 'results']:
                    if key in data:
                        extracted_data[key] = data[key]
    return extracted_data
# Usage
data = scrape_nextjs_data('https://nextjs-shop.example.com')
print(f"Found {len(data.get('products', []))} products")
When scraping large Next.js applications, use lightweight mode for better performance:
# Read from file
with open('large_nextjs_page.html', 'r', encoding='utf-8') as f:
    html_content = f.read()
# Use lightweight mode when you know what you're looking for (RECOMMENDED)
extractor = NextJSHydrationDataExtractor()
results = extractor.parse_and_find(html_content, ['products', 'listings'])
# Or full parse if you need everything
chunks = extractor.parse(html_content)
print(f"Found {len(chunks)} hydration chunks")
# Get overview of all available data keys
all_keys = extractor.get_all_keys(chunks)
print("Most common data keys:")
for key, count in list(all_keys.items())[:20]:
    print(f"  {key}: {count} occurrences")
# Focus on specific data types
api_data = []
for chunk in chunks:
    for item in chunk['extracted_data']:
        if item['type'] == 'colon_separated' and 'api' in item.get('identifier', '').lower():
            api_data.append(item['data'])
print(f"Found {len(api_data)} API data chunks")
NextJSHydrationDataExtractor
The main class for extracting Next.js hydration data.
parse(html_content: str, lightweight: bool = False, target_patterns: Optional[List[str]] = None) -> List[Dict[str, Any]]
Parse Next.js hydration data from HTML content.
html_content: Raw HTML string containing script tags
lightweight: If True, only process chunks containing target patterns (much faster)
target_patterns: List of strings to search for in lightweight mode (e.g., ["products", "listings"])

parse_and_find(html_content: str, patterns: List[str]) -> List[Any] ⚡ RECOMMENDED
Convenience method that combines lightweight parsing with pattern matching. Much faster than full parsing when you know what you're looking for.
html_content: Raw HTML string
patterns: List of key patterns to search for (e.g., ["products", "catalog", "items"])

get_all_keys(parsed_chunks: List[Dict], max_depth: int = 3) -> Dict[str, int]
Extract all unique keys from parsed chunks.
parsed_chunks: Output from parse() method
max_depth: Maximum depth to traverse

find_data_by_pattern(parsed_chunks: List[Dict], pattern: str) -> List[Any]
Find data matching a specific pattern.
parsed_chunks: Output from parse() method
pattern: Key pattern to search for

The parser returns data in the following structure:
[
    {
        "chunk_id": "1",                # ID from self.__next_f.push([ID, data])
        "extracted_data": [
            {
                "type": "colon_separated|standalone_json|whole_text",
                "data": {...},          # Parsed JavaScript/JSON object
                "identifier": "...",    # For colon_separated type
                "start_position": 123   # For standalone_json type
            }
        ],
        "chunk_count": 1,               # Number of chunks with this ID
        "_positions": [123]             # Original positions in HTML
    }
]
The parser handles various data formats commonly found in Next.js 13+ hydration chunks:
// Standard JSON payload
self.__next_f.push([1, "{\"products\":[{\"id\":1,\"name\":\"Laptop\",\"price\":999}]}"])
// Identifier-prefixed ("colon separated") payload
self.__next_f.push([2, "eyJhcGlLZXkiOiJ4eXoifQ==:{\"data\":{\"users\":[{\"id\":1}]}}"])
// JavaScript object literal (not valid JSON)
self.__next_f.push([3, "{key: 'value', items: [1, 2, 3], nested: {deep: true}}"])
// Escaped string content
self.__next_f.push([4, "\"escaped content with \\\"quotes\\\" and newlines\\n\""])
// Data split across multiple chunks with same ID
self.__next_f.push([5, "first part of data"])
self.__next_f.push([5, " continued here"])
self.__next_f.push([5, " and final part"])
Next.js often embeds API responses, page props, and component data in deeply nested formats that the parser can extract and flatten for easy access.
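For intuition, here is a minimal, self-contained sketch of the general reassembly approach. It is not the library's actual implementation (which also relies on chompjs for JavaScript object literals): extract each push() payload with a regular expression, concatenate payloads that share a chunk ID, then attempt JSON parsing with and without an identifier prefix. The regex and helper names below are illustrative assumptions.

import json
import re
from collections import defaultdict

# Illustrative sketch only; the real parser handles more formats and edge cases.
PUSH_RE = re.compile(r'self\.__next_f\.push\(\[(\d+),\s*"((?:[^"\\]|\\.)*)"\]\)')

def reassemble_chunks(html):
    """Concatenate push() payloads that share the same chunk ID."""
    buffers = defaultdict(str)
    for chunk_id, payload in PUSH_RE.findall(html):
        # Payloads are JS string literals; json.loads undoes the escaping.
        buffers[chunk_id] += json.loads(f'"{payload}"')
    return dict(buffers)

def try_parse(raw):
    """Attempt JSON as-is, then again after stripping an 'identifier:' prefix."""
    for candidate in (raw, raw.split(':', 1)[-1]):
        try:
            return json.loads(candidate)
        except ValueError:
            continue
    return raw  # fall back to the raw string

html = '''
<script>self.__next_f.push([5, "{\\"products\\""])</script>
<script>self.__next_f.push([5, ":[{\\"id\\":1}]}"])</script>
'''
for chunk_id, raw in reassemble_chunks(html).items():
    print(chunk_id, try_parse(raw))  # 5 {'products': [{'id': 1}]}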
Understanding the hydration process helps explain why this library is necessary:
1. The server renders the page and serializes its data.
2. The data is embedded in the HTML as a series of self.__next_f.push() calls.
3. Client-side JavaScript reassembles those chunks and hydrates the page.

When scraping, you're intercepting step 2: getting the raw HTML with embedded data before the JavaScript processes it. This gives you access to all the data the application uses, but in a fragmented format that needs intelligent parsing.
Why not just use the rendered page? Rendering requires executing JavaScript in a headless browser, which is slow and resource-intensive; fetching the raw HTML and parsing the embedded hydration data yields the same underlying data with a single HTTP request.
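As a rough comparison, here is a hedged sketch of the two approaches. The URL is a placeholder, and it assumes Playwright is installed (pip install playwright, then playwright install chromium); actual timings depend on the site.

import time
import requests
from playwright.sync_api import sync_playwright
from nextjs_hydration_parser import NextJSHydrationDataExtractor

url = 'https://example-nextjs-shop.com/products'  # placeholder URL

# Option A: render the page in a headless browser (heavyweight)
start = time.time()
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    rendered_html = page.content()
    browser.close()
print(f"Headless render: {time.time() - start:.1f}s")

# Option B: fetch the raw HTML and parse the embedded hydration data
start = time.time()
raw_html = requests.get(url).text
results = NextJSHydrationDataExtractor().parse_and_find(raw_html, ['products'])
print(f"Raw fetch + parse: {time.time() - start:.1f}s")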
The parser includes robust error handling: malformed or unparseable chunks are returned with a chunk_id of 'error' rather than raising, so a single bad chunk never aborts a full parse (see the skip in scrape_nextjs_data above).
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the project
2. Create your feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a pull request

# Clone the repository
git clone https://github.com/kennyaires/nextjs-hydration-parser.git
cd nextjs-hydration-parser
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode with testing dependencies
pip install -e .[dev]
# Run tests
pytest tests/ -v
# Run formatting
black nextjs_hydration_parser/ tests/
# Test with real Next.js sites
python examples/scrape_example.py
The library includes examples for testing with popular Next.js sites:
# Test with different types of Next.js applications
python examples/test_ecommerce.py
python examples/test_blog.py
python examples/test_social.py
This project is licensed under the MIT License - see the LICENSE file for details.
This project is not affiliated with or endorsed by Vercel, Next.js, or any related entity.
All trademarks and brand names are the property of their respective owners.
This library is intended for ethical use only. Users are solely responsible for ensuring that their use of this software complies with applicable laws, website terms of service, and data usage policies. The authors disclaim any liability for misuse or violations resulting from the use of this tool.