# NLWeb Data Loading
Data loading tools for NLWeb - load schema.org JSON files and RSS feeds into vector databases with automatic embedding generation.
## Overview

`nlweb-dataload` provides a simple interface for loading structured data into vector databases. It:
- Loads schema.org JSON files or RSS/Atom feeds
- Automatically computes embeddings for all documents
- Uploads to vector databases in batches
- Supports deletion by site
## Installation

```bash
pip install nlweb-dataload
```

Or, for an editable install from a checkout of the repository:

```bash
pip install -e packages/dataload
```
## Quick Start

```python
import asyncio

import nlweb_core
from nlweb_dataload import load_to_db

# Initialize NLWeb with your configuration (endpoints, embedding settings, etc.).
nlweb_core.init(config_path="config.yaml")

async def main():
    result = await load_to_db(
        file_path="recipes.json",
        site="seriouseats"
    )
    print(f"Loaded {result['total_loaded']} documents")

asyncio.run(main())
```
## Configuration

Add writer configuration to your `config.yaml`:

```yaml
retrieval_endpoints:
  azure_search_prod:
    db_type: azure_ai_search
    api_endpoint: https://your-search.search.windows.net
    api_key_env: AZURE_SEARCH_KEY
    index_name: embeddings1536
    auth_method: api_key
    writer:
      enabled: true
      import_path: nlweb_azure_vectordb.azure_search_writer
      class_name: AzureSearchWriter

write_endpoint: azure_search_prod
```
## Usage

### Load JSON File

Load a schema.org JSON file:

```python
from nlweb_dataload import load_to_db

result = await load_to_db(
    file_path="data/recipes.json",
    site="seriouseats"
)
```
Example JSON file:

```json
[
  {
    "@context": "http://schema.org",
    "@type": "Recipe",
    "url": "https://www.seriouseats.com/best-pasta-recipe",
    "name": "Best Pasta Ever",
    "description": "The best pasta recipe you'll ever make",
    "author": {"@type": "Person", "name": "Chef Mario"}
  }
]
```
### Load RSS Feed

Load an RSS or Atom feed (entries are automatically converted to schema.org Article):

```python
from nlweb_dataload import load_to_db

# From a URL
result = await load_to_db(
    file_path="https://example.com/feed.xml",
    site="example",
    file_type="rss"
)

# From a local file
result = await load_to_db(
    file_path="feeds/blog.xml",
    site="myblog",
    file_type="rss"
)
```
### Delete Site Data

Remove all documents for a site:

```python
from nlweb_dataload import delete_site

result = await delete_site(site="old-site.com")
print(f"Deleted {result['deleted_count']} documents")
```
### Batch Upload

Control the batch size for large datasets:

```python
result = await load_to_db(
    file_path="large_dataset.json",
    site="example",
    batch_size=50
)
```
### Specify Endpoint

Use a specific endpoint instead of the default `write_endpoint`:

```python
result = await load_to_db(
    file_path="data.json",
    site="example",
    endpoint_name="azure_search_staging"
)
```
## Data Format

### Schema.org JSON

Documents must include these fields:

- `url` (required): Unique document URL
- `name` or `headline` (required): Document name/title
- `description` (optional): Used for embedding if present

Any valid schema.org type is supported (Recipe, Article, Product, Event, etc.).
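A minimal valid document therefore needs only a `url` and a `name` (the `@type` and values below are illustrative):

```json
[
  {
    "@type": "Article",
    "url": "https://example.com/posts/1",
    "name": "Example Post"
  }
]
```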
### RSS/Atom Feeds

RSS/Atom feeds are automatically converted to schema.org Article format with:

- `url`: Entry link
- `name`/`headline`: Entry title
- `description`: Entry summary/content
- `datePublished`: Publication date
- `author`: Entry author
- `publisher`: Feed title/link
- `keywords`: Entry tags/categories
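For example, a feed entry converted under this mapping might come out roughly as follows. Only the field mapping above is guaranteed; the concrete values and flat `author`/`publisher` strings are assumptions for illustration:

```json
{
  "@context": "http://schema.org",
  "@type": "Article",
  "url": "https://example.com/blog/first-post",
  "headline": "First Post",
  "description": "Summary text taken from the feed entry.",
  "datePublished": "2025-01-15T09:00:00Z",
  "author": "Jane Doe",
  "publisher": "Example Blog",
  "keywords": ["announcements", "meta"]
}
```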
## Architecture

### Write Interface Separation

NLWeb maintains a clean separation between read and write operations:

- `nlweb_core.retriever`: Read-only search interface
- `nlweb_dataload.writer`: Write interface (upload/delete)

This prevents accidental writes during queries and allows different access patterns for the two paths.
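In practice this means query-time code and ingestion code import from different modules, so the write functions are simply never in scope at query time. A minimal sketch of the boundary (module paths are the ones named above; the comments describe intent, not additional API):

```python
# Query path: read-only. The retriever module exposes search only,
# so there is no upload or delete operation to call by accident.
import nlweb_core.retriever

# Ingestion path: the write operations live in a separate package.
from nlweb_dataload import load_to_db, delete_site
```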
### Writer Interface

Each vector database provider implements `VectorDBWriterInterface`:

```python
from nlweb_dataload.writer import VectorDBWriterInterface

class MyDatabaseWriter(VectorDBWriterInterface):
    async def upload_documents(self, documents, **kwargs):
        """Upload a batch of documents (with embeddings) to the database."""
        ...

    async def delete_documents(self, filter_criteria, **kwargs):
        """Delete documents matching the given filter criteria."""
        ...

    async def delete_site(self, site, **kwargs):
        """Delete all documents belonging to a site."""
        ...
```
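As a concrete (hypothetical) illustration, here is a toy in-memory writer. The storage, constructor, and matching logic are assumptions made for the sketch; only the three async methods come from the interface:

```python
from nlweb_dataload.writer import VectorDBWriterInterface

class InMemoryWriter(VectorDBWriterInterface):
    """Toy writer that stores documents in a dict keyed by URL."""

    def __init__(self):
        self._docs = {}

    async def upload_documents(self, documents, **kwargs):
        # Each document is expected to carry a unique "url" field.
        for doc in documents:
            self._docs[doc["url"]] = doc

    async def delete_documents(self, filter_criteria, **kwargs):
        # Remove every document whose fields match all criteria.
        matches = [
            url for url, doc in self._docs.items()
            if all(doc.get(k) == v for k, v in filter_criteria.items())
        ]
        for url in matches:
            del self._docs[url]

    async def delete_site(self, site, **kwargs):
        # Deleting a site is just a filtered delete on the "site" field.
        await self.delete_documents({"site": site})
```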
## Supported Vector Databases

### Azure AI Search

Built-in support via `nlweb-azure-vectordb`:

```bash
pip install nlweb-azure-vectordb
```

Configuration:

```yaml
retrieval_endpoints:
  azure_search:
    db_type: azure_ai_search
    writer:
      import_path: nlweb_azure_vectordb.azure_search_writer
      class_name: AzureSearchWriter
```
### Other Databases

Create a writer class for your database:

1. Implement `VectorDBWriterInterface`
2. Add it to your config with `import_path` and `class_name`
3. Install the provider package

See `nlweb_azure_vectordb.azure_search_writer` for a reference implementation, and the config sketch below.
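A config entry for such a provider would follow the same shape as the Azure example; everything below (`my_vectordb`, `my_pkg.my_writer`, `MyDatabaseWriter`) is a hypothetical placeholder:

```yaml
retrieval_endpoints:
  my_vectordb:
    db_type: my_vectordb
    writer:
      enabled: true
      import_path: my_pkg.my_writer
      class_name: MyDatabaseWriter

write_endpoint: my_vectordb
```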
## Command Line Usage

```bash
# Load a schema.org JSON file
python -m nlweb_dataload.db_load \
  --file data/recipes.json \
  --site seriouseats \
  --config config.yaml

# Load an RSS/Atom feed
python -m nlweb_dataload.db_load \
  --file https://example.com/feed.xml \
  --site example \
  --type rss \
  --config config.yaml

# Delete all documents for a site
python -m nlweb_dataload.db_load \
  --delete-site old-site.com \
  --config config.yaml
```
## Dependencies

- `nlweb-core>=0.5.0` - Core NLWeb functionality
- `feedparser>=6.0.0` - RSS/Atom feed parsing
- `aiohttp>=3.8.0` - Async HTTP for URL loading
## Development

```bash
# Install with development extras
pip install -e "packages/dataload[dev]"

# Run the test suite
pytest packages/dataload/tests
```
## License

MIT License - Copyright (c) 2025 Microsoft Corporation