
@anyparser/core
The `@anyparser/core` TypeScript SDK enables developers to quickly extract structured data from a wide variety of file formats, including PDFs, images, websites, audio, and video.
Unlock the potential of your AI models with Anyparser Core, the TypeScript SDK designed for high-performance content extraction and format conversion. Built for developers, this SDK streamlines the process of acquiring clean, structured data from diverse sources, making it an indispensable tool for building cutting-edge applications in Retrieval Augmented Generation (RAG), Agentic AI, Generative AI, and robust ETL pipelines.
Key Benefits for AI Developers:
Get Started Quickly:
Install the SDK with a single `npm install` command. Before starting, create a new API key in the Anyparser Studio, then export your credentials:
```sh
export ANYPARSER_API_URL=https://anyparserapi.com
export ANYPARSER_API_KEY=<your-api-key>
```

or, for the EU endpoint:

```sh
export ANYPARSER_API_URL=https://eu.anyparserapi.com
export ANYPARSER_API_KEY=<your-api-key>
```

Then install the package:

```sh
npm install @anyparser/core
```
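A quick sanity check (plain shell, not an official Anyparser command) that both variables are actually exported before running the examples:

```sh
# Report whether each required variable is present in the environment.
for v in ANYPARSER_API_URL ANYPARSER_API_KEY; do
  if [ -z "$(printenv "$v")" ]; then
    echo "$v is not set"
  else
    echo "$v is set"
  fi
done
```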
These examples demonstrate how to use Anyparser Core for common AI tasks, arranged from basic to advanced usage.
When you're just getting started or prototyping, you can use this simplified approach with minimal configuration:
```ts
import { Anyparser } from '@anyparser/core'

async function main () {
  // Instantiate with default settings, assuming API credentials are
  // set as environment variables.
  console.log(await new Anyparser().parse('docs/sample.docx'))
}

main().catch(console.error)
```
This example showcases how to extract structured data from local files with full configuration, preparing them for indexing in a RAG system. The JSON output is ideal for vector databases and downstream AI processing. Perfect for building your initial knowledge base with high-quality, structured data.
```ts
import type { AnyparserOption, AnyparserResultBase } from '@anyparser/core'
import { Anyparser } from '@anyparser/core'

const multipleFiles = ['docs/sample.docx', 'docs/sample.pdf']

const options: AnyparserOption = {
  apiUrl: new URL(process.env.ANYPARSER_API_URL ?? 'https://anyparserapi.com'),
  apiKey: process.env.ANYPARSER_API_KEY,
  format: 'json',
  image: true,
  table: true
}

const parser = new Anyparser(options)

async function main () {
  const result = await parser.parse(multipleFiles) as AnyparserResultBase[]

  for (const item of result) {
    console.log('-'.repeat(100))
    console.log('File:', item.originalFilename)
    console.log('Checksum:', item.checksum)
    console.log('Total characters:', item.totalCharacters)
    console.log('Markdown:', item.markdown?.substring(0, 500))
  }

  console.log('-'.repeat(100))
}

main().catch(console.error)
```
Extract text from images and scanned documents using our advanced OCR capabilities. This example shows how to configure language and preset options for optimal results, particularly useful for processing historical documents, receipts, or any image-based content:
```ts
import type { AnyparserOption } from '@anyparser/core'
import { Anyparser, OCR_LANGUAGES } from '@anyparser/core'

const singleFile = 'docs/document.png'

const options: AnyparserOption = {
  apiUrl: new URL(process.env.ANYPARSER_API_URL ?? 'https://anyparserapi.com'),
  apiKey: process.env.ANYPARSER_API_KEY,
  model: 'ocr',
  format: 'markdown',
  // ocrLanguage: ['eng'],
  ocrLanguage: [OCR_LANGUAGES.JAPANESE],
  ocrPreset: 'scan'
}

const parser = new Anyparser(options)

async function main () {
  const result = await parser.parse(singleFile)
  console.log(result)
}

main().catch(console.error)
```
Keep your knowledge base fresh with our powerful web crawling capabilities. This example shows how to crawl websites while respecting robots.txt directives and maintaining politeness delays:
```ts
import type { AnyparserCrawlResult, AnyparserOption, AnyparserUrl } from '@anyparser/core'
import { Anyparser } from '@anyparser/core'

const url = 'https://anyparser.com/docs'

const options: AnyparserOption = {
  apiUrl: new URL(process.env.ANYPARSER_API_URL ?? 'https://anyparserapi.com'),
  apiKey: process.env.ANYPARSER_API_KEY,
  model: 'crawler',
  format: 'json',
  maxDepth: 50,
  maxExecutions: 2,
  strategy: 'LIFO',
  traversalScope: 'subtree'
}

const parser = new Anyparser(options)

async function main () {
  const result = await parser.parse(url) as AnyparserCrawlResult[]

  for (const candidate of result) {
    console.log('\n')
    console.log('Start URL        :', candidate.startUrl)
    console.log('Total characters :', candidate.totalCharacters)
    console.log('Total items      :', candidate.totalItems)
    console.log('Robots directive :', candidate.robotsDirective)
    console.log('\n')
    console.log('*'.repeat(100))
    console.log('Begin Crawl')
    console.log('*'.repeat(100))
    console.log('\n')

    const resources = candidate.items || []

    for (let index = 0; index < resources.length; index++) {
      const item = resources[index] as AnyparserUrl

      if (index > 0) {
        console.log('-'.repeat(100))
        console.log('\n')
      }

      console.log('URL              :', item.url)
      console.log('Title            :', item.title)
      console.log('Status message   :', item.statusMessage)
      console.log('Total characters :', item.totalCharacters)
      console.log('Politeness delay :', item.politenessDelay)
      console.log('Content:\n')
      console.log(item.markdown)
    }
  }
}

main().catch(console.error)
```
The SDK exposes the `AnyparserOption` interface for flexible configuration, allowing you to fine-tune the extraction process for different AI use cases.
```ts
export interface AnyparserOption {
  apiUrl?: URL
  apiKey?: string
  format?: AnyparserFormatType
  model?: AnyparserModelType
  encoding?: AnyparserEncodingType
  image?: boolean
  table?: boolean
  files?: string | string[]
  ocrLanguage?: OcrLanguageType[]
  ocrPreset?: OcrPresetType
  url?: string
  maxDepth?: number
  maxExecutions?: number
  strategy?: 'LIFO' | 'FIFO'
  traversalScope?: 'subtree' | 'domain'
}
```
Key Configuration Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `apiUrl` | `URL` (optional) | `undefined` | API endpoint URL. Defaults to the `ANYPARSER_API_URL` environment variable |
| `apiKey` | `string` (optional) | `undefined` | API key for authentication. Defaults to the `ANYPARSER_API_KEY` environment variable |
| `format` | `string` | `"json"` | Output format: `"json"`, `"markdown"`, or `"html"` |
| `model` | `string` | `"text"` | Processing model: `"text"`, `"ocr"`, `"vlm"`, `"lam"`, or `"crawler"` |
| `encoding` | `string` | `"utf-8"` | Text encoding: `"utf-8"` or `"latin1"` |
| `image` | `boolean` (optional) | `undefined` | Enable/disable image extraction |
| `table` | `boolean` (optional) | `undefined` | Enable/disable table extraction |
| `files` | `string \| string[]` (optional) | `undefined` | Input files to process |
| `url` | `string` (optional) | `undefined` | URL for the crawler model |
| `ocrLanguage` | `OcrLanguageType[]` (optional) | `undefined` | Languages for OCR processing |
| `ocrPreset` | `OcrPresetType` (optional) | `undefined` | Preset configuration for OCR |
| `maxDepth` | `number` (optional) | `undefined` | Maximum crawl depth for the crawler model |
| `maxExecutions` | `number` (optional) | `undefined` | Maximum number of pages to crawl |
| `strategy` | `string` (optional) | `undefined` | Crawling strategy: `"LIFO"` or `"FIFO"` |
| `traversalScope` | `string` (optional) | `undefined` | Crawling scope: `"subtree"` or `"domain"` |
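As a minimal sketch of wiring these parameters together (the relevant option fields are re-declared locally so the snippet is self-contained; in real code, import `AnyparserOption` from `@anyparser/core`; the `optionsFromEnv` helper is this sketch's own invention, not part of the SDK):

```ts
// Local stand-in for the relevant AnyparserOption fields; in real code,
// import the interface from '@anyparser/core' instead.
interface OptionsSketch {
  apiUrl: URL
  apiKey?: string
  format?: 'json' | 'markdown' | 'html'
}

// Build options from the environment, falling back to the public endpoint,
// mirroring the pattern used in the examples above.
function optionsFromEnv (format: 'json' | 'markdown' | 'html' = 'json'): OptionsSketch {
  return {
    apiUrl: new URL(process.env.ANYPARSER_API_URL ?? 'https://anyparserapi.com'),
    apiKey: process.env.ANYPARSER_API_KEY,
    format
  }
}

const opts = optionsFromEnv('markdown')
console.log(opts.apiUrl.hostname, opts.format)
```

Centralizing the environment fallback in one helper keeps the individual examples free of repeated `process.env` boilerplate.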
OCR Presets:
The following OCR presets are available for optimized document processing:

- `OCRPreset.DOCUMENT` - General document processing
- `OCRPreset.HANDWRITING` - Handwritten text recognition
- `OCRPreset.SCAN` - Scanned document processing
- `OCRPreset.RECEIPT` - Receipt processing
- `OCRPreset.MAGAZINE` - Magazine/article processing
- `OCRPreset.INVOICE` - Invoice processing
- `OCRPreset.BUSINESS_CARD` - Business card processing
- `OCRPreset.PASSPORT` - Passport document processing
- `OCRPreset.DRIVER_LICENSE` - Driver's license processing
- `OCRPreset.IDENTITY_CARD` - ID card processing
- `OCRPreset.LICENSE_PLATE` - License plate recognition
- `OCRPreset.MEDICAL_REPORT` - Medical document processing
- `OCRPreset.BANK_STATEMENT` - Bank statement processing

OCR Language:
OCR languages are specified as ISO 639-2 codes via the `OCR_LANGUAGES` constants (for example, `OCR_LANGUAGES.JAPANESE` in the OCR example above).
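As an illustrative sketch of choosing among these presets, a hypothetical helper mapping a coarse document category to one of the preset string values used by the SDK (the categories and the `presetFor` function are this sketch's own invention, and the union type is re-declared locally so the snippet is self-contained):

```ts
// Preset string values as listed in this README, re-declared locally
// for illustration.
type OcrPreset =
  | 'document' | 'handwriting' | 'scan' | 'receipt' | 'magazine'
  | 'invoice' | 'business-card' | 'passport' | 'driver-license'
  | 'identity-card' | 'license-plate' | 'medical-report' | 'bank-statement'

// Hypothetical routing from a coarse document category to a preset.
function presetFor (kind: 'expense' | 'identity' | 'archive' | 'other'): OcrPreset {
  switch (kind) {
    case 'expense': return 'receipt'
    case 'identity': return 'passport'
    case 'archive': return 'scan'
    default: return 'document'
  }
}

console.log(presetFor('expense')) // → 'receipt'
```

The returned string would go in the `ocrPreset` option alongside `model: 'ocr'`, as in the OCR example earlier.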
Model Types for AI Data Pipelines:
Select the appropriate processing model based on your AI application needs:

- `'text'`: Optimized for extracting textual content for language models and general text-based RAG.
- `'ocr'`: Performs Optical Character Recognition to extract text from image-based documents, expanding your data sources for AI training and knowledge bases. Essential for processing scanned documents for RAG and Generative AI.
- `'vlm'`: Utilizes a Vision-Language Model for advanced understanding of image content, enabling richer context for Generative AI and more sophisticated Agentic AI perception.
- `'lam'` (Coming Soon): Employs a Large-Audio Model for extracting insights from audio data, opening new possibilities for multimodal AI applications.
- `'crawler'`: Enables website crawling to gather up-to-date information for dynamic AI knowledge bases and Agentic AI agents.

OCR Configuration for Enhanced AI Data Quality (when `model: 'ocr'`):
Fine-tune OCR settings for optimal accuracy when processing image-based documents. This is critical for ensuring high-quality data for your AI models.
| Option | Type | Default | Description | Relevance for AI |
|---|---|---|---|---|
| `ocrLanguage` | `OcrLanguageType[]` (optional) | `undefined` | List of ISO 639-2 language codes for OCR, ensuring accurate text extraction for multilingual documents. | Essential for accurate data extraction from documents in different languages for global AI. |
| `ocrPreset` | `OcrPresetType` (optional) | `undefined` | Predefined configuration for specific document types to optimize OCR accuracy. | Use presets to improve accuracy for specific document types used in your AI workflows. |
Available OCR Presets for AI Data Preparation:
Leverage these presets for common document types used in AI datasets:

- `'document'`: General-purpose OCR for standard documents.
- `'handwriting'`: Optimized for handwritten text, useful for digitizing historical documents or notes for AI analysis.
- `'scan'`: For scanned documents and images.
- `'receipt'`, `'magazine'`, `'invoice'`, `'business-card'`, `'passport'`, `'driver-license'`, `'identity-card'`, `'license-plate'`, `'medical-report'`, `'bank-statement'`: Crucial for building structured datasets for training specialized AI models or powering Agentic AI agents that interact with these document types.

We welcome contributions to the Anyparser Core SDK, particularly those that enhance its capabilities for AI data preparation. Please refer to the Contribution Guidelines.
While Anyparser is already a powerful solution for document parsing, we’re committed to continually improving and expanding our platform through an ongoing roadmap.

License: Apache-2.0
For technical support or inquiries related to using Anyparser Core for AI applications, please visit our Community Discussions. We are here to help you build the next generation of AI applications.