Big News: Socket raises $60M Series C at a $1B valuation to secure software supply chains for AI-driven development.Announcement →

filecrystal

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

filecrystal

Universal file parser for PDFs, images, xlsx/xls, docx — outputs Markdown and prompt-defined JSON via any OpenAI-compatible API.

latest

Source

npm

Version: 0.5.3

Version published: last month

Weekly downloads: 20

Maintainers: 1

Weekly downloads

Created: last month

Source

filecrystal

Universal file parser for PDFs, images, xlsx/xls and docx — with structured field extraction via any OpenAI-compatible API.

Why

One consistent ParseResult for every supported file format, plus a Markdown-first pipeline. Plug any OpenAI-compatible provider (OpenAI / Moonshot / DeepSeek / 阿里百炼 / self-hosted vLLM) for OCR, seal detection and prompt-driven field extraction — switching provider is a change of baseUrl + model, not a code rewrite.

Install

pnpm add filecrystal
# or
npm i filecrystal

Quick start (CLI)

Two focused subcommands: extract (files → Markdown) and structure (Markdown / files → prompt-defined JSON).

# 1. Parse files to Markdown — defaults to writing next to each input
filecrystal extract ./a.pdf ./b.xlsx
# → ./a.md  ./b.md

# Write to a dedicated directory
filecrystal extract ./*.pdf --out ./out/

# 2. Extract structured fields with a prompt (file or inline)
filecrystal structure ./out/a.md --prompt ./prompts/contract.prompt.md
filecrystal structure ./out/a.md --prompt-text '输出 JSON: {"title":"..."}'

Full option reference: docs/CLI.md.

Credentials

# Default OCR/LLM backend: any OpenAI-compatible provider.
# Alibaba 百炼 (Qwen) is the default preset when you use DashScope.
export FILECRYSTAL_MODEL_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
export FILECRYSTAL_MODEL_API_KEY=sk-your-key-here

# Optional model overrides
export FILECRYSTAL_VISION_MODEL=qwen-vl-ocr-latest        # OCR + seal detection
export FILECRYSTAL_TEXT_MODEL=qwen3.6-plus                # structure stage
export FILECRYSTAL_VISION_MODEL_THINKING=false            # Qwen3 reasoning for OCR
export FILECRYSTAL_TEXT_MODEL_THINKING=false              # Qwen3 reasoning for structure

# Optional concurrency tuning
export FILECRYSTAL_FILE_CONCURRENCY=20   # CLI file-level parallelism (extract + structure)
export FILECRYSTAL_OCR_CONCURRENCY=24    # process-wide OCR / vision pool; lower if rate-limited

Aliyun OCR provider

For OCR-only Markdown extraction you can use Aliyun OCR directly, without an OpenAI-compatible vision model:

export FILECRYSTAL_OCR_PROVIDER=aliyun-ocr
export FILECRYSTAL_ALIYUN_ACCESS_KEY_ID=your-access-key-id
export FILECRYSTAL_ALIYUN_ACCESS_KEY_SECRET=your-access-key-secret

filecrystal extract ./scan.pdf --out ./out/

Aliyun OCR uses RecognizeAdvanced with automatic rotation and table output on by default, so scanned forms and payment tables can render as Markdown tables without extra flags. filecrystal structure still needs text LLM credentials (FILECRYSTAL_MODEL_BASE_URL + FILECRYSTAL_MODEL_API_KEY) because that stage runs prompt-defined JSON extraction after Markdown is produced.

Quick start (library)

import {
  createFileParser,
  createStructuredExtractor,
  parseMany,
  toMarkdown,
  toStructureSource,
} from 'filecrystal';

// --- Mock mode — works offline, deterministic placeholders ---
const parser = createFileParser({ mode: 'mock' });
const { raw, source } = await parser.parse('./contract.pdf');
const md = toMarkdown({ raw, source });

// --- API mode ---
const apiParser = createFileParser({
  mode: 'api',
  openai: {
    baseUrl: process.env.FILECRYSTAL_MODEL_BASE_URL!,
    apiKey: process.env.FILECRYSTAL_MODEL_API_KEY!,
    models: { ocr: 'qwen-vl-ocr-latest', vision: 'qwen-vl-max', text: 'qwen3.6-plus' },
  },
});

// --- Batch: many files concurrently ---
const batch = await parseMany(apiParser, ['./a.pdf', './b.xlsx'], { concurrency: 3 });

// --- Structured extraction: every source becomes Markdown text first,
//     then one prompt bundles them in argv order (single LLM call by default). ---
const extractor = createStructuredExtractor({
  mode: 'api',
  openai: { /* same as above */ },
});
const sources = batch.items
  .filter((i) => i.ok && i.result)
  .map((i) => toStructureSource(i.result!));  // → { name, text } per source
const { extracted } = await extractor.extract(sources, {
  prompt: customPromptMarkdown /* optional */,
});

Supported formats

Format	Notes
xlsx / xls	SheetJS; cells, merges, formulas
pdf	text-layer first, OCR fallback via `@napi-rs/canvas` + `sharp` preprocess
jpg / png	`sharp` preprocessing (EXIF rotate, long-edge 2000, JPEG q85) → OCR
docx / doc	`mammoth` main + `word-extractor` fallback; embedded images scanned for seals

Output shape (library)

interface ParseResult {
  schemaVersion: '1.0';
  parsedAt: string;
  parserVersion: string;
  source: ParsedSource;       // filePath, fileName, fileFormat, fileHash, pageCount, ...
  raw: ParsedRaw;             // pages | sheets | sections | fullText | seals | signatures
  extracted?: Record<string, unknown>;  // only when options.prompt — prompt owns the schema
  metrics: ParseMetrics;      // quality / performance / cost (CNY)
  warnings?: string[];
}

Full TypeScript contract: specs/001-file-parser/contracts/types.d.ts. JSON Schema at runtime:

import { getParseResultJsonSchema } from 'filecrystal/schema';
console.log(getParseResultJsonSchema());

Integration examples

Both integration surfaces — CLI and SDK — produce the same output shape. Pick either for your workflow.

Surface	When to use	Demo
CLI	shell scripts · CI pipelines · language-agnostic integrations	`examples/cli-workflow.sh`
SDK	Node.js apps · custom pre/post-processing · tight error handling	`examples/sdk-workflow.mjs`

Both walk through the same two stages: extract → Markdown, then structure → prompt-defined JSON. See examples/README.md for the quick-start commands.

Development

pnpm install
pnpm build
pnpm test
pnpm typecheck
pnpm lint

Spec-driven docs live in specs/001-file-parser/. Contributing guide: CONTRIBUTING.md.

License

MIT — see LICENSE.

Keywords

FAQs

What is filecrystal?

Is filecrystal popular?

Is filecrystal well maintained?

Package last updated on 28 Apr 2026

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

filecrystal

filecrystal

Why

Install

Quick start (CLI)

Credentials

Aliyun OCR provider

Quick start (library)

Supported formats

Output shape (library)

Integration examples

Development

License

Keywords

Related posts

Famous Chollima Targets PHP Developers Through Compromised Packagist Package

Rust Moves to Restrict LLM Use in Contributions After Months of Internal Debate