
filecrystal
Universal file parser for PDFs, images, xlsx/xls and docx. Outputs Markdown and prompt-defined JSON, with structured field extraction via any OpenAI-compatible API.
One consistent ParseResult for every supported file format, plus a
Markdown-first pipeline. Plug in any OpenAI-compatible provider (OpenAI /
Moonshot / DeepSeek / 阿里百炼 / self-hosted vLLM) for OCR, seal detection and
prompt-driven field extraction. Switching providers is a change of baseUrl +
model, not a code rewrite.
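As a rough illustration of "a change of baseUrl + model", the sketch below keeps provider settings in a small preset table. The config shape mirrors the `openai` option shown in the SDK example in this README; `PRESETS`, `ProviderName` and `providerConfig` are hypothetical names, not part of filecrystal, and the OpenAI model names are placeholders to replace with real ones.

```typescript
// Shape mirrors the `openai` option used in this README's SDK example.
interface OpenAICompatibleConfig {
  baseUrl: string;
  apiKey: string;
  models: { ocr: string; vision: string; text: string };
}

type ProviderName = "dashscope" | "openai";

// Hypothetical preset table (an assumption for illustration only).
// Confirm base URLs and model names against each provider's documentation.
const PRESETS: Record<ProviderName, Omit<OpenAICompatibleConfig, "apiKey">> = {
  dashscope: {
    baseUrl: "https://dashscope.aliyuncs.com/compatible-mode/v1",
    models: { ocr: "qwen-vl-ocr-latest", vision: "qwen-vl-max", text: "qwen3.6-plus" },
  },
  openai: {
    baseUrl: "https://api.openai.com/v1",
    // Placeholder model names: substitute models your account can access.
    models: { ocr: "your-vision-model", vision: "your-vision-model", text: "your-text-model" },
  },
};

// Build a full config for one provider; only the preset lookup changes per provider.
function providerConfig(name: ProviderName, apiKey: string): OpenAICompatibleConfig {
  const preset = PRESETS[name];
  return { baseUrl: preset.baseUrl, apiKey, models: { ...preset.models } };
}
```

The returned object could then be passed as the `openai` option when creating a parser, so the calling code stays the same across providers.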
pnpm add filecrystal
# or
npm i filecrystal
Two focused subcommands: extract (files → Markdown) and structure
(Markdown / files → prompt-defined JSON).
# 1. Parse files to Markdown — defaults to writing next to each input
filecrystal extract ./a.pdf ./b.xlsx
# → ./a.md ./b.md
# Write to a dedicated directory
filecrystal extract ./*.pdf --out ./out/
# 2. Extract structured fields with a prompt (file or inline)
filecrystal structure ./out/a.md --prompt ./prompts/contract.prompt.md
filecrystal structure ./out/a.md --prompt-text 'Output JSON: {"title":"..."}'
Full option reference: docs/CLI.md.
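The output locations in the commands above (`./a.pdf` → `./a.md`, or `./out/a.md` with `--out ./out/`) follow a simple naming rule. The sketch below reproduces that rule for POSIX-style paths only. `outputPathFor` is a hypothetical helper written for illustration, not filecrystal's implementation, and it does not address what happens when two inputs share a base name.

```typescript
// Derive the Markdown output path for one input file.
// Assumption: POSIX-style "/" separators; Windows paths are not handled.
function outputPathFor(inputPath: string, outDir?: string): string {
  const slash = inputPath.lastIndexOf("/");
  const dir = slash >= 0 ? inputPath.slice(0, slash + 1) : "";
  const file = inputPath.slice(slash + 1);

  // Drop the final extension, but keep dotfiles such as ".env" intact.
  const dot = file.lastIndexOf(".");
  const stem = dot > 0 ? file.slice(0, dot) : file;

  // Without an output directory, write next to the input.
  let targetDir = dir;
  if (outDir !== undefined) {
    targetDir = outDir.endsWith("/") ? outDir : outDir + "/";
  }
  return targetDir + stem + ".md";
}
```

For example, `outputPathFor("./a.pdf")` gives `./a.md`, and `outputPathFor("./b.xlsx", "./out/")` gives `./out/b.md`.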
# Default OCR/LLM backend: any OpenAI-compatible provider.
# Alibaba 百炼 (Qwen) is the default preset when you use DashScope.
export FILECRYSTAL_MODEL_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
export FILECRYSTAL_MODEL_API_KEY=sk-your-key-here
# Optional model overrides
export FILECRYSTAL_VISION_MODEL=qwen-vl-ocr-latest # OCR + seal detection
export FILECRYSTAL_TEXT_MODEL=qwen3.6-plus # structure stage
export FILECRYSTAL_VISION_MODEL_THINKING=false # Qwen3 reasoning for OCR
export FILECRYSTAL_TEXT_MODEL_THINKING=false # Qwen3 reasoning for structure
# Optional concurrency tuning
export FILECRYSTAL_FILE_CONCURRENCY=20 # CLI file-level parallelism (extract + structure)
export FILECRYSTAL_OCR_CONCURRENCY=24 # process-wide OCR / vision pool; lower if rate-limited
For OCR-only Markdown extraction you can use Aliyun OCR directly, without an OpenAI-compatible vision model:
export FILECRYSTAL_OCR_PROVIDER=aliyun-ocr
export FILECRYSTAL_ALIYUN_ACCESS_KEY_ID=your-access-key-id
export FILECRYSTAL_ALIYUN_ACCESS_KEY_SECRET=your-access-key-secret
filecrystal extract ./scan.pdf --out ./out/
Aliyun OCR uses RecognizeAdvanced with automatic rotation and table output on
by default, so scanned forms and payment tables can render as Markdown tables
without extra flags. filecrystal structure still needs text LLM credentials
(FILECRYSTAL_MODEL_BASE_URL + FILECRYSTAL_MODEL_API_KEY) because that stage
runs prompt-defined JSON extraction after Markdown is produced.
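A small preflight check can make the credential split described above explicit before a pipeline starts. This is a minimal sketch under two assumptions: that `extract` with the default provider needs the same two model variables as `structure`, and that nothing else is required. `missingEnv`, `Stage` and `Env` are hypothetical names, not part of filecrystal.

```typescript
type Stage = "extract" | "structure";
type Env = Record<string, string | undefined>;

// Return the names of required variables that are unset or empty for a stage.
function missingEnv(stage: Stage, env: Env): string[] {
  const modelVars = ["FILECRYSTAL_MODEL_BASE_URL", "FILECRYSTAL_MODEL_API_KEY"];
  const aliyunVars = [
    "FILECRYSTAL_ALIYUN_ACCESS_KEY_ID",
    "FILECRYSTAL_ALIYUN_ACCESS_KEY_SECRET",
  ];

  // Only the extract stage can run on Aliyun OCR credentials alone;
  // structure always needs the text LLM credentials.
  const useAliyun =
    stage === "extract" && env["FILECRYSTAL_OCR_PROVIDER"] === "aliyun-ocr";

  const required = useAliyun ? aliyunVars : modelVars;
  return required.filter((name) => !env[name]);
}
```

In a Node.js script this could be called as `missingEnv("structure", process.env)` and the run aborted when the returned list is non-empty.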
import {
  createFileParser,
  createStructuredExtractor,
  parseMany,
  toMarkdown,
  toStructureSource,
} from 'filecrystal';

// --- Mock mode — works offline, deterministic placeholders ---
const parser = createFileParser({ mode: 'mock' });
const { raw, source } = await parser.parse('./contract.pdf');
const md = toMarkdown({ raw, source });

// --- API mode ---
const apiParser = createFileParser({
  mode: 'api',
  openai: {
    baseUrl: process.env.FILECRYSTAL_MODEL_BASE_URL!,
    apiKey: process.env.FILECRYSTAL_MODEL_API_KEY!,
    models: { ocr: 'qwen-vl-ocr-latest', vision: 'qwen-vl-max', text: 'qwen3.6-plus' },
  },
});

// --- Batch: many files concurrently ---
const batch = await parseMany(apiParser, ['./a.pdf', './b.xlsx'], { concurrency: 3 });

// --- Structured extraction: every source becomes Markdown text first,
// then one prompt bundles them in argv order (single LLM call by default). ---
const extractor = createStructuredExtractor({
  mode: 'api',
  openai: { /* same as above */ },
});
const sources = batch.items
  .filter((i) => i.ok && i.result)
  .map((i) => toStructureSource(i.result!)); // → { name, text } per source
const { extracted } = await extractor.extract(sources, {
  prompt: customPromptMarkdown /* optional */,
});
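To make the "one prompt bundles them in argv order" step concrete, here is a minimal sketch of how several `{ name, text }` sources might be concatenated into a single prompt. The section layout and the `bundlePrompt` helper are assumptions for illustration; filecrystal's actual prompt format is not shown in this README and may differ.

```typescript
// Mirrors the { name, text } pair that the example above describes per source.
interface StructureSource {
  name: string;
  text: string;
}

// Join the instruction prompt and every source into one request body,
// preserving the order in which the sources were given.
function bundlePrompt(prompt: string, sources: StructureSource[]): string {
  const sections = sources.map(
    (s, i) => `## Source ${i + 1}: ${s.name}\n\n${s.text.trim()}`,
  );
  return [prompt.trim(), ...sections].join("\n\n");
}
```

Keeping the source names in the bundled text lets the prompt refer to individual files, which matters when a single call covers several documents.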
| Format | Notes |
|---|---|
| xlsx / xls | SheetJS; cells, merges, formulas |
| pdf | text-layer first, OCR fallback via @napi-rs/canvas + sharp preprocess |
| jpg / png | sharp preprocessing (EXIF rotate, long-edge 2000, JPEG q85) → OCR |
| docx / doc | mammoth main + word-extractor fallback; embedded images scanned for seals |
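The "long-edge 2000" step in the image row amounts to a proportional resize. The sketch below shows only the arithmetic, assuming smaller images are left unscaled (the README does not say whether they are enlarged). `fitLongEdge` is a hypothetical helper, not a filecrystal or sharp API.

```typescript
// Scale dimensions so the longer side is at most maxLongEdge pixels,
// keeping the aspect ratio. Images already within the limit are unchanged.
function fitLongEdge(
  width: number,
  height: number,
  maxLongEdge = 2000,
): { width: number; height: number } {
  const longEdge = Math.max(width, height);
  if (longEdge <= maxLongEdge) {
    return { width, height };
  }
  const scale = maxLongEdge / longEdge;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

A 4000×3000 scan becomes 2000×1500, while an 800×600 image passes through at its original size.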
interface ParseResult {
  schemaVersion: '1.0';
  parsedAt: string;
  parserVersion: string;
  source: ParsedSource; // filePath, fileName, fileFormat, fileHash, pageCount, ...
  raw: ParsedRaw; // pages | sheets | sections | fullText | seals | signatures
  extracted?: Record<string, unknown>; // only when options.prompt — prompt owns the schema
  metrics: ParseMetrics; // quality / performance / cost (CNY)
  warnings?: string[];
}
Full TypeScript contract: specs/001-file-parser/contracts/types.d.ts.
JSON Schema at runtime:
import { getParseResultJsonSchema } from 'filecrystal/schema';
console.log(getParseResultJsonSchema());
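When a ParseResult is read back from disk or another process, a lightweight structural check can catch version mismatches before the data is used. The sketch below checks only a few top-level fields from the interface above. `ParseResultLike` and `isParseResultV1` are hypothetical names, and this is not a substitute for validating against the JSON Schema.

```typescript
// A deliberately partial view of ParseResult: only the fields checked below.
interface ParseResultLike {
  schemaVersion: "1.0";
  parsedAt: string;
  parserVersion: string;
  warnings?: string[];
}

// Type guard: true when the value looks like a schema 1.0 ParseResult.
function isParseResultV1(value: unknown): value is ParseResultLike {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const v = value as Record<string, unknown>;
  const warnings = v["warnings"];
  const warningsOk =
    warnings === undefined ||
    (Array.isArray(warnings) && warnings.every((w) => typeof w === "string"));
  return (
    v["schemaVersion"] === "1.0" &&
    typeof v["parsedAt"] === "string" &&
    typeof v["parserVersion"] === "string" &&
    warningsOk
  );
}
```

A caller might run this on `JSON.parse` output and fall back to re-parsing the file when it returns false.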
Both integration surfaces, CLI and SDK, produce the same output shape. Pick whichever fits your workflow.
| Surface | When to use | Demo |
|---|---|---|
| CLI | shell scripts · CI pipelines · language-agnostic integrations | examples/cli-workflow.sh |
| SDK | Node.js apps · custom pre/post-processing · tight error handling | examples/sdk-workflow.mjs |
Both walk through the same two stages: extract → Markdown, then
structure → prompt-defined JSON. See examples/README.md
for the quick-start commands.
pnpm install
pnpm build
pnpm test
pnpm typecheck
pnpm lint
Spec-driven docs live in specs/001-file-parser/.
Contributing guide: CONTRIBUTING.md.
MIT — see LICENSE.
FAQs
The npm package filecrystal receives a total of 17 weekly downloads. As such, filecrystal popularity was classified as not popular.
We found that filecrystal demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.