
AST-aware code chunking for semantic search and RAG pipelines.
Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.
Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. code-chunk takes a different approach:
Source code is parsed into an Abstract Syntax Tree (AST) using tree-sitter, giving a structured representation of the code that reflects the language's grammar.
We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity we capture identifying details such as its full signature (e.g. `async getUser(id: string): Promise<User>`).

Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This lets us provide scope context such as `UserService > getUser`.
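The scope tree can be pictured with a small sketch. This is illustrative only; `ScopeNode` and `scopePath` are hypothetical names, not part of code-chunk's API:

```typescript
// Hypothetical scope tree: each node points at its enclosing scope,
// and the path is rendered parent-first, like "UserService > getUser".
interface ScopeNode {
  name: string
  type: 'class' | 'function' | 'method'
  parent?: ScopeNode
}

function scopePath(node: ScopeNode): string {
  const parts: string[] = []
  for (let cur: ScopeNode | undefined = node; cur; cur = cur.parent) {
    parts.unshift(cur.name) // walk up the tree, prepending each ancestor
  }
  return parts.join(' > ')
}

const userService: ScopeNode = { name: 'UserService', type: 'class' }
const getUser: ScopeNode = { name: 'getUser', type: 'method', parent: userService }
```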
Code is split at semantic boundaries while respecting the `maxChunkSize` limit, so whole entities are kept together wherever possible.
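One plausible way to split at entity boundaries under a byte limit is greedy packing. The sketch below is an assumption about the general approach, not code-chunk's actual implementation:

```typescript
// Illustrative greedy packing: accumulate whole entities into a chunk
// until adding the next one would exceed maxChunkSize (in bytes).
// An oversized entity becomes its own chunk rather than being cut in half.
function packEntities(entities: string[], maxChunkSize: number): string[] {
  const encoder = new TextEncoder()
  const chunks: string[] = []
  let current = ''
  for (const entity of entities) {
    const candidate = current ? current + '\n\n' + entity : entity
    if (encoder.encode(candidate).length <= maxChunkSize) {
      current = candidate // still fits, keep accumulating
    } else {
      if (current) chunks.push(current)
      current = entity // start a fresh chunk; may itself exceed the limit
    }
  }
  if (current) chunks.push(current)
  return chunks
}
```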
Each chunk is enriched with contextual metadata: its scope chain, file imports, sibling entities, and entity signatures. This context is formatted into `contextualizedText`, which helps embedding models capture semantic relationships.
```sh
bun add code-chunk
# or
npm install code-chunk
```
```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  console.log(c.text)
  console.log(c.context.scope)    // [{ name: 'UserService', type: 'class' }]
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
}
```
Use `contextualizedText` for better embedding quality in RAG systems:

```typescript
for (const c of chunks) {
  const embedding = await embed(c.contextualizedText)
  await vectorDB.upsert({
    id: `${filepath}:${c.index}`,
    embedding,
    metadata: { filepath, lines: c.lineRange },
  })
}
```
The `contextualizedText` prepends semantic context to the raw code:

```
# src/services/user.ts
# Scope: UserService
# Defines: async getUser(id: string): Promise<User>
# Uses: Database
# After: constructor
async getUser(id: string): Promise<User> {
  return this.db.query('SELECT * FROM users WHERE id = ?', [id])
}
```
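A minimal sketch of how such a header could be assembled. The `ChunkContextSketch` shape and `renderHeader` helper are invented for illustration; `formatChunkWithContext` (see the API reference) is the library's real entry point for this:

```typescript
// Hypothetical context shape mirroring the rendered header fields above.
interface ChunkContextSketch {
  filepath: string
  scope?: string   // e.g. 'UserService'
  defines?: string // e.g. 'async getUser(id: string): Promise<User>'
  uses?: string    // e.g. 'Database'
  after?: string   // e.g. 'constructor'
}

// Build the '# ...' comment header, emitting only the fields present.
function renderHeader(ctx: ChunkContextSketch): string {
  const lines = [`# ${ctx.filepath}`]
  if (ctx.scope) lines.push(`# Scope: ${ctx.scope}`)
  if (ctx.defines) lines.push(`# Defines: ${ctx.defines}`)
  if (ctx.uses) lines.push(`# Uses: ${ctx.uses}`)
  if (ctx.after) lines.push(`# After: ${ctx.after}`)
  return lines.join('\n')
}
```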
Process chunks incrementally without loading everything into memory:
```typescript
import { chunkStream } from 'code-chunk'

for await (const c of chunkStream('src/large.ts', code)) {
  await process(c)
}
```
Create a chunker instance when processing multiple files with the same config:
```typescript
import { createChunker } from 'code-chunk'

const chunker = createChunker({
  maxChunkSize: 2048,
  contextMode: 'full',
  siblingDetail: 'signatures',
})

for (const file of files) {
  const chunks = await chunker.chunk(file.path, file.content)
}
```
Process multiple files concurrently with error handling per file:
```typescript
import { chunkBatch } from 'code-chunk'

const files = [
  { filepath: 'src/user.ts', code: userCode },
  { filepath: 'src/auth.ts', code: authCode },
  { filepath: 'lib/utils.py', code: utilsCode },
]

const results = await chunkBatch(files, {
  maxChunkSize: 1500,
  concurrency: 10,
  onProgress: (done, total, path, success) => {
    console.log(`[${done}/${total}] ${path}: ${success ? 'ok' : 'failed'}`)
  },
})

for (const result of results) {
  if (result.error) {
    console.error(`Failed: ${result.filepath}`, result.error)
  } else {
    await indexChunks(result.filepath, result.chunks)
  }
}
```
Stream results as they complete:
```typescript
import { chunkBatchStream } from 'code-chunk'

for await (const result of chunkBatchStream(files, { concurrency: 5 })) {
  if (result.chunks) {
    await indexChunks(result.filepath, result.chunks)
  }
}
```
For Effect-based pipelines:
```typescript
import { chunkStreamEffect } from 'code-chunk'
import { Effect, Stream } from 'effect'

const program = Stream.runForEach(
  chunkStreamEffect('src/utils.ts', code),
  (chunk) => Effect.log(chunk.text)
)

await Effect.runPromise(program)
```
`chunk(filepath, code, options?)`

Chunk source code into semantic pieces with context.

Parameters:

- `filepath`: File path (used for language detection)
- `code`: Source code string
- `options`: Optional configuration

Returns: `Promise<Chunk[]>`

Throws: `ChunkingError`, `UnsupportedLanguageError`
`chunkStream(filepath, code, options?)`

Stream chunks as they're generated. Useful for large files.

Returns: `AsyncGenerator<Chunk>`

Note: `chunk.totalChunks` is `-1` in streaming mode (unknown upfront).
`chunkStreamEffect(filepath, code, options?)`

Effect-native streaming API for composable pipelines.

Returns: `Stream.Stream<Chunk, ChunkingError | UnsupportedLanguageError>`
`createChunker(options?)`

Create a reusable chunker instance with default options.

Returns: a `Chunker` exposing `chunk()`, `stream()`, `chunkBatch()`, and `chunkBatchStream()` methods.
`chunkBatch(files, options?)`

Process multiple files concurrently with per-file error handling.

Parameters:

- `files`: Array of `{ filepath, code, options? }`
- `options`: Batch options (extends `ChunkOptions` with `concurrency` and `onProgress`)

Returns: `Promise<BatchResult[]>`, where each result has `{ filepath, chunks, error }`
`chunkBatchStream(files, options?)`

Stream batch results as files complete processing.

Returns: `AsyncGenerator<BatchResult>`
`chunkBatchEffect(files, options?)`

Effect-native batch processing.

Returns: `Effect.Effect<BatchResult[], never>`
`chunkBatchStreamEffect(files, options?)`

Effect-native streaming batch processing.

Returns: `Stream.Stream<BatchResult, never>`
`formatChunkWithContext(text, context, overlapText?)`

Format chunk text with semantic context prepended. Useful for custom embedding pipelines.

Returns: `string`
`detectLanguage(filepath)`

Detect the programming language from a file extension.

Returns: `Language | null`
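For intuition, extension-based detection can be approximated with a lookup table built from the supported-languages table below. `detectLanguageSketch` is a hypothetical stand-in, not the library function:

```typescript
// Illustrative extension -> language map, mirroring this README's
// supported-languages table; the real detectLanguage may differ.
type Language = 'typescript' | 'javascript' | 'python' | 'rust' | 'go' | 'java'

const EXT_MAP: Record<string, Language> = {
  '.ts': 'typescript', '.tsx': 'typescript', '.mts': 'typescript', '.cts': 'typescript',
  '.js': 'javascript', '.jsx': 'javascript', '.mjs': 'javascript', '.cjs': 'javascript',
  '.py': 'python', '.pyi': 'python',
  '.rs': 'rust',
  '.go': 'go',
  '.java': 'java',
}

function detectLanguageSketch(filepath: string): Language | null {
  const dot = filepath.lastIndexOf('.')
  if (dot === -1) return null // no extension at all
  return EXT_MAP[filepath.slice(dot)] ?? null
}
```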
| Option | Type | Default | Description |
|---|---|---|---|
| `maxChunkSize` | `number` | `1500` | Maximum chunk size in bytes |
| `contextMode` | `'none' \| 'minimal' \| 'full'` | `'full'` | How much context to include |
| `siblingDetail` | `'none' \| 'names' \| 'signatures'` | `'signatures'` | Level of sibling detail |
| `filterImports` | `boolean` | `false` | Filter out import statements |
| `language` | `Language` | auto | Override language detection |
| `overlapLines` | `number` | `10` | Lines from previous chunk to include in `contextualizedText` |
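As a concrete combination of the options above (the values here are illustrative, not recommendations):

```typescript
// Example ChunkOptions object; every field is documented in the table above.
const options = {
  maxChunkSize: 1200,     // bytes per chunk
  contextMode: 'minimal', // less context header per chunk
  siblingDetail: 'names', // sibling names only, no signatures
  filterImports: true,    // drop import statements from chunk text
  overlapLines: 5,        // previous-chunk lines carried into contextualizedText
}
```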
Extends ChunkOptions with:
| Option | Type | Default | Description |
|---|---|---|---|
concurrency | number | 10 | Maximum files to process concurrently |
onProgress | function | - | Callback (completed, total, filepath, success) => void |
| Language | Extensions |
|---|---|
| TypeScript | .ts, .tsx, .mts, .cts |
| JavaScript | .js, .jsx, .mjs, .cjs |
| Python | .py, .pyi |
| Rust | .rs |
| Go | .go |
| Java | .java |
`ChunkingError`: Thrown when chunking fails (parsing error, extraction error, etc.)

`UnsupportedLanguageError`: Thrown when the file extension is not supported

Both errors have a `_tag` property for Effect-style error handling.
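A sketch of `_tag`-based dispatch. The assumption that the `_tag` values match the error class names is mine; the README implies but does not state it:

```typescript
// Narrow an unknown error by its _tag discriminant, Effect-style.
// The messages returned here are purely illustrative.
function describeError(err: unknown): string {
  if (typeof err === 'object' && err !== null && '_tag' in err) {
    const tag = (err as { _tag: string })._tag
    if (tag === 'UnsupportedLanguageError') return 'skip: unsupported file type'
    if (tag === 'ChunkingError') return 'retry or report: chunking failed'
  }
  return 'unknown error'
}
```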
MIT