code-tokenizer-md
for me but you can use it too
Push the limits of possible
Quick Start
$ cd your-git-repo
$ npx code-tokenizer-md
Overview
code-tokenizer-md is a tool that processes the files in a git repository, cleans the code, redacts sensitive information, and generates markdown documentation with token counts computed by the Llama 3 tokenizer.
Philosophy
Human-first technologies for a better tomorrow.
graph TD
Start[Start] -->|Read| Git[Git Files]
Git -->|Clean| TC[TokenCleaner]
TC -->|Redact| Clean[Clean Code]
Clean -->|Generate| MD[Markdown]
MD -->|Count| Results[Token Counts]
style Start fill:#000000,stroke:#FFFFFF,stroke-width:4px,color:#ffffff
style Git fill:#222222,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style TC fill:#333333,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style Clean fill:#444444,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style MD fill:#555555,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style Results fill:#666666,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
Features
Data Processing
- Reads tracked files from the git repository
- Removes comments, imports, and unnecessary whitespace
- Redacts sensitive information (API keys, tokens, JWT, hashes)
- Counts tokens using llama3-tokenizer-js
- Supports nested .code-tokenizer-md-ignore files
Token Cleaning
- Removes single-line and multi-line comments
- Strips console.log statements
- Removes import statements
- Cleans up whitespace and empty lines
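The cleaning steps listed above can be approximated with a few regular expressions. This is a minimal sketch, not the package's actual TokenCleaner: the function name cleanCode and every pattern in it are assumptions chosen for illustration. Regex-based stripping does not understand string literals, so a "//" inside a string (a URL, for example) would also be cut.

```typescript
// Hypothetical sketch of the cleaning steps; not the package's real TokenCleaner.
function cleanCode(source: string): string {
  return source
    // Remove multi-line comments: /* ... */
    .replace(/\/\*[\s\S]*?\*\//g, "")
    // Remove single-line comments: // ...
    .replace(/\/\/.*$/gm, "")
    // Strip console.log(...) calls that fit on one line
    .replace(/^[ \t]*console\.log\(.*\);?[ \t]*$/gm, "")
    // Remove import statements that fit on one line
    .replace(/^[ \t]*import\b.*$/gm, "")
    // Trim trailing whitespace on every line
    .replace(/[ \t]+$/gm, "")
    // Drop lines that are now empty
    .split("\n")
    .filter((line) => line.trim() !== "")
    .join("\n");
}

const sample = [
  "import fs from 'fs';",
  "/* block",
  "   comment */",
  "const x = 1; // trailing note",
  "console.log(x);",
  "",
  "export default x;",
].join("\n");

const cleaned = cleanCode(sample);
// cleaned === "const x = 1;\nexport default x;"
```

Multi-line imports and console.log calls that span several lines are not handled here; a parser-based approach would be needed for those.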
Security Features
- Redacts API keys and secrets
- Masks JWT tokens
- Hides authorization tokens
- Redacts Base64 encoded strings
- Masks cryptographic hashes
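Redaction of this kind is typically a list of pattern and replacement pairs applied in order. The sketch below covers a subset of the categories above (JWTs, bearer tokens, quoted secret assignments, hex digests). The patterns, the placeholder strings, and the name redactSecrets are all assumptions for illustration; the package's real patterns may differ.

```typescript
// Hypothetical patterns for illustration; not the package's actual rules.
const secretPatterns: Array<[RegExp, string]> = [
  // JWTs: three base64url segments, the first two starting with "eyJ"
  [/\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g, "[REDACTED_JWT]"],
  // Authorization headers carrying a bearer token
  [/(Bearer\s+)[A-Za-z0-9._~+\/-]+=*/g, "$1[REDACTED_TOKEN]"],
  // name = "value" style assignments for common secret names
  [/((?:api[_-]?key|secret|token|password)\s*[:=]\s*)(["'])[^"']+\2/gi, "$1$2[REDACTED]$2"],
  // Hex digests of MD5 (32), SHA-1 (40) or SHA-256 (64) length
  [/\b(?:[a-f0-9]{64}|[a-f0-9]{40}|[a-f0-9]{32})\b/gi, "[REDACTED_HASH]"],
];

// Apply every pattern in order; earlier replacements are seen by later patterns.
function redactSecrets(source: string): string {
  return secretPatterns.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    source,
  );
}

const redacted = redactSecrets('const apiKey = "sk-12345";');
// redacted === 'const apiKey = "[REDACTED]";'
```

The JWT pattern runs first so that a JWT following "Bearer" is labeled as a JWT and not as a generic token.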
Requirements
- Node.js (>=14.0.0)
- Git repository
- Bun runtime (for development)
Installation
npm install code-tokenizer-md
Usage
CLI
npx code-tokenizer-md
Programmatic Usage
import { MarkdownGenerator } from 'code-tokenizer-md';
const generator = new MarkdownGenerator({
  dir: './project',
  outputFilePath: './output.md',
  verbose: true
});
const result = await generator.createMarkdownDocument();
Configuration
MarkdownGenerator Options
interface MarkdownGeneratorOptions {
  dir?: string;
  outputFilePath?: string;
  fileTypeExclusions?: Set<string>;
  fileExclusions?: string[];
  customPatterns?: Record<string, any>;
  customSecretPatterns?: Record<string, any>;
  verbose?: boolean;
}
Ignore File Configuration
Create a .code-tokenizer-md-ignore file in any directory to specify exclusions. Nested ignore files are supported; each one applies to its own directory and all of its subdirectories.
Example .code-tokenizer-md-ignore:
# Ignore specific files
secrets.json
config.private.ts
# Ignore directories
build/
temp/
# Glob patterns
**/*.test.ts
**/._*
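One way to picture the nested behavior: each ignore file only governs paths beneath its own directory, and its patterns are tested against the path relative to that directory. The sketch below assumes gitignore-like matching rules (a pattern without a slash matches at any depth, a trailing slash marks a directory) and supports only `*` and `**/`. The names parseIgnoreFile, globToRegExp, and isIgnored are hypothetical, and the tool's actual matching may differ.

```typescript
// Keep pattern lines; skip blank lines and "#" comments.
function parseIgnoreFile(contents: string): string[] {
  return contents
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"));
}

// Convert a small glob subset to a RegExp (assumed gitignore-like rules).
function globToRegExp(glob: string): RegExp {
  const dirOnly = glob.endsWith("/");
  const body = dirOnly ? glob.slice(0, -1) : glob;
  let re = "";
  for (let i = 0; i < body.length; i++) {
    const ch = body.charAt(i);
    if (body.startsWith("**/", i)) {
      re += "(?:.*/)?"; // zero or more leading directories
      i += 2;
    } else if (ch === "*") {
      re += "[^/]*"; // anything within a single path segment
    } else {
      re += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  // Patterns without a slash may match at any depth.
  const prefix = body.includes("/") ? "^" : "^(?:.*/)?";
  // Directory patterns match only paths beneath the directory.
  return new RegExp(prefix + re + (dirOnly ? "/.*$" : "(?:/.*)?$"));
}

// Keys are directories relative to the repo root ("" is the root itself).
type IgnoreFiles = Record<string, string[]>;

function isIgnored(filePath: string, ignoreFiles: IgnoreFiles): boolean {
  return Object.keys(ignoreFiles).some((dir) => {
    const scope = dir === "" ? "" : dir + "/";
    if (!filePath.startsWith(scope)) return false; // outside this ignore file's directory
    const relative = filePath.slice(scope.length);
    return (ignoreFiles[dir] ?? []).some((glob) => globToRegExp(glob).test(relative));
  });
}

const ignoreFiles: IgnoreFiles = {
  "": parseIgnoreFile("# Ignore directories\nbuild/\n**/*.test.ts"),
  "packages/api": parseIgnoreFile("secrets.json"),
};
```

With these rules, packages/api/secrets.json is ignored while packages/web/secrets.json is not, because the second ignore file only applies under packages/api. Negation patterns (a leading `!`) are not handled in this sketch.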
Default Exclusions
The tool automatically excludes common file types and patterns:
File Types:
- Images: .jpg, .jpeg, .png, .gif, .bmp, .svg, .webp, etc.
- Fonts: .ttf, .woff, .woff2, .eot, .otf
- Binaries: .exe, .dll, .so, .dylib, .bin
- Archives: .zip, .tar, .gz, .rar, .7z
- Media: .mp3, .mp4, .avi, .mov, .wav
- Data: .db, .sqlite, .sqlite3
- Config: .lock, .yaml, .yml, .toml, .conf
File Patterns:
- Configuration files: .*rc, tsconfig.json, package-lock.json
- Version control: .git*, .hg*, .svn*
- Environment files: .env*
- Build outputs: build/, dist/, out/
- Dependencies: node_modules/
- Documentation: docs/, README*, CHANGELOG*
- IDE settings: .idea/, .vscode/
- Test files: test/, spec/, tests/
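A check against defaults like these can be sketched as a set lookup on the file extension plus a scan of the directory segments. The two sets below hold only a small, hand-picked subset of the lists above, and the function name isExcludedByDefault is hypothetical; this is not the contents of the package's fileExclusions.ts or fileTypeExclusions.ts.

```typescript
// Hypothetical subsets of the default exclusion lists, for illustration only.
const fileTypeExclusions = new Set<string>([
  ".png", ".jpg", ".woff2", ".zip", ".sqlite", ".lock", ".yaml",
]);
const excludedDirectories = new Set<string>([
  "node_modules", "dist", "build", "docs", ".vscode", "test",
]);

function isExcludedByDefault(filePath: string): boolean {
  const segments = filePath.split("/");
  const fileName = segments[segments.length - 1] ?? "";
  // Any excluded directory along the path excludes the file.
  if (segments.slice(0, -1).some((dir) => excludedDirectories.has(dir))) return true;
  // Environment files (.env*) and README* are matched by name prefix.
  if (fileName.startsWith(".env") || fileName.startsWith("README")) return true;
  // Otherwise fall back to a case-insensitive extension check.
  const dot = fileName.lastIndexOf(".");
  return dot >= 0 && fileTypeExclusions.has(fileName.slice(dot).toLowerCase());
}
```

Using a Set keeps the extension lookup constant-time and mirrors the Set<string> type of the fileTypeExclusions option.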
Development
This project uses Bun for development. To contribute:
Setup
git clone <repository>
cd code-tokenizer-md
bun install
Scripts
bun run build
bun test
bun run lint
bun run lint:fix
bun run format
bun run fix
bun run dev
bun run deploy:dev
Project Structure
src/
├── index.ts # Main exports
├── TokenCleaner.ts # Code cleaning and redaction
├── MarkdownGenerator.ts # Markdown generation logic
├── cli.ts # CLI implementation
├── fileExclusions.ts # File exclusion patterns
└── fileTypeExclusions.ts # File type exclusions
Contributing
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Guidelines
- Write TypeScript code following the project's style
- Include appropriate error handling
- Add documentation for new features
- Include tests for new functionality
- Update the README for significant changes
Note
This tool requires a git repository to function properly because it uses git ls-files to identify tracked files.
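For readers curious how tracked files can be listed, here is a minimal sketch built around git ls-files -z, which separates paths with NUL bytes so that unusual file names survive intact. The function names and the injected runGit callback are assumptions made to keep the example self-contained; the tool's own implementation may invoke git differently.

```typescript
// Split NUL-separated `git ls-files -z` output into individual paths.
function parseLsFiles(stdout: string): string[] {
  return stdout.split("\0").filter((path) => path.length > 0);
}

// `runGit` is injected so the sketch does not depend on a specific process API.
function listTrackedFiles(runGit: (args: string[]) => string): string[] {
  return parseLsFiles(runGit(["ls-files", "-z"]));
}

// In Node.js, a runner could be built on child_process.execFileSync, e.g.:
//   import { execFileSync } from "node:child_process";
//   const runGit = (args: string[]) =>
//     execFileSync("git", args, { cwd: "./project", encoding: "utf8" });

// Stand-in runner with canned output, used here in place of a real repository.
const fakeRunGit = (_args: string[]): string => "src/index.ts\0README.md\0";
const tracked = listTrackedFiles(fakeRunGit);
// tracked is ["src/index.ts", "README.md"]
```

Outside a git repository the real command exits with an error, which is why the tool cannot run there.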
License
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
© 2024 Geoff Seemueller