GitHub Docs Vectorizer
A Node.js tool that processes Markdown files from GitHub repositories, generates embeddings, and stores them in an Upstash Vector database. It is well suited to building document search systems, AI-driven documentation assistants, or knowledge bases.
Features
- Clones any GitHub repository
- Recursively finds all Markdown (.md) and MDX (.mdx) files
- Chunks documents using LangChain's RecursiveCharacterTextSplitter for better text segmentation
- Supports both OpenAI and Upstash embeddings
- Stores document chunks and metadata in Upstash Vector for retrieval
- Cleans up temporary files automatically
- Preserves file metadata for better context during retrieval
Prerequisites
- Node.js (v16 or higher) installed on your machine
- NPM or Yarn for package management
- GitHub personal access token (required for repository access)
- Upstash Vector database account (to store vectors)
- OpenAI API key (optional, for generating embeddings)
How to Find Your GitHub Token
- Go to GitHub.com and sign in to your account
- Click on your profile picture in the top-right corner
- Go to Settings > Developer settings > Personal access tokens > Tokens (classic)
- Click Generate new token > Generate new token (classic)
- Give your token a descriptive name in the "Note" field
- Select the following scopes:
  - repo (Full control of private repositories)
  - read:org (Read organization data)
- Click Generate token
- Important: Copy the token immediately and store it securely. You won't be able to see it again!
Note: If you're only accessing public repositories, you can create a token with just the public_repo scope instead of the full repo scope.
For security best practices:
- Never commit your token to version control
- Use environment variables or secure secret management
- Set an expiration date for your token
- Only grant the minimum required permissions
Installation Guide
- Clone the repository or create a new directory:
mkdir github-docs-vectorizer
cd github-docs-vectorizer
- Ensure the following files are included in your directory:
  - script.js: The main script for processing
  - package.json: Manages project dependencies
  - .env: Contains your environment variables (explained below)
- Install dependencies:
npm install
- Set up a .env file in the root directory of your project with your credentials:
# Required for accessing GitHub repositories
GITHUB_TOKEN=your_github_token
# Required for storing vectors in Upstash
UPSTASH_VECTOR_REST_URL=your_upstash_vector_url
UPSTASH_VECTOR_REST_TOKEN=your_upstash_vector_token
# Optional: Provide if using OpenAI embeddings
OPENAI_API_KEY=your_openai_api_key
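Since missing credentials are a likely source of failures, it can help to check the environment before any work starts. The following is a minimal sketch of such a check, not the code in script.js: `checkEnv` and `REQUIRED_VARS` are hypothetical names, and the sketch assumes the .env values have already been loaded into `process.env` (for example by a package such as dotenv).

```javascript
// Hypothetical helper: report which required variables are missing.
// The variable names are taken from the .env example above.
const REQUIRED_VARS = [
  "GITHUB_TOKEN",
  "UPSTASH_VECTOR_REST_URL",
  "UPSTASH_VECTOR_REST_TOKEN",
];

function checkEnv(env = process.env) {
  // A variable counts as missing when it is unset or blank.
  const missing = REQUIRED_VARS.filter(
    (name) => !env[name] || env[name].trim() === ""
  );
  return { ok: missing.length === 0, missing };
}

const result = checkEnv({ GITHUB_TOKEN: "example-token" });
console.log(result.missing);
// [ 'UPSTASH_VECTOR_REST_URL', 'UPSTASH_VECTOR_REST_TOKEN' ]
```

OPENAI_API_KEY is deliberately left out of the required list because it is optional.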
Usage
Run the script by providing the GitHub repository URL as an argument:
node script.js https://github.com/username/repository
Example:
node script.js https://github.com/facebook/react
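To illustrate what a valid repository URL argument looks like, here is a rough sketch of how one could be validated and split into owner and repository name. `parseRepoUrl` is a hypothetical helper, and the rule that only the github.com host is accepted is an assumption; script.js may parse its argument differently.

```javascript
// Hypothetical helper: extract owner and repository name from a GitHub URL.
function parseRepoUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    throw new Error(`Not a valid URL: ${input}`);
  }
  // Assumption: only the github.com host is accepted.
  if (url.hostname !== "github.com") {
    throw new Error(`Expected a github.com URL, got host "${url.hostname}"`);
  }
  // The path looks like /owner/repo or /owner/repo.git, possibly with a trailing slash.
  const [owner, repo] = url.pathname.split("/").filter(Boolean);
  if (!owner || !repo) {
    throw new Error(`URL does not name a repository: ${input}`);
  }
  return { owner, repo: repo.replace(/\.git$/, "") };
}

console.log(parseRepoUrl("https://github.com/facebook/react"));
// { owner: 'facebook', repo: 'react' }
```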
The script will:
- Clone the specified repository
- Find all Markdown and MDX files
- Split content into chunks
- Generate embeddings (using either OpenAI or Upstash)
- Store the chunks in your Upstash Vector database
- Clean up temporary files
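The file discovery step can be sketched with Node's built-in fs module. This is one possible approach, not necessarily the one in script.js; `findMarkdownFiles` is a hypothetical name, and skipping the .git directory is an assumption made to avoid walking Git internals.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: collect .md and .mdx files under a directory.
function findMarkdownFiles(dir, found = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Assumption: Git internals are skipped; they hold no documentation.
      if (entry.name !== ".git") findMarkdownFiles(fullPath, found);
    } else if (/\.mdx?$/i.test(entry.name)) {
      found.push(fullPath);
    }
  }
  return found;
}
```

Calling `findMarkdownFiles` with the path of the cloned repository returns an array of file paths, each of which can then be read and passed to the text splitter.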
Configuration
Embedding Options
Supported Embedding Providers
- OpenAI Embeddings (default if API key is provided)
  - Requires OPENAI_API_KEY in .env
  - Uses OpenAI's text-embedding-ada-002 model
- Upstash Embeddings (used when OpenAI API key is not provided)
  - No additional configuration needed
  - Uses Upstash's built-in embedding service
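The selection rule described above amounts to a single check on the environment. A minimal sketch, assuming a blank key is treated the same as a missing one; `chooseEmbeddingProvider` and its return values are hypothetical and only illustrate the decision:

```javascript
// Hypothetical helper: pick the embedding provider from the environment.
function chooseEmbeddingProvider(env = process.env) {
  // Assumption: a blank key is treated the same as a missing key.
  const key = (env.OPENAI_API_KEY || "").trim();
  return key ? "openai" : "upstash";
}

console.log(chooseEmbeddingProvider({ OPENAI_API_KEY: "example-key" })); // openai
console.log(chooseEmbeddingProvider({})); // upstash
```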
Customizing Document Chunking
To adjust how documents are split into chunks, update the configuration in script.js:
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // maximum number of characters per chunk
  chunkOverlap: 200, // characters shared between consecutive chunks
});
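To build intuition for how chunkSize and chunkOverlap interact, the sketch below uses a simplified fixed-window splitter. It is not LangChain's recursive algorithm, which also tries to break on separators such as paragraphs and sentences; `splitWithOverlap` is a hypothetical function written only for this illustration.

```javascript
// Simplified fixed-window splitter, for illustration only.
// Each window starts (chunkSize - chunkOverlap) characters after the previous one.
function splitWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const chunks = splitWithOverlap("a".repeat(2500), 1000, 200);
console.log(chunks.map((c) => c.length));
// [ 1000, 1000, 900 ]
```

With these settings each new chunk starts 800 characters after the previous one, so a 2,500-character document yields three chunks. Raising chunkOverlap preserves more context across chunk boundaries at the cost of storing more vectors.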
Metadata
Metadata accompanies each stored chunk for improved context:
- Original file name
- File type (Markdown or MDX)
- Relative file path in the repository
- Document source for the specific chunk of text
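As an illustration of the fields listed above, a metadata object for one file might be assembled as follows. The key names (`fileName`, `fileType`, `relativePath`, `source`) and the choice of the repository URL as the source are assumptions; the keys actually written by script.js may differ.

```javascript
const path = require("node:path");

// Hypothetical helper: build the metadata stored alongside a chunk.
// Key names are illustrative; script.js may use different ones.
function buildMetadata(repoRoot, filePath, repoUrl) {
  const ext = path.extname(filePath).toLowerCase();
  return {
    fileName: path.basename(filePath),
    fileType: ext === ".mdx" ? "mdx" : "markdown",
    // Forward slashes keep the path stable across operating systems.
    relativePath: path.relative(repoRoot, filePath).split(path.sep).join("/"),
    // Assumption: the repository URL serves as the document source.
    source: repoUrl,
  };
}

console.log(
  buildMetadata(
    "repo",
    path.join("repo", "docs", "intro.mdx"),
    "https://github.com/username/repository"
  )
);
```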
Error Handling
The script is designed to handle errors gracefully in the following cases:
- Invalid repository URLs provided
- Missing or incorrect credentials
- Unable to access or read the required files
- Network or connectivity problems
In case of errors, the script will:
- Log the error message
- Clean up any temporary files
- Exit with a non-zero status code
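The log, clean up, and exit sequence above can be sketched with try/finally. This is one possible structure under stated assumptions, not the code in script.js: `withTempDir` and `main` are hypothetical names, and the temporary directory prefix is made up for the example.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical wrapper: run a task in a temporary directory and always
// remove that directory, whether the task succeeds or throws.
async function withTempDir(task) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "docs-vectorizer-"));
  try {
    return await task(dir);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}

// Hypothetical entry point: log the error and signal failure to the shell.
async function main(task) {
  try {
    await withTempDir(task);
  } catch (error) {
    console.error(`Error: ${error.message}`);
    process.exitCode = 1; // non-zero status without cutting off pending output
  }
}
```

Setting `process.exitCode` instead of calling `process.exit()` lets Node finish writing the log output before the process ends.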
Contributing
Feel free to submit issues and enhancement requests!
License
MIT License - feel free to use this tool for any purpose.
Credits
This tool uses the following open-source packages:
- LangChain: Handles document processing and vector store integration
- Octokit: Facilitates interactions with the GitHub API
- simple-git: Manages operations on Git repositories
- Upstash Vector: Enables seamless storage and retrieval of document vectors