GitHub Docs Vectorizer
A Node.js tool that processes Markdown files from GitHub repositories, generates embeddings, and stores them in an Upstash Vector database. It is suited to building document search systems, AI-driven documentation assistants, or knowledge bases.
Features
- Recursively find all Markdown (.md) and MDX (.mdx) files in any GitHub repository
- Chunk documents using LangChain's RecursiveCharacterTextSplitter for better text segmentation
- Supports both OpenAI and Upstash embeddings
- Stores document chunks and metadata in Upstash Vector for enhanced retrieval
Prerequisites
- Node.js (v16 or higher)
- NPM or Yarn for package management
- GitHub personal access token (required for repository access)
- Upstash Vector database account (to store vectors)
- OpenAI API key (optional, for generating embeddings)
How to Find Your GitHub Token
- Go to GitHub.com and sign in to your account
- Click on your profile picture in the top-right corner
- Go to Settings > Developer settings > Personal access tokens > Tokens (classic)
- Click Generate new token > Generate new token (classic)
- Give your token a descriptive name in the "Note" field
- Select the following scopes:
  - repo (Full control of private repositories)
  - read:org (Read organization data)
- Click Generate token
Installation Guide
- Clone the repository or create a new directory:
mkdir github-docs-vectorizer
cd github-docs-vectorizer
- Ensure the following files are included in your directory:
  - script.js: The main script for processing
  - package.json: Manages project dependencies
  - .env: Contains your environment variables (explained below)
- Install dependencies:
npm install @upstash/docs2vector
- Set up a .env file in the root directory of your project with your credentials:
# Required for accessing GitHub repositories
GITHUB_TOKEN=your_github_token
# Required for storing vectors in Upstash
UPSTASH_VECTOR_REST_URL=your_upstash_vector_url
UPSTASH_VECTOR_REST_TOKEN=your_upstash_vector_token
# Optional: Provide if using OpenAI embeddings
OPENAI_API_KEY=your_openai_api_key
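Before running the script, it can help to confirm that the required variables are actually set. The following is a minimal sketch, not part of the tool itself: findMissingEnv is a hypothetical helper name, and the list of required variables is taken from the .env example above.

```javascript
// Hypothetical helper (not part of @upstash/docs2vector): returns the names
// of required variables that are unset or blank in the given environment.
function findMissingEnv(
  env,
  required = ['GITHUB_TOKEN', 'UPSTASH_VECTOR_REST_URL', 'UPSTASH_VECTOR_REST_TOKEN']
) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

// OPENAI_API_KEY is optional, so it is deliberately not checked here.
const missing = findMissingEnv(process.env);
if (missing.length > 0) {
  console.warn(`Missing required environment variables: ${missing.join(', ')}`);
}
```

Passing the environment as an argument, rather than reading process.env inside the function, keeps the check easy to exercise with a plain object.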
Usage
Run the script by providing the GitHub repository URL as an argument:
node script.js https://github.com/username/repository
Example:
node script.js https://github.com/facebook/react
The script will:
- Clone the specified repository
- Find all Markdown and MDX files
- Split content into chunks
- Generate embeddings (using either OpenAI or Upstash)
- Store the chunks in your Upstash Vector database
- Clean up temporary files
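The file-discovery step in the list above can be pictured as a filter over repository-relative paths. This is an illustrative sketch only: isDocFile and the sample paths are hypothetical, and the case-insensitive extension match is an assumption rather than documented behavior of the tool.

```javascript
// Extensions the tool is described as processing.
const DOC_EXTENSIONS = ['.md', '.mdx'];

// Hypothetical helper: true when the path ends in .md or .mdx.
// Matching case-insensitively is an assumption made for this sketch.
function isDocFile(filePath) {
  const lower = filePath.toLowerCase();
  return DOC_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Sample paths invented for illustration.
const paths = ['README.md', 'docs/guide.mdx', 'src/index.js', 'docs/img/logo.png'];
const docs = paths.filter(isDocFile);
console.log(docs);
```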
Configuration
Embedding Options
Supported Embedding Providers
- OpenAI Embeddings (default if API key is provided)
  - Requires OPENAI_API_KEY in .env
  - Uses OpenAI's text-embedding-ada-002 model
- Upstash Embeddings (used when OpenAI API key is not provided)
  - No additional configuration needed
  - Uses Upstash's built-in embedding service
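The selection rule described above comes down to a single check on OPENAI_API_KEY. A minimal sketch of that rule follows; chooseEmbeddingProvider and the returned strings are hypothetical names used for illustration, and treating a blank key as absent is an assumption.

```javascript
// Hypothetical helper: picks the embedding provider from the environment.
// A blank key is treated the same as a missing one (an assumption).
function chooseEmbeddingProvider(env) {
  const key = env.OPENAI_API_KEY;
  return key && key.trim() !== '' ? 'openai' : 'upstash';
}

console.log(`Embedding provider: ${chooseEmbeddingProvider(process.env)}`);
```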
Customizing Document Chunking
To adjust how documents are split into chunks, update the configuration in script.js:
const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // maximum number of characters per chunk
  chunkOverlap: 200  // characters shared between consecutive chunks
});
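To see how the two settings interact, here is a simplified fixed-window splitter. It is not LangChain's algorithm (RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences); chunkWithOverlap is a hypothetical function that only illustrates the size and overlap arithmetic.

```javascript
// Simplified illustration: slide a window of chunkSize characters forward by
// (chunkSize - chunkOverlap) each step, so neighbouring chunks share
// chunkOverlap characters.
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const chunks = chunkWithOverlap('abcdefghij', 4, 2);
console.log(chunks); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

A larger overlap preserves more context across chunk boundaries at the cost of storing more chunks per document.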
SDK
npm install @upstash/docs2vector dotenv
import Docs2Vector from '@upstash/docs2vector';
import dotenv from 'dotenv';
dotenv.config();
async function main() {
  try {
    const githubRepoUrl = 'YOUR_GITHUB_URL';
    console.log(`Starting processing for the repository: ${githubRepoUrl}`);
    const converter = new Docs2Vector();
    await converter.run(githubRepoUrl);
    console.log(`Successfully processed repository: ${githubRepoUrl}`);
    console.log('Vectors stored in Upstash Vector database.');
  } catch (error) {
    console.error('An error occurred while processing the repository:', error.message);
  }
}
main();
Metadata
Metadata accompanies each stored chunk for improved context:
- Original file name
- File type (Markdown or MDX)
- Relative file path in the repository
- Document source for the specific chunk of text
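As a rough picture of what such a record might look like, the sketch below builds a metadata object from a repository-relative path. The field names (fileName, fileType, filePath, source) and the helper buildChunkMetadata are hypothetical; the actual keys written by the tool may differ.

```javascript
// Hypothetical helper: derives the metadata fields listed above from a
// repository-relative path. Field names are assumptions for illustration.
function buildChunkMetadata(relativePath, source) {
  const fileName = relativePath.split('/').pop();
  const fileType = fileName.toLowerCase().endsWith('.mdx') ? 'mdx' : 'markdown';
  return { fileName, fileType, filePath: relativePath, source };
}

// Sample values invented for illustration.
const metadata = buildChunkMetadata(
  'docs/guide/intro.mdx',
  'https://github.com/username/repository'
);
console.log(metadata);
```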
Error Handling
The script is designed to handle errors gracefully in the following cases:
- Invalid repository URLs provided
- Missing or incorrect credentials
- Unable to access or read the required files
- Connectivity or network-related problems
In case of errors, the script will:
- Log the error message
- Clean up any temporary files
- Exit with a non-zero status code
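The log, clean up, and exit sequence above maps naturally onto try/catch/finally. The following is a sketch of that pattern under stated assumptions, not the script's actual code: runWithCleanup is a hypothetical wrapper, and the task and cleanup callbacks stand in for the real processing and temporary-file removal.

```javascript
// Hypothetical wrapper: runs the task, logs any error, always runs cleanup,
// and resolves to the status code the process should exit with.
async function runWithCleanup(task, cleanup) {
  try {
    await task();
    return 0;
  } catch (error) {
    console.error(`Processing failed: ${error.message}`);
    return 1;
  } finally {
    // Runs on both success and failure, before the returned promise settles.
    await cleanup();
  }
}

runWithCleanup(
  async () => { /* clone, chunk, embed, and store would go here */ },
  async () => { console.log('Temporary files removed.'); }
).then((code) => {
  process.exitCode = code;
});
```

Setting process.exitCode instead of calling process.exit() lets pending output flush before Node.js terminates.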
Contributing
Feel free to submit issues and enhancement requests!
License
MIT License - feel free to use this tool for any purpose.
Credits
This tool uses the following open-source packages:
- LangChain: Handles document processing and vector store integration
- Octokit: Facilitates interactions with the GitHub API
- simple-git: Manages operations on Git repositories
- Upstash Vector: Enables seamless storage and retrieval of document vectors