
grepctl is a command-line and programmatic utility that enables semantic search across heterogeneous data lakes. By leveraging Google Cloud's advanced AI services and BigQuery's vector search capabilities, grepctl transforms unstructured data into a semantically searchable index. We describe the data ingestion pipeline, multimodal processing architecture, and the multiple interfaces—CLI, Web, Python, and SQL—that make this system both powerful and accessible.
grepctl processes 9 different data types automatically:
| Modality | Processing Method |
|---|---|
| Text/Markdown | Direct content extraction, preserving structure |
| PDF | OCR via Google Document AI for text extraction |
| Office Documents | Document AI extracts content from .docx, .xlsx, .pptx |
| Images | Vision API extracts labels, text, objects, and faces |
| Audio | Speech-to-Text API transcribes to searchable text |
| Video | Video Intelligence API analyzes frames and transcribes speech |
| JSON/CSV | Structured data parsing with field preservation |
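As a rough illustration of how files might be routed to the processing methods in the table, the sketch below maps a file extension to a modality label. The `EXTENSION_TO_MODALITY` table and the `detect_modality` helper are hypothetical names chosen for this example; grepctl's actual detection logic and supported extensions may differ.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-modality table (an assumption for illustration;
# not taken from grepctl's source).
EXTENSION_TO_MODALITY = {
    ".txt": "text",
    ".md": "markdown",
    ".pdf": "pdf",
    ".docx": "office",
    ".xlsx": "office",
    ".pptx": "office",
    ".png": "image",
    ".jpg": "image",
    ".mp3": "audio",
    ".wav": "audio",
    ".mp4": "video",
    ".json": "json",
    ".csv": "csv",
}


def detect_modality(uri: str) -> str:
    """Return a modality label for an object URI, or 'unknown' if unmapped."""
    # Only the final path component's suffix matters; compare case-insensitively.
    suffix = PurePosixPath(uri).suffix.lower()
    return EXTENSION_TO_MODALITY.get(suffix, "unknown")
```

A caller could use the returned label to pick the extraction step, for example sending `"pdf"` objects to OCR and `"audio"` objects to transcription.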
grepctl supports nine data modalities, including text, PDFs, office documents, images, audio, video, and structured JSON/CSV files. Each modality undergoes tailored extraction and processing steps, such as OCR for scanned documents and transcription for audio/video. All processed content is chunked and embedded into a 768-dimensional vector space.
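The chunking step mentioned above can be pictured with a minimal sketch. It assumes fixed-size character windows with overlap; the `chunk_text` name, the default sizes, and the character-based strategy are all assumptions for illustration, since the source does not describe how grepctl splits content.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be sent to the embedding model separately, so that a long document yields several vectors rather than one.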
grepctl ingest -b <bucket>
Access your indexed data through multiple interfaces:
CLI - Command-line search:
grepctl search "your query"
Web Interface - Interactive UI:
grepctl serve
Python Interface - Programmatic access:
from grepctl.search.vector_search import SemanticSearch
# `searcher` is a SemanticSearch instance; its constructor arguments are not shown here.
results = searcher.search("query", top_k=10)
SQL Interface - Direct BigQuery queries:
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `project.mmgrep.text_embedding_model`,
(SELECT 'your search string' AS content)
)
)
-- VECTOR_SEARCH returns matched rows under the `base` struct
SELECT base.doc_id, base.text_content, distance AS score
FROM VECTOR_SEARCH(
TABLE `project.mmgrep.search_corpus`,
'embedding',
(SELECT embedding FROM query_embedding),
top_k => 10
)
All interfaces leverage BigQuery's VECTOR_SEARCH for sub-second semantic search across your entire data lake.
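To make the idea of vector search concrete, here is a brute-force sketch of nearest-neighbor ranking in plain Python. It is not how BigQuery implements VECTOR_SEARCH; it only shows the ranking principle. The cosine metric is an assumption (the query above does not set a distance type), and `cosine_distance` and `top_k` are hypothetical helper names.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 10):
    """Rank every corpus vector by distance to the query and keep the k closest."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = closer match
    return scored[:k]
```

With 768-dimensional embeddings the principle is the same; a managed service avoids scanning every row by using an index.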
Each processing pipeline outputs structured text that is then embedded using Vertex AI's text-embedding-004 model, creating 768-dimensional vectors optimized for semantic similarity search.
The search functions are automatically created when you run:
grepctl setup
This creates the following functions in your BigQuery dataset:
search(query) - Simple search with defaults
semantic_search(query, top_k, min_relevance) - Full control search
search_by_source(query, sources, top_k) - Filter by file types
search_by_date(query, start_date, end_date, top_k) - Date range search
search_content(query, limit) - Just return content strings
-- Simple search (defaults: top_k=10, min_relevance=0.0)
CALL `your-project.grepmm.search`("your query");
-- Full semantic search
CALL `your-project.grepmm.semantic_search`(
"query text", -- Search query
20, -- Number of results
0.7 -- Minimum relevance (0-1)
);
-- Returns:
-- doc_id, uri, source, modality, text_content,
-- relevance_score, created_at, metadata
-- Search by source types
CALL `your-project.grepmm.search_by_source`(
"query",
["pdf", "markdown"], -- Array of sources
10 -- Top K results
);
-- Search by date range
CALL `your-project.grepmm.search_by_date`(
"query",
DATE('2024-01-01'), -- Start date
CURRENT_DATE(), -- End date
15 -- Top K results
);
-- Get just content
CALL `your-project.grepmm.search_content`(
"query",
5 -- Limit
);
The functions handle embedding generation and vector search internally, so you only write simple SQL queries!
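The `top_k` and `min_relevance` parameters of `semantic_search` can be read as "drop weak matches, then keep the best N". The sketch below shows that reading on in-memory rows. It is an assumption about the semantics, not the body of the BigQuery function, and `semantic_filter` is a hypothetical name.

```python
def semantic_filter(rows: list[dict], top_k: int = 10, min_relevance: float = 0.0) -> list[dict]:
    """Keep rows at or above min_relevance, best first, limited to top_k."""
    if not 0.0 <= min_relevance <= 1.0:
        raise ValueError("min_relevance must be between 0 and 1")
    kept = [row for row in rows if row["relevance_score"] >= min_relevance]
    # Highest relevance first, mirroring a ranked result set.
    kept.sort(key=lambda row: row["relevance_score"], reverse=True)
    return kept[:top_k]
```

Under this reading, the defaults (`top_k=10`, `min_relevance=0.0`) return the ten best matches without discarding any for low relevance.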
# Using uv (recommended)
uv add grepctl
# Using pip
pip install grepctl
# For development
git clone <repo>
cd bq-semgrep
uv sync
The SearchClient will automatically use your existing grepctl configuration from ~/.grepctl/config.yaml:
project_id: your-project
dataset_name: grepmm
location: us-central1
Or you can specify a custom config path:
client = SearchClient(config_path="/path/to/config.yaml")
Or override the project ID:
client = SearchClient(project_id="my-project-id")
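The precedence described above (values from the config file, with an explicit `project_id` argument taking priority) can be sketched as follows. This is an illustration only: `parse_flat_config` and `resolve_settings` are hypothetical helpers, and the parser handles just flat `key: value` lines like the example config rather than full YAML.

```python
from typing import Optional


def parse_flat_config(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines; a stand-in for a real YAML parser."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip().strip("'\"")
    return config


def resolve_settings(config_text: str, project_id: Optional[str] = None) -> dict[str, str]:
    """Start from the file's values, then let an explicit project_id override."""
    settings = parse_flat_config(config_text)
    if project_id is not None:
        settings["project_id"] = project_id
    return settings
```

Only `project_id` is overridden in this sketch; `dataset_name` and `location` still come from the file.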
# Full search with all options
results = client.search(
query="search text", # Search query
top_k=10, # Number of results
sources=['pdf', 'text'], # Filter by source types
rerank=False, # Use LLM reranking
regex_filter=r"pattern", # Regex filter
start_date="2023-01-01", # Date range start
end_date="2024-12-31" # Date range end
)
# Simple search - just returns content strings
contents = client.search_simple("query", limit=5)
# Get system statistics
stats = client.get_stats()
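The filter options accepted by `client.search` (`sources`, `regex_filter`, `start_date`, `end_date`) can be pictured as successive predicates over result rows. The sketch below applies them to in-memory dictionaries. It is a hypothetical illustration: the row fields follow the column names listed earlier, inclusive date bounds are an assumption, and the real client may apply these filters inside BigQuery instead.

```python
import re
from datetime import date
from typing import Optional


def apply_filters(
    rows: list[dict],
    sources: Optional[list[str]] = None,
    regex_filter: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> list[dict]:
    """Keep rows that pass every filter that was supplied."""
    pattern = re.compile(regex_filter) if regex_filter else None
    start = date.fromisoformat(start_date) if start_date else None
    end = date.fromisoformat(end_date) if end_date else None

    kept = []
    for row in rows:
        if sources is not None and row["source"] not in sources:
            continue
        if pattern is not None and not pattern.search(row["text_content"]):
            continue
        # Date bounds are treated as inclusive (an assumption).
        if start is not None and row["created_at"] < start:
            continue
        if end is not None and row["created_at"] > end:
            continue
        kept.append(row)
    return kept
```

Omitting a filter leaves that dimension unrestricted, which matches the way the optional keyword arguments read in the call above.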
from grepctl import search
# Quick search without client
results = search("query", top_k=10, rerank=True)
You now have a powerful, simple Python API for semantic search across all your data. The SearchClient handles all the complexity of BigQuery connections, embedding models, and vector search - you just focus on building great applications!
FAQs
One-command orchestration for multimodal semantic search in BigQuery
We found that grepctl demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.