@ruvector/ruvllm - npm Package Compare versions

Comparing version 2.5.3 with 2.5.4

package.json (+14 -3)

		{
		"name": "@ruvector/ruvllm",
		"version": "2.5.3",
		"description": "Self-learning LLM orchestration with SONA adaptive learning, HNSW memory, FastGRNN routing, and SIMD inference",
		"version": "2.5.4",
		"description": "Self-learning LLM runtime — TurboQuant KV-cache (6-8x compression), SONA adaptive learning, FlashAttention, speculative decoding, GGUF inference",
		"main": "dist/cjs/index.js",
		@@ -95,3 +95,14 @@ "module": "dist/esm/index.js",
		"rust",
		"ruvector"
		"ruvector",
		"turboquant",
		"kv-cache",
		"quantization",
		"flash-attention",
		"speculative-decoding",
		"gguf",
		"mamba",
		"transformer",
		"edge-ai",
		"local-llm",
		"model-compression"
		],
		@@ -98,0 +109,0 @@ "author": "rUv Team <team@ruv.io>",

README.md (+37 -5)

		@@ -1,5 +0,12 @@
		# @ruvector/ruvllm v2.3
		# @ruvector/ruvllm

		Self-learning LLM orchestration with SONA adaptive learning, HNSW memory, and SIMD inference for Node.js.
		[![npm version](https://img.shields.io/npm/v/@ruvector/ruvllm.svg)](https://www.npmjs.com/package/@ruvector/ruvllm)
		[![Downloads](https://img.shields.io/npm/dm/@ruvector/ruvllm)](https://www.npmjs.com/package/@ruvector/ruvllm)
		[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
		[![GitHub Stars](https://img.shields.io/github/stars/ruvnet/ruvector?style=social)](https://github.com/ruvnet/ruvector)

		Self-learning LLM runtime for Node.js — GGUF inference, TurboQuant KV-cache compression (6-8x memory savings), SONA adaptive learning, FlashAttention, speculative decoding, and SIMD-optimized kernels. Built in Rust, runs everywhere.

		> Inference at 88-135 tok/s on M4 Pro | <1ms SONA adaptation | 6-8x KV-cache compression via TurboQuant

		## Installation
		@@ -37,6 +44,10 @@

		## What's New in v2.3
		## What's New in v2.5

		| Feature | Description |
		|---------|-------------|
		| TurboQuant KV-Cache | 2-4 bit asymmetric quantization with per-channel scale/zero-point — 6-8x memory reduction, <0.5% perplexity loss |
		| TurboQuant Embedding Store | Quantized vector storage with compressed search — 10-30x memory savings |
		| H2O / PyramidKV Eviction | Intelligent cache eviction policies for long-context inference |
		| Optimized Inner Product | Asymmetric distance on quantized data — skip decompression for 2-4x faster search |
		| RuvLTRA Models | Purpose-built 0.5B & 3B models for Claude Flow |
		@@ -48,5 +59,26 @@ | Task-Specific LoRA | 5 pre-trained adapters (coder, researcher, security, architect, reviewer) |
		| Evaluation Harness | SWE-Bench testing with 5 ablation modes |
		| Auto-Dimension | HNSW auto-detects model embedding size |
		| mistral-rs Backend | Production serving with PagedAttention, X-LoRA, ISQ (5-10x concurrent users) |
		| mistral-rs Backend | Production serving with PagedAttention, X-LoRA, ISQ |
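
The H2O / PyramidKV row above names eviction policies without showing what they do. As a rough illustration only, the sketch below implements the core H2O ("heavy hitter") idea for a single attention head: always keep a window of the most recent tokens and, once the cache exceeds its budget, evict the older token with the smallest accumulated attention. `HeavyHitterCache`, `step`, `budget`, and `recentWindow` are hypothetical names, not part of @ruvector/ruvllm; PyramidKV's per-layer budgets and the sliding-window policy are not modeled.

```typescript
// Hypothetical H2O-style ("heavy hitter") eviction for one attention head.
// Not the @ruvector/ruvllm implementation; all names are illustrative.
class HeavyHitterCache {
  private readonly budget: number;       // max tokens kept in the cache
  private readonly recentWindow: number; // newest tokens that are never evicted
  private positions: number[] = [];      // cached token positions, oldest first
  private scores = new Map<number, number>(); // accumulated attention per position

  constructor(budget: number, recentWindow: number) {
    if (recentWindow < 0 || recentWindow > budget) {
      throw new RangeError("recentWindow must be in [0, budget]");
    }
    this.budget = budget;
    this.recentWindow = recentWindow;
  }

  // Add the token at `position`. `attention` holds the current query's weights over
  // the cached tokens (oldest first), followed by its weight on the new token itself.
  step(position: number, attention: number[]): void {
    if (attention.length !== this.positions.length + 1) {
      throw new RangeError("attention must cover every cached token plus the new one");
    }
    this.positions.push(position);
    this.positions.forEach((p, i) => {
      this.scores.set(p, (this.scores.get(p) ?? 0) + attention[i]);
    });
    if (this.positions.length > this.budget) this.evictOne();
  }

  // Evict the lowest-scoring token outside the protected recent window.
  private evictOne(): void {
    const candidates = this.positions.length - this.recentWindow;
    let victim = 0;
    for (let i = 1; i < candidates; i++) {
      if (this.score(i) < this.score(victim)) victim = i;
    }
    const [removed] = this.positions.splice(victim, 1);
    this.scores.delete(removed);
  }

  private score(index: number): number {
    return this.scores.get(this.positions[index]) ?? 0;
  }

  cached(): number[] {
    return [...this.positions];
  }
}

// Usage: budget of 4 tokens, the 2 newest always protected.
const cache = new HeavyHitterCache(4, 2);
cache.step(0, [1]);
cache.step(1, [0.7, 0.3]);
cache.step(2, [0.6, 0.1, 0.3]);
cache.step(3, [0.5, 0.1, 0.1, 0.3]);
cache.step(4, [0.5, 0.05, 0.2, 0.05, 0.2]);
console.log(cache.cached());
```

At the fifth step position 1 is evicted, while position 3 survives despite a lower accumulated score because it sits inside the two-token recent window. Protecting that window keeps the newest tokens, whose scores have had little time to accumulate, from being evicted immediately.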

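The TurboQuant Embedding Store and Optimized Inner Product rows describe scoring quantized vectors without decompressing them. A minimal sketch of that general idea, assuming each stored vector is quantized to unsigned b-bit codes with one scale and one offset (so `x[i] ≈ offset + scale * codes[i]`): the query stays in full precision and the dot product expands to `offset * Σq[i] + scale * Σ(q[i] * codes[i])`, where `Σq[i]` is computed once per query. `QuantizedVector`, `quantizeVector`, `sum`, and `asymmetricDot` are hypothetical names rather than the ruvllm API, and per-vector (not per-channel) parameters are a simplification.

```typescript
// Hypothetical per-vector quantized storage: x[i] ≈ offset + scale * codes[i].
interface QuantizedVector {
  codes: Uint8Array; // one unsigned code per dimension, in [0, 2^bits - 1]
  scale: number;
  offset: number;
}

function quantizeVector(x: number[], bits: number): QuantizedVector {
  if (!Number.isInteger(bits) || bits < 1 || bits > 8) {
    throw new RangeError("bits must be an integer in [1, 8]");
  }
  const levels = (1 << bits) - 1;
  const min = Math.min(...x);
  const max = Math.max(...x);
  const scale = max > min ? (max - min) / levels : 1; // constant vector: any step works
  const codes = new Uint8Array(x.length);
  x.forEach((v, i) => {
    codes[i] = Math.min(levels, Math.round((v - min) / scale));
  });
  return { codes, scale, offset: min };
}

// Query-side constant, computed once and reused for every stored vector.
function sum(q: number[]): number {
  return q.reduce((acc, v) => acc + v, 0);
}

// Asymmetric inner product: full-precision query against integer codes.
// Equals dot(q, dequantized x) but never materializes the float vector.
function asymmetricDot(q: number[], querySum: number, v: QuantizedVector): number {
  if (q.length !== v.codes.length) throw new RangeError("dimension mismatch");
  let acc = 0;
  for (let i = 0; i < q.length; i++) acc += q[i] * v.codes[i];
  return v.offset * querySum + v.scale * acc;
}

// Usage: score one query against a small 4-bit quantized store.
const query = [0.5, -1, 2, 0.25];
const store = [
  [0.1, 0.4, -0.3, 0.9],
  [1.0, -0.5, 0.2, 0.0],
].map((x) => quantizeVector(x, 4));
const qSum = sum(query);
const scores = store.map((v) => asymmetricDot(query, qSum, v));
console.log(scores);
```

Because the query-dependent sum is shared, each stored vector costs one pass over its integer codes plus two multiply-adds, with no dequantized buffer.
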
		## TurboQuant — KV-Cache Compression

		Reduce inference memory by 6-8x with <0.5% quality loss:

		```typescript
		import { simd } from '@ruvector/ruvllm/simd';

		// TurboQuant compresses KV-cache entries at 2-4 bit precision
		// with per-channel asymmetric quantization (scale + zero-point).
		// Eviction policies (H2O, Sliding Window, PyramidKV) keep the
		// most important tokens in cache during long-context generation.

		// Supported bit widths: 2-bit (32x), 3-bit (10.7x), 4-bit (8x), 8-bit (4x)
		```
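
The comments in the block above describe per-channel asymmetric quantization (scale + zero-point) only in words. The following is a minimal, hypothetical sketch of that general technique, not TurboQuant's actual code: a block of cached vectors is laid out as tokens × channels, each channel gets its own step size and zero-point, values are rounded to b-bit codes, and dequantization maps them back. Here the zero-point is stored as the channel minimum (one of several common conventions) and codes are kept one per byte; a real cache would bit-pack them to realize the memory savings. `QuantizedKVBlock`, `quantizePerChannel`, and `dequantize` are illustrative names.

```typescript
// Hypothetical container for one quantized KV block (not the ruvllm API).
// Layout is row-major: rows = tokens, cols = channels (head dimensions).
interface QuantizedKVBlock {
  rows: number;
  cols: number;
  bits: number;
  codes: Uint8Array;   // one code per value; a real cache would bit-pack these
  scale: Float64Array; // per-channel step size
  zero: Float64Array;  // per-channel zero-point, stored here as the channel minimum
}

function quantizePerChannel(data: ArrayLike<number>, rows: number, cols: number, bits: number): QuantizedKVBlock {
  if (!Number.isInteger(bits) || bits < 1 || bits > 8) {
    throw new RangeError("bits must be an integer in [1, 8]");
  }
  if (data.length !== rows * cols) throw new RangeError("data.length must equal rows * cols");
  const levels = (1 << bits) - 1; // largest code, e.g. 15 for 4-bit
  const codes = new Uint8Array(rows * cols);
  const scale = new Float64Array(cols);
  const zero = new Float64Array(cols);
  for (let c = 0; c < cols; c++) {
    // Per-channel range: each column gets its own scale and zero-point.
    let min = Infinity;
    let max = -Infinity;
    for (let r = 0; r < rows; r++) {
      const v = data[r * cols + c];
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    const step = max > min ? (max - min) / levels : 1; // constant channel: any step works
    scale[c] = step;
    zero[c] = min;
    for (let r = 0; r < rows; r++) {
      const q = Math.round((data[r * cols + c] - min) / step);
      codes[r * cols + c] = Math.min(levels, Math.max(0, q)); // clamp guards float round-off
    }
  }
  return { rows, cols, bits, codes, scale, zero };
}

function dequantize(block: QuantizedKVBlock): Float64Array {
  const out = new Float64Array(block.rows * block.cols);
  for (let r = 0; r < block.rows; r++) {
    for (let c = 0; c < block.cols; c++) {
      const i = r * block.cols + c;
      out[i] = block.zero[c] + block.scale[c] * block.codes[i];
    }
  }
  return out;
}

// Usage: 3 tokens x 2 channels, quantized to 4 bits.
const kv = new Float32Array([0.5, -2.0, 1.5, 4.0, -1.0, 1.0]);
const block = quantizePerChannel(kv, 3, 2, 4);
const restored = dequantize(block);
console.log(Array.from(restored));
```

Each restored value lies within half a step of the original for its channel. The wide-range second column gets a coarser step than the first, which is the motivation for per-channel rather than per-tensor parameters.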

		| Bits | Compression | Perplexity Loss | Use Case |
		|------|-------------|-----------------|----------|
		| 2-bit | 32x | ~2% | Maximum compression, edge devices |
		| 3-bit | 10.7x | <1% | Balanced — recommended for most uses |
		| 4-bit | 8x | <0.5% | High quality, long-context inference |
		| 8-bit | 4x | ~0% | Baseline quantization |
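
The bit-width table can be checked with simple arithmetic. Ignoring scale/zero-point metadata, the ratio of b-bit codes to a baseline of B bits per value is B / b. The 3-bit (10.7x), 4-bit (8x), and 8-bit (4x) rows equal 32 / b, so they match an fp32 baseline; under the same formula a 2-bit cache comes to 16x rather than the 32x listed, and against an fp16 baseline every ratio halves. The sketch below uses hypothetical helper names (`compressionRatio`, `kvCacheBytes`) and a made-up model shape to compute these ratios, optionally charge per-group metadata, and size a KV cache as 2 (keys and values) × layers × KV heads × head dimension × tokens × bits / 8 bytes.

```typescript
// Ratio of baselineBits per value to b-bit codes. Optional metadata (e.g. one
// scale and one zero-point per group of `groupSize` values) is spread over the group.
function compressionRatio(
  baselineBits: number,
  bits: number,
  groupSize: number = Infinity,
  metadataBitsPerGroup: number = 0,
): number {
  const effectiveBits = bits + metadataBitsPerGroup / groupSize;
  return baselineBits / effectiveBits;
}

// Illustrative model shape; not tied to any specific RuvLTRA or GGUF model.
interface KVShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  tokens: number;
}

// Keys and values (factor 2) for every layer, KV head, dimension and token.
function kvCacheBytes(shape: KVShape, bitsPerValue: number): number {
  const values = 2 * shape.layers * shape.kvHeads * shape.headDim * shape.tokens;
  return (values * bitsPerValue) / 8;
}

for (const b of [2, 3, 4, 8]) {
  const vsFp32 = compressionRatio(32, b).toFixed(1);
  const vsFp16 = compressionRatio(16, b).toFixed(1);
  console.log(`${b}-bit: ${vsFp32}x vs fp32, ${vsFp16}x vs fp16`);
}

// 4-bit codes plus an fp16 scale and fp16 zero-point per 32 values: 5 effective bits.
console.log(compressionRatio(32, 4, 32, 32)); // 6.4

const shape: KVShape = { layers: 28, kvHeads: 4, headDim: 128, tokens: 32768 };
const gib = (bytes: number) => (bytes / 2 ** 30).toFixed(2);
console.log(`fp16: ${gib(kvCacheBytes(shape, 16))} GiB, 4-bit: ${gib(kvCacheBytes(shape, 4))} GiB`);
```

With an fp16 scale and zero-point per 32 values, 4-bit codes cost 5 bits per value, which is 6.4x against fp32 and 3.2x against fp16.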

		## CLI Usage
		@@ -53,0 +85,0 @@
