New Research: Supply Chain Attack on Axios Pulls Malicious Dependency from npm.Details →

Book a Demo Sign in

@qvac/llm-llamacpp

Package Overview

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

@qvac/llm-llamacpp

llama addon for qvac

latest

Source

npm

Version: 0.14.3

Version published: 4 days ago

Maintainers: 3

Created: 8 months ago

Source

qvac-lib-infer-llamacpp-llm

This native C++ addon, built using the Bare Runtime, simplifies running Large Language Models (LLMs) within QVAC runtime applications. It provides an easy interface to load, execute, and manage LLM instances.

Supported platforms

Platform	Architecture	Min Version	Status	GPU Support
macOS	arm64, x64	14.0+	✅ Tier 1	Metal
iOS	arm64	17.0+	✅ Tier 1	Metal
Linux	arm64, x64	Ubuntu-22+	✅ Tier 1	Vulkan
Android	arm64	12+	✅ Tier 1	Vulkan, OpenCL (Adreno 700+)
Windows	x64	10+	✅ Tier 1	Vulkan

Note — BitNet models (TQ1_0 / TQ2_0 quantization): BitNet models require special backend handling on Adreno GPUs. When a BitNet model is detected and no explicit main-gpu is set:

Adreno 800+ (e.g. Adreno 830): Vulkan is used instead of OpenCL.
Adreno < 800 (e.g. Adreno 740): Falls back to CPU, as TQ kernels are not yet optimized for older Adreno OpenCL/Vulkan.
Non-Adreno GPUs: Normal GPU selection applies (no special behavior).

Dependencies:

qvac-lib-inference-addon-cpp (≥1.1.2): C++ addon framework (single-job runner)
qvac-fabric-llm.cpp (≥7248.2.1): Inference engine
Bare Runtime (≥1.24.0): JavaScript runtime
Linux requires Clang/LLVM 19 with libc++

Installation

Prerequisites

Ensure that the Bare Runtime is installed globally on your system. If it's not already installed, you can install it using:

npm install -g bare@latest

Installing the Package

npm install @qvac/llm-llamacpp@latest

Building from Source

See build.md for detailed instructions on how to build the addon from source.

Usage

1. Import the Model Class

const LlmLlamacpp = require('@qvac/llm-llamacpp')

2. Create a Data Loader

Data Loaders abstract the way model files are accessed. Use a FileSystemDataLoader to load model files from your local file system. Models can be downloaded directly from HuggingFace.

const FilesystemDL = require('@qvac/dl-filesystem')

// Download model from HuggingFace (see examples/utils.js for downloadModel helper)
const [modelName, dirPath] = await downloadModel(
  'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf',
  'Llama-3.2-1B-Instruct-Q4_0.gguf'
)

const fsDL = new FilesystemDL({ dirPath })

3. Create the `args` obj

const args = {
  loader: fsDL,
  opts: { stats: true },
  logger: console,
  diskPath: dirPath,
  modelName,
  // projectionModel: 'mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf' // for multimodal support you need to pass the projection model name
}

The args obj contains the following properties:

loader: The Data Loader instance from which the model file will be streamed.
logger: This property is used to create a QvacLogger instance, which handles all logging functionality.
opts.stats: This flag determines whether to calculate inference stats.
diskPath: The local directory where the model file will be downloaded to.
modelName: The name of model file in the Data Loader.
projectionModel: The name of the projection model file in the Data Loader. This is required for multimodal support.

4. Create the `config` obj

The config obj consists of a set of hyper-parameters which can be used to tweak the behaviour of the model.
All parameters must by strings.

// an example of possible configuration
const config = {
  gpu_layers: '99', // number of model layers offloaded to GPU.
  ctx_size: '1024', // context length
  device: 'cpu' // must be specified: 'gpu' or 'cpu' else it will throw an error
}

Parameter	Range / Type	Default	Description
device	`"gpu"` or `"cpu"`	— (required)	Device to run inference on
gpu_layers	integer	0	Number of model layers to offload to GPU
ctx_size	0 – model-dependent	4096 (0 = loaded from model)	Context window size
lora	string	—	Path to LoRA adapter file
temp	0.00 – 2.00	0.8	Sampling temperature
top_p	0 – 1	0.9	Top-p (nucleus) sampling
top_k	0 – 128	40	Top-k sampling
predict	integer (-1 = infinity)	-1	Maximum tokens to predict
seed	integer	-1 (random)	Random seed for sampling
no_mmap	"" (passing empty string sets the flag)	—	Disable memory mapping for model loading
reverse_prompt	string (comma-separated)	—	Stop generation when these strings are encountered
repeat_penalty	float	1.1	Repetition penalty
presence_penalty	float	0	Presence penalty for sampling
frequency_penalty	float	0	Frequency penalty for sampling
tools	`"true"` or `"false"`	`"false"`	Enable tool calling with jinja templating
tools_at_end	`"true"` or `"false"`	`"false"`	Place tools at end of prompt (details)
verbosity	0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG)	0	Logging verbosity level
n_discarded	integer	0	Tokens to discard in sliding window context
main-gpu	integer, `"integrated"`, or `"dedicated"`	—	GPU selection for multi-GPU systems

IGPU/GPU selection logic:

Scenario	main-gpu not specified	main-gpu: `"dedicated"`	main-gpu: `"integrated"`
Devices considered	All GPUs (dedicated + integrated)	Only dedicated GPUs	Only integrated GPUs
System with iGPU only	✅ Uses iGPU	❌ Falls back to CPU	✅ Uses iGPU
System with dedicated GPU only	✅ Uses dedicated GPU	✅ Uses dedicated GPU	❌ Falls back to CPU
System with both	✅ Uses dedicated GPU (preferred)	✅ Uses dedicated GPU	✅ Uses integrated GPU

5. Create Model Instance

const model = new LlmLlamacpp(args, config)

6. Load Model

await model.load()

Optionally you can pass the following parameters to tweak the loading behaviour.

close?: This boolean value determines whether to close the Data Loader after loading. Defaults to true
reportProgressCallback?: A callback function which gets called periodically with progress updates. It can be used to display overall progress percentage.

For example:

await model.load(false, progress => process.stdout.write(`\rOverall Progress: ${progress.overallProgress}%`))

Progress Callback Data

The progress callback receives an object with the following properties:

Property	Type	Description
`action`	string	Current operation being performed
`totalSize`	number	Total bytes to be loaded
`totalFiles`	number	Total number of files to process
`filesProcessed`	number	Number of files completed so far
`currentFile`	string	Name of file currently being processed
`currentFileProgress`	string	Percentage progress on current file
`overallProgress`	string	Overall loading progress percentage

7. Run Inference

Pass an array of messages (following the chat completion format) to the run method. Process the generated tokens asynchronously:

try {
  const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' }
  ]

  const response = await model.run(messages)
  const buffer = []

  // Option 1: Process streamed output using async iterator
  for await (const token of response.iterate()) {
    process.stdout.write(token) // Write token directly to output
    buffer.push(token)
  }

  // Option 2: Process streamed output using callback
  await response.onUpdate(token => { /* ... */ }).await()

  console.log('\n--- Full Response ---\n', buffer.join(''))

} catch (error) {
  console.error('Inference failed:', error)
}

When opts.stats is enabled, response.stats includes runtime metrics such as TTFT, TPS, token counters, and backendDevice ("cpu" or "gpu"). backendDevice reflects the resolved device used at runtime after backend selection/fallback logic, not only the requested config.

8. Release Resources

Unload the model when finished:

try {
  await model.unload()
  await fsDL.close()
} catch (error) {
  console.error('Failed to unload model:', error)
}

API behavior by state

The following table describes the expected behavior of run and cancel depending on the current state (idle vs a job running). cancel can be called on the model (model.cancel()) or on the response (response.cancel()); both target the same underlying job.

Current state	Action called	What happens
idle	run	Allowed — starts inference, returns `QvacResponse`
idle	cancel	Allowed — no-op (no job to cancel); Promise resolves
run	run	Throw — second `run()` throws "a job is already set or being processed" (can wait very briefly for previous job completion)
run	cancel	Allowed — cancels current job; Promise resolves when job has stopped

When run() is called while another job is active, the implementation first waits briefly for the previous job to settle. This preserves single-job behavior while still failing fast when the instance is busy. If the second run cannot be accepted (timeout or addon busy rejection), it throws:

"Cannot set new job: a job is already set or being processed"

Fine-tuning

The library supports LoRA finetuning of GGUF models: train small adapter weights on top of a base model, then save the adapter and load it at inference time via the lora config option. You can pause and resume training from checkpoints.

For the full API, dataset format, parameters, and examples, see the Finetuning guide.

Smart Home Showcase

A hands-on example that finetunes Qwen3-0.6B to act as a smart home tool-calling specialist. The base model tends to drift into conversational text or exhaust its token budget on reasoning — the finetuned adapter fixes both problems.

Train — smart-home-finetune.js runs a 1-epoch causal LoRA finetune on a 215-sample dataset of user requests paired with <tool_call> responses.
Evaluate — smart-home-finetuned-test.js runs the same prompts against the base model and the finetuned model, then prints a side-by-side comparison report (strictness, accuracy, thinking token usage, multi-turn stability).

Note on dataset diversity: The training dataset intentionally includes tool-calling samples from many domains (medical, irrigation, quantum, etc.), not just the 4 smart-home tools used in evaluation. The goal is to teach the model the general behavioral pattern — produce structured <tool_call> output instead of conversational text — rather than memorize specific tool names. The evaluation then tests whether that pattern transfers to smart-home prompts the model wasn't explicitly drilled on.

# Train the adapter
bare examples/finetune/showcase/smart-home-finetune.js

# Compare baseline vs finetuned
bare examples/finetune/showcase/smart-home-finetuned-test.js

Quickstart Example

Clone the repository and navigate to it:

cd qvac-lib-infer-llamacpp-llm

Install dependencies:

npm install

Run the quickstart example (uses examples/quickstart.js):

npm run quickstart

Other examples

SalamandraTA – Demonstrates SalamandraTA model usage.
Multimodal – Demonstrates how to run multimodal inference.
Multi-Cache – Demonstrates session handling and caching capabilities.
Native Logging – Demonstrates C++ addon logging integration.
Tool Calling – Demonstrates tool calling capabilities.
LoRA Finetuning – Basic LoRA finetuning.
LoRA Finetuning Pause/Resume – Pause and resume finetuning.
LoRA Inference – Inference with a finetuned LoRA adapter.
Smart Home Finetune Showcase – Train a smart home tool-calling specialist, then evaluate baseline vs finetuned.
Bench Tools Placement – Benchmarks standard vs tools_at_end placement across multi-turn conversations.
Test Tool Removal – Demonstrates dynamic tool addition and removal between turns.

OCR with Vision-Language Models

In addition to ONNX-based OCR (@qvac/ocr-onnx), you can use vision-language models through @qvac/llm-llamacpp for OCR tasks. This is useful for structured document understanding (tables, forms, multi-column layouts) where traditional OCR pipelines struggle.

Supported OCR Models

Model	Params	Quantization	Description
LightON OCR-2 1B	0.6B (LLM) + ~550M (vision)	Q4_K_M	OCR-specialized, full-page transcription, 11 languages
SmolVLM2-500M	500M	Q8_0	General vision-language, can follow targeted extraction prompts

LightON OCR-2

LightON OCR-2 is an OCR-specialized vision-language model (Apache 2.0) that produces detailed markdown/HTML output with tables. It supports 11 languages: English, French, German, Spanish, Italian, Dutch, Portuguese, Polish, Romanian, Czech, and Swedish.

Characteristics:

Always does full-page transcription regardless of prompt
Produces detailed structured output (markdown tables, HTML)
Requires --jinja flag / jinja chat template in llama.cpp
Requires both LLM model and F16 mmproj (vision projector)

Performance (Pixel 10 Pro, CPU-only, Q4_K_M + F16 mmproj):

Image encode: ~30s (768x1024 image)
Prompt eval: 26.6 t/s
Generation: 4.14 t/s

Usage Example:

const LlmLlamacpp = require('@qvac/llm-llamacpp')
const FilesystemDL = require('@qvac/dl-filesystem')
const fs = require('bare-fs')

const dirPath = './models'
const loader = new FilesystemDL({ dirPath })

const model = new LlmLlamacpp({
  modelName: 'LightOnOCR-2-1B-ocr-soup-Q4_K_M.gguf',
  loader,
  logger: console,
  diskPath: dirPath,
  projectionModel: 'mmproj-F16.gguf'
}, {
  device: 'cpu',
  gpu_layers: '0',
  ctx_size: '4096',
  temp: '0.1',
  predict: '2048'
})

await model.load()

const imageBytes = new Uint8Array(fs.readFileSync('./document.png'))

const messages = [
  { role: 'user', type: 'media', content: imageBytes },
  { role: 'user', content: 'Extract all text from this image and format it as markdown.' }
]

const response = await model.run(messages)
const output = []

response.onUpdate(token => {
  output.push(token)
})

await response.await()

console.log(output.join(''))

await model.unload()
await loader.close()

Architecture

See docs/ for a detailed explanation of the architecture and data flow logic.

Benchmarking

Comprehensive benchmarking suite for evaluating @qvac/llm-llamacpp addon (native C++ GGUF) on reasoning, comprehension, and knowledge tasks. Supports single-model evaluation and comparative analysis vs HuggingFace Transformers (Python).

Supported Datasets:

SQuAD (Reading Comprehension) - F1 Score
ARC (Scientific Reasoning) - Accuracy
MMLU (Knowledge) - Accuracy
GSM8K (Math Reasoning) - Accuracy

# Single model evaluation
npm run benchmarks -- \
  --gguf-model "bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_0" \
  --samples 10

# Compare addon vs transformers
npm run benchmarks -- \
  --compare \
  --gguf-model "bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_0" \
  --transformers-model "meta-llama/Llama-3.2-1B-Instruct" \
  --hf-token YOUR_TOKEN \
  --samples 10

Platform Support: Unix/Linux/macOS (bash), Windows (PowerShell, Git Bash)

→ For detailed guide, see benchmarks/README.md

Tests

Integration tests are located in test/integration/ and cover core functionality including model loading, inference, tool calling, multimodal capabilities, and configuration parameters.
These tests help prevent regressions and ensure the library remains stable as contributions are made to the project.

Unit tests are located in test/unit/ and test the C++ addon components at a lower level, including backend selection, cache management, chat templates, context handling, and UTF8 token processing.
These tests validate the native implementation and help catch issues early in development.

Glossary

• Bare Runtime – Small and modular JavaScript runtime for desktop and mobile. Learn more.

License

This project is licensed under the Apache-2.0 License – see the LICENSE file for details.

For questions or issues, please open an issue on the GitHub repository.

FAQs

What is @qvac/llm-llamacpp?

Is @qvac/llm-llamacpp well maintained?

Package last updated on 08 Apr 2026

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

@qvac/llm-llamacpp

qvac-lib-infer-llamacpp-llm

Table of Contents

Supported platforms

Installation

Prerequisites

Installing the Package

Building from Source

Usage

1. Import the Model Class

2. Create a Data Loader

3. Create the `args` obj

4. Create the `config` obj

IGPU/GPU selection logic:

5. Create Model Instance

6. Load Model

7. Run Inference

8. Release Resources

API behavior by state

Fine-tuning

Smart Home Showcase

Quickstart Example

Other examples

OCR with Vision-Language Models

Supported OCR Models

LightON OCR-2

Architecture

Benchmarking

Tests

Glossary

License

Related posts

@qvac/llm-llamacpp

qvac-lib-infer-llamacpp-llm

Table of Contents

Supported platforms

Installation

Prerequisites

Installing the Package

Building from Source

Usage

1. Import the Model Class

2. Create a Data Loader

3. Create the args obj

4. Create the config obj

IGPU/GPU selection logic:

5. Create Model Instance

6. Load Model

7. Run Inference

8. Release Resources

API behavior by state

Fine-tuning

Smart Home Showcase

Quickstart Example

Other examples

OCR with Vision-Language Models

Supported OCR Models

LightON OCR-2

Architecture

Benchmarking

Tests

Glossary

License

Related posts

Don't Kill the Goose That Lays the Golden Eggs

Feross on TBPN: How North Korea Hijacked Axios

3. Create the `args` obj

4. Create the `config` obj