@fugood/buttress-server


Version 2.24.6 (latest, npm) · 5 maintainers

Buttress Server

A high-performance RPC server for managing GGML LLM generators with configurable defaults and runtime management.

Installation

npm install -g @fugood/buttress-server

Quick Start

Using CLI

# Start with config file
npx bricks-buttress --config ./config.toml

# Start without config (uses env vars and defaults)
npx bricks-buttress

Workspace Binding (bricks buttress)

By default, a buttress-server runs in public mode: any client on the LAN can connect, no auth required. To restrict access to a single BRICKS workspace and enable workspace-scoped JWT auth, bind the server with the bricks buttress CLI commands. Once bound, the server only accepts WebSocket / file-transfer requests carrying a valid access token signed by that workspace's issuer.

The bricks CLI is the tool that performs the binding and writes the local state file. Install it first (see the bricks-cli docs), then run bricks auth login with the workspace owner's account before running the commands below.

Bind a server to a workspace

# Pair the local machine's buttress-server with the workspace of the current bricks-cli profile
bricks buttress bind

# Override the auto-detected server id, give it a friendly name, or write to a custom state dir
bricks buttress bind --server-id buttress-mac-studio --name "Studio LLM" --state-dir /etc/buttress

# For headless/remote setups: emit state.json to stdout instead of writing to disk
bricks buttress bind --print > /etc/buttress/state.json

The state file (~/.bricks-cli/buttress/state.json by default, or $BRICKS_BUTTRESS_STATE_DIR) stores:

  • workspace.id / workspace.name — which workspace this server belongs to
  • workspace.serverId — the server's stable id (defaults to buttress-<machineId>)
  • workspace.issuerPublicKey + workspace.kid — Ed25519 SPKI used to verify access tokens

Restart bricks-buttress after binding for the change to take effect — the state file is read once at startup.
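
As a rough illustration of the state file described above, the sketch below resolves the file path and reports which of the listed fields are missing. The helper names resolveStatePath and missingStateFields are hypothetical, and the nested workspace object is an assumption inferred from the field names; the real file may contain more fields or be laid out differently.

```javascript
// Hypothetical helpers: the actual state.json layout may differ.
function resolveStatePath(env, homeDir) {
  // Assumes $BRICKS_BUTTRESS_STATE_DIR names a directory that holds state.json.
  const dir = env.BRICKS_BUTTRESS_STATE_DIR || `${homeDir}/.bricks-cli/buttress`;
  return `${dir}/state.json`;
}

function missingStateFields(state) {
  // Field names taken from the list above; the nesting under "workspace" is assumed.
  const required = ['id', 'name', 'serverId', 'issuerPublicKey', 'kid'];
  const workspace = (state && state.workspace) || {};
  return required.filter((key) => workspace[key] === undefined);
}
```

A startup check could call missingStateFields on the parsed JSON and refuse to enter bound mode when the result is non-empty.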

Inspect bindings

# Show local state.json + the workspace-side bound list
bricks buttress status

# Same, JSON-formatted
bricks buttress status --json

Discover servers on the LAN

# UDP scan + HTTP /buttress/info verification (3s timeout by default)
bricks buttress scan

# UDP only (skip the /buttress/info round-trip)
bricks buttress scan --udp-only

# Machine-readable
bricks buttress scan --json

scan lists every buttress-server visible on the LAN, including unbound (public) ones, with their version, auth state (open vs JWT required + kid), bound workspace, and per-generator hardware caps (score, GPU, usable memory). Servers whose workspace matches your current bricks-cli profile are highlighted; this is purely a discovery command and does not mint any tokens.

Unbind

# Remove the binding from the workspace and delete the local state.json
bricks buttress unbind

# Keep the local state file (useful if you only want to revoke server-side)
bricks buttress unbind --keep-local

After unbinding, restart the server to return it to public mode.

Issue a long-lived access token

For headless callers (CI, ctor agents) that already hold a workspace token, mint a long-lived buttress access token instead of relying on a per-launcher session token:

# Default 30-day TTL
bricks buttress issue-token

# Custom TTL (seconds), JSON output for scripting
bricks buttress issue-token --ttl 3600 --json

The token carries the claims { k: 'ba', w_id, st: 'ws', sid, jti, exp }, and any buttress-server bound to the same workspace will accept it.
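
For scripting, it can help to inspect a token before using it. The sketch below decodes the payload of a compact three-part JWT and reads the claims named above. It assumes the token uses the usual header.payload.signature form with exp in seconds, and that k: 'ba' marks a buttress access token; it does not verify the Ed25519 signature, so it is not a substitute for the server's own check. The function names are hypothetical.

```javascript
// Decode (NOT verify) the payload of a JWT-style token.
function decodeClaims(token) {
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('expected header.payload.signature');
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
}

// Summarize the claims listed above; the claim meanings are inferred, not documented.
function describeToken(token, nowSeconds = Math.floor(Date.now() / 1000)) {
  const claims = decodeClaims(token);
  return {
    isButtressAccess: claims.k === 'ba',
    workspaceId: claims.w_id,
    expired: typeof claims.exp === 'number' && claims.exp <= nowSeconds,
  };
}
```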

Configuration

Configuration is loaded from a TOML file passed via --config / -c. Every top-level table is optional — missing sections fall back to defaults. See config/sample.toml for an end-to-end example.

Top-level sections

| Section | Purpose |
| --- | --- |
| [env] | Environment variables exported into the process only if not already set |
| [server] | HTTP/RPC listener (port, log level, body limits) |
| [runtime] | Global defaults shared by every generator (most [generators.model] keys may live here too) |
| [runtime.session_cache] | KV-cache reuse store (see Session State Cache) |
| [autodiscover] | LAN UDP / HTTP / mDNS discovery toggles |
| [openai_compat] | Enable /oai-compat/v1/* (see Compatibility Endpoints) |
| [anthropic_messages] | Enable /anthropic-messages (see Compatibility Endpoints) |
| [[generators]] | Array of generator instances, one entry per loaded model |

[env]

[env]
HUGGINGFACE_TOKEN = "hf_xxx"   # ggml backends read this; HF_TOKEN is not picked up automatically
CUDA_VISIBLE_DEVICES = "0"

Values here are exported only when the variable isn't already set in the process — see Environment Variable Priority. For HuggingFace auth across all backends, [runtime] huggingface_token = "hf_xxx" works regardless of variable name.

[server]

| Key | Type | Default |
| --- | --- | --- |
| id | string | buttress-<machineId>; stable id used for autodiscover / binding |
| name | string | Buttress Server (<short id>); display name |
| port | number | 2080 (overridden by --port) |
| log_level | "debug" / "info" / "warn" / "error" | unset |
| max_body_size | string / number | "50MB" (e.g. "100MB", "1GB", or raw bytes) |
| session_timeout | string / number | 60000 ms (accepts ms numbers or duration strings such as "30s") |
| temp_file_dir | string | $TMPDIR/.buttress |
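
To make the string-or-number values of max_body_size and session_timeout concrete, here is a minimal parsing sketch. The unit tables are assumptions: the source only shows "MB", "GB", and "s", and does not say whether sizes are 1024-based or 1000-based, so the server's actual parser may accept different units or produce different byte counts.

```javascript
// Sketch only: unit tables are assumptions, not the server's documented behavior.
const SIZE_UNITS = new Map([['b', 1], ['kb', 1024], ['mb', 1024 ** 2], ['gb', 1024 ** 3]]);
const TIME_UNITS = new Map([['ms', 1], ['s', 1000], ['m', 60 * 1000], ['h', 60 * 60 * 1000]]);

function parseWithUnits(value, units, label) {
  if (typeof value === 'number') return value; // raw bytes or raw milliseconds
  const match = /^(\d+(?:\.\d+)?)\s*([a-z]+)$/i.exec(String(value).trim());
  const factor = match ? units.get(match[2].toLowerCase()) : undefined;
  if (factor === undefined) throw new Error(`invalid ${label}: ${value}`);
  return Math.round(Number(match[1]) * factor);
}

const parseSize = (value) => parseWithUnits(value, SIZE_UNITS, 'size');
const parseDuration = (value) => parseWithUnits(value, TIME_UNITS, 'duration');
```

Under these assumptions, parseDuration("30s") and parseDuration(30000) yield the same value, which is the equivalence the session_timeout row describes.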

[runtime] — global generator defaults

Most ggml-llm [generators.model] keys can also live in [runtime] as defaults. Per-generator values win; otherwise the runtime default applies.

| Key | Type | Notes |
| --- | --- | --- |
| cache_dir | string | Model + metadata cache root (default ~/.buttress/models) |
| huggingface_token | string | Falls back to $HUGGINGFACE_TOKEN |
| http_headers | table | Extra headers attached to HF / HTTP downloads |
| context_release_delay_ms | number | Idle time before unloading a context (default 10000; 0 = immediate) |
| prefer_variants | string[] | Override variant probe order (ggml backends) |
| n_threads | number | CPU thread count |
| n_ctx | number | Context window (per-model value wins; auto-capped at training context) |
| n_gpu_layers | number / "auto" | Layers offloaded to GPU (default "auto") |
| n_batch / n_ubatch | number | Prompt batch / micro-batch size. Note: n_batch has a model-level default of 512 that shadows the runtime value unless [generators.model] n_batch is set explicitly. |
| n_parallel | number | Parallel sequences (default 4) |
| n_cpu_moe | number | MoE expert layers offloaded to CPU |
| flash_attn_type | "on" / "off" / "auto" | When a GPU backend is selected, defaults to "auto"; on CPU, defaults to "off". Explicit "on" / "off" / "auto" overrides. |
| cache_type_k, cache_type_v | string | KV-cache dtype (f16, f32, q8_0, q4_0, …) |
| kv_unified | boolean | Use a unified KV cache across sequences |
| swa_full | boolean | Materialize full attention even for sliding-window layers |
| ctx_shift | boolean | Allow llama.cpp's rolling context shift |
| use_mmap, use_mlock | boolean | Memory-mapping / locking |
| no_extra_bufts | boolean | Disable extra compute buffer types |
| cpu_mask, cpu_strict | string / boolean | CPU affinity (advanced) |
| devices | string[] | Restrict to specific GGML devices |
| Speculative keys | various | speculative, spec_type, spec_draft_n_max/n_min/p_min/p_split |
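
The precedence rule (per-generator values win over [runtime] defaults) and the n_batch caveat can be pictured as a layered merge. The sketch below is an assumption about how such a merge might look, not the server's actual code; resolveModelOptions and MODEL_DEFAULTS are hypothetical names.

```javascript
// Model-level defaults sit between [runtime] and [generators.model] in this sketch,
// which is why a runtime n_batch is shadowed unless the model sets it explicitly.
const MODEL_DEFAULTS = { n_batch: 512 };

function resolveModelOptions(runtime, model) {
  // Later spreads win: runtime < model-level defaults < explicit per-generator keys.
  return { ...runtime, ...MODEL_DEFAULTS, ...model };
}

// The runtime n_batch is shadowed by the model-level default of 512.
const shadowed = resolveModelOptions({ n_batch: 2048, n_parallel: 4 }, { repo_id: 'org/repo' });
// An explicit per-generator n_batch takes effect.
const explicit = resolveModelOptions({ n_batch: 2048 }, { repo_id: 'org/repo', n_batch: 1024 });
```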

[autodiscover]

Set autodiscover = true for defaults, false (or omit) to disable, or a table for fine control:

[autodiscover]
udp.port = 8089
udp.announcements = { enabled = true, interval = 5000 }
udp.requests       = { enabled = true, responseDelay = 100 }
http.enabled = true
http.path    = "/buttress/info"
http.cors    = true
# mdns.enabled = false   # Bonjour/Avahi advertisement (optional)

[[generators]]

Every generator entry has a type, an optional [generators.backend] table, and a [generators.model] table:

[[generators]]
type = "ggml-llm"        # or "ggml-stt" / "mlx-llm"

[generators.backend]
# (see per-type sections below)

[generators.model]
repo_id = "..."
# (see per-type sections below)

Common [generators.model] keys

Shared by all generator types:

| Key | Type | Notes |
| --- | --- | --- |
| repo_id (required) | string | HuggingFace repo (org/repo) |
| revision | string | Default "main" |
| download | boolean | Pre-download at server startup (default false) |

Additional keys honored by ggml-llm and ggml-stt (mlx-llm gets quantization from the repo itself and does not use these):

| Key | Type | Notes |
| --- | --- | --- |
| filename | string | Pin a specific artifact in the repo |
| url | string | Direct download URL (skips manifest lookup) |
| quantization | string | Preferred quant tag, e.g. q4_0, q8_0, mxfp4 |
| preferred_quantizations | string[] | Ordered fallback list when quantization doesn't match (alias: quantizations) |
| allow_local_file | boolean | Required to use local_path / mmproj_local_path |
| local_path | string | Use a local file as the load path. Repo metadata is still resolved from HF, so repo_id is still required. |
| api_base, base_url | string | Override HF API / blob hosts (mirrors / proxies) |
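
The interplay between quantization and preferred_quantizations can be sketched as an ordered search over the files in a repo. The matching rule used here (case-insensitive substring match on the file name) is an assumption for illustration; the server's real resolver may use repo metadata instead. pickArtifact is a hypothetical name.

```javascript
// Try the preferred quant tag first, then each fallback in order.
function pickArtifact(files, quantization, preferredQuantizations = []) {
  const candidates = [quantization, ...preferredQuantizations].filter(Boolean);
  for (const tag of candidates) {
    const hit = files.find((file) => file.toLowerCase().includes(tag.toLowerCase()));
    if (hit) return hit;
  }
  return undefined; // nothing matched; the caller decides what to do next
}
```

With files named model-q8_0.gguf and model-mxfp4.gguf, a request for q4_0 with a fallback list of ["q8_0"] would resolve to the q8_0 file under this rule.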

ggml-llm (llama.cpp via @fugood/llama.node)

Loads a GGUF LLM. Runtime keys above can be overridden per-generator under [generators.model]; [generators.backend] only controls backend selection and resource planning.

[generators.backend]

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| variant | string | auto | Force cuda / vulkan / snapdragon / default |
| variant_preference | string[] | ["cuda","vulkan","snapdragon","default"] | Probe order when variant is unset |
| gpu_memory_fraction | number | 0.85 | Max GPU fraction the hardware guardrails may plan against |
| cpu_memory_fraction | number | 0.5 | Max RAM fraction for CPU-side buffers |
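
Backend selection as described in this table amounts to "use the forced variant if given, otherwise take the first available entry in the probe order". The sketch below assumes an isAvailable callback standing in for whatever hardware probing the server performs; both pickVariant and the callback are hypothetical.

```javascript
const DEFAULT_PREFERENCE = ['cuda', 'vulkan', 'snapdragon', 'default'];

function pickVariant(backend, isAvailable) {
  // An explicit variant other than "auto" bypasses probing entirely.
  if (backend.variant && backend.variant !== 'auto') return backend.variant;
  const order = backend.variant_preference || DEFAULT_PREFERENCE;
  // First variant in the probe order that the (hypothetical) probe accepts.
  const found = order.find((variant) => isAvailable(variant));
  return found || 'default';
}
```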

[generators.model] — in addition to the common keys above:

| Key | Type | Notes |
| --- | --- | --- |
| n_ctx | number | Context window. Auto-capped at the model's training context. |
| n_gpu_layers | number / "auto" | Layers offloaded to GPU (default "auto") |
| n_batch | number | Prompt batch size (default 512) |
| n_ubatch, n_threads, n_parallel, n_cpu_moe | number | Same semantics as the [runtime] defaults |
| flash_attn_type, cache_type_k, cache_type_v, kv_unified, swa_full, ctx_shift, use_mmap, use_mlock, no_extra_bufts, cpu_mask, cpu_strict, devices | various | Per-model overrides for the [runtime] defaults |

Multimodal (mtmd) — auto-downloads the matching mmproj-*.gguf from the same repo and calls initMultimodal:

| Key | Type | Notes |
| --- | --- | --- |
| enable_mtmd | boolean | Default false |
| mmproj_filename | string | Pin a specific projector file |
| mmproj_url | string | Direct URL override |
| mmproj_local_path | string | Local projector (requires allow_local_file = true) |
| mmproj_use_gpu | boolean | null = auto (true when n_gpu_layers > 0) |
| mmproj_image_min_tokens | number | Min visual tokens (dynamic-resolution models; -1 = unset) |
| mmproj_image_max_tokens | number | Max visual tokens (-1 = unset) |

Speculative decoding

| Key | Type | Notes |
| --- | --- | --- |
| speculative | string | Draft model identifier |
| spec_type | string | Strategy (backend-defined) |
| spec_draft_n_max | int | Max drafted tokens per step |
| spec_draft_n_min | int | Min drafted tokens |
| spec_draft_p_min | number | Min acceptance probability |
| spec_draft_p_split | number | Split threshold |

Example

[[generators]]
type = "ggml-llm"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
gpu_memory_fraction = 0.95
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800
download = true

ggml-stt (whisper.cpp via @fugood/whisper.node)

Loads a Whisper GGML model for speech-to-text.

[generators.backend]

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| variant | string | auto | cuda / vulkan / default |
| variant_preference | string[] | ["cuda","vulkan","default"] | Probe order |
| gpu_memory_fraction | number | 0.85 | |
| cpu_memory_fraction | number | 0.5 | |

[generators.model] — common keys plus:

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| repo_id | string | "BricksDisplay/whisper-ggml" | Defaulted (unlike ggml-llm) |
| preferred_quantizations | string[] | ["q8_0", <no-quant>, "q5_1"] | Default fallback chain |
| use_gpu | boolean | true | Force-disable GPU even when available |
| use_flash_attn | "on" / "off" / "auto" / boolean | "auto" | "auto" enables flash-attn when GPU is in use. true/false are accepted as shortcuts for "on"/"off". |
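
The use_flash_attn rules in the last row can be written as a small normalization step. This is a sketch of the documented behavior, not the server's code; resolveFlashAttn and the gpuInUse flag are hypothetical names.

```javascript
// Returns true when flash attention should be enabled for ggml-stt.
function resolveFlashAttn(value, gpuInUse) {
  let mode = value === undefined ? 'auto' : value; // "auto" is the documented default
  if (mode === true) mode = 'on';   // boolean shortcuts
  if (mode === false) mode = 'off';
  if (mode === 'auto') return Boolean(gpuInUse); // enabled only when the GPU is in use
  if (mode === 'on') return true;
  if (mode === 'off') return false;
  throw new Error(`invalid use_flash_attn: ${value}`);
}
```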

Runtime extras — under [runtime] for ggml-stt only:

| Key | Type | Notes |
| --- | --- | --- |
| max_threads | number | Caps the whisper.cpp thread count |

Example

[[generators]]
type = "ggml-stt"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-large-v3-turbo-q8_0.bin"
use_gpu = true
use_flash_attn = "on"
download = true

mlx-llm (Apple Silicon, Python mlx-lm / mlx-vlm bridge)

Loads an MLX-format model on Apple Silicon. On first use, the backend creates a virtualenv at {cache_dir}/mlx-env and installs mlx_lm_package, mlx_vlm_package, plus torch and torchvision (required by some VLM processors). If an existing venv already has mlx_vlm and torch importable, the install step is skipped. There is no [generators.backend] section.

[generators.model] — common repo_id / revision / download plus:

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| adapter_path | string | | Local LoRA adapter directory |
| vlm | "auto" / boolean | "auto" | Force VLM (true) vs text-only (false); "auto" infers from the repo |
| tokenizer_config | table | | Forwarded to mlx_lm.load(..., tokenizer_config=...) |
| model_config | table | | Forwarded to mlx_lm.load(..., model_config=...) |

quantization, filename, and preferred_quantizations are not used — the MLX repo itself determines the quantization.

Runtime extras — under [runtime] for mlx-llm:

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| mlx_env_dir | string | {cache_dir}/mlx-env | Location of the auto-managed Python venv |
| mlx_lm_package | string | "mlx-lm==0.31.1" | pip spec used when provisioning the venv |
| mlx_vlm_package | string | "mlx-vlm==0.4.0" | pip spec used when provisioning the venv |
| session_cache.* | table | enabled, 5GB, 100 entries | Separate cache from ggml-llm (lives in {cache_dir}/mlx-session-cache) |

Example

[[generators]]
type = "mlx-llm"
[generators.model]
repo_id = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
vlm = true
download = true

Programmatic Usage

import { startServer } from '@fugood/buttress-server'

startServer({
  port: 3000,
  defaultConfig: {
    runtime: {
      cache_dir: './.buttress-cache'
    },
    generators: [
      {
        type: 'ggml-llm',
        model: {
          repo_id: 'ggml-org/gemma-3-270m-qat-GGUF',
          quantization: 'mxfp4',
        }
      }
    ]
  }
})
  .then(({ port }) => {
    console.log(`Server running on port ${port}`)
  })
  .catch(console.error)

Environment Variable Priority

Environment variables can be set in the [env] section of the TOML config. These values will only be applied if the environment variable is not already set in the system. This allows:

  • Default values in config file
  • System environment variables to override config values
  • Command-line exports to have highest priority

Example:

# Config has: [env] HUGGINGFACE_TOKEN = "default_token"

# This will use the system env variable (highest priority)
HUGGINGFACE_TOKEN=my_token npx bricks-buttress

# This will use the config value
npx bricks-buttress
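
The "only if not already set" rule can be sketched in a few lines. This is an illustrative assumption about the mechanism, not the server's code: applyEnvDefaults is a hypothetical name, and the sketch treats only undefined as "not set" (whether an empty string counts as set is not stated in the source).

```javascript
// Copy [env] values into an environment object without overriding existing entries.
function applyEnvDefaults(envTable, env) {
  const applied = [];
  for (const [name, value] of Object.entries(envTable || {})) {
    if (env[name] === undefined) {
      env[name] = String(value); // environment values are always strings
      applied.push(name);
    }
  }
  return applied; // names that actually came from the config file
}
```

In a real process the second argument would be process.env; passing a plain object keeps the sketch easy to try in isolation.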

Port Priority

Port can be configured via multiple sources (highest priority first):

  • Command-line flag: --port 3000
  • Config file: [server] port = 2080
  • Default: 2080
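
The priority order above reduces to a simple fallback chain. A minimal sketch, with resolvePort as a hypothetical helper and the config object shaped like the parsed TOML:

```javascript
// --port beats [server] port, which beats the built-in default of 2080.
function resolvePort(cliPort, config) {
  if (cliPort !== undefined) return Number(cliPort);
  const configured = config && config.server ? config.server.port : undefined;
  return configured !== undefined ? Number(configured) : 2080;
}
```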

CLI Reference

bricks-buttress v2.23.0-beta.22

Buttress server for remote inference with GGML backends.

Usage:
  bricks-buttress [options]

Options:
  -h, --help                    Show this help message
  -v, --version                 Show version number
  -p, --port <port>             Port to listen on (default: 2080)
  -c, --config <path|toml>      Path to TOML config file or inline TOML string

Testing Options:
  --test-caps <backend>         Test model capabilities (ggml-llm or ggml-stt)
  --test-caps-model-id <id>     Model ID to test (used with --test-caps)
  --test-models <ids>           Comma-separated list of model IDs to test
  --test-models-default         Test default set of models

  Note: --test-models and --test-models-default output a markdown report
        file (e.g., ggml-llm-model-capabilities-YYYY-MM-DD.md)

Environment Variables:
  NODE_ENV                      Set to 'development' for dev mode

Examples:
  bricks-buttress
  bricks-buttress --port 3000
  bricks-buttress --config ./config.toml
  bricks-buttress --test-caps ggml-llm --test-models-default
  bricks-buttress --test-caps ggml-stt --test-caps-model-id BricksDisplay/whisper-ggml:ggml-small.bin

Compatibility Endpoints (Experimental)

The server can expose OpenAI- and Anthropic-compatible HTTP endpoints in addition to the native RPC. Each endpoint is opt-in via the TOML config:

[openai_compat]
enabled = true
# cors_allowed_origins = "*"          # Or a list of origins; defaults to disabled

[anthropic_messages]
enabled = true
# cors_allowed_origins = ["http://localhost:3000"]

| Endpoint | Config flag |
| --- | --- |
| /oai-compat/v1/* | [openai_compat] enabled = true |
| /anthropic-messages | [anthropic_messages] enabled = true |
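
As a hedged illustration of calling the OpenAI-compatible endpoint, the sketch below builds a chat request without sending it. The chat/completions sub-path, the request body shape, and the model identifier are assumptions based on the usual OpenAI API conventions; the source only states that /oai-compat/v1/* is exposed. buildChatRequest is a hypothetical helper.

```javascript
// Build (but do not send) a request for the assumed chat/completions route.
function buildChatRequest(baseUrl, model, userText) {
  return {
    url: `${baseUrl.replace(/\/+$/, '')}/oai-compat/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, messages: [{ role: 'user', content: userText }] }),
    },
  };
}

// Usage, against a running server with [openai_compat] enabled = true:
//   const { url, init } = buildChatRequest('http://localhost:2080', 'my-model', 'Hello');
//   const response = await fetch(url, init);
```

A server bound to a workspace may also require an access token on these requests; how it is passed is not covered by the source.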

Session State Cache

The server supports session state caching for ggml-llm generators, which saves KV cache state to disk after completions. This enables:

  • Prompt reuse: Same or similar prompts can reuse cached state, skipping prompt processing
  • Multi-turn conversations: Conversation history state is preserved across requests

Configuration

[runtime.session_cache]
enabled = true                  # Enable/disable session caching (default: true)
max_size_bytes = "10GB"         # Supports string (e.g., "10GB", "500MB") or number (default: 10GB)
max_entries = 1000              # Max number of cached entries (default: 1000)

How it works

  • After a successful completion, the KV cache state is saved to disk
  • On new completions, the server checks if any cached state matches the prompt prefix
  • If a match is found, the cached state is loaded, skipping redundant prompt processing
  • LRU eviction removes oldest entries when limits are exceeded
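
The steps above can be pictured with a small in-memory stand-in. This sketch only models the prefix lookup and entry-count LRU eviction; the real cache stores binary KV state on disk and also enforces max_size_bytes, neither of which is reproduced here. PrefixCache and its token-array keys are hypothetical.

```javascript
class PrefixCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map(); // Map insertion order doubles as recency order
  }

  // Record the token sequence of a finished completion.
  save(tokens) {
    const key = tokens.join(',');
    this.entries.delete(key);
    this.entries.set(key, tokens);
    while (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey); // evict the least recently used entry
    }
  }

  // Return the longest cached sequence that is a prefix of the prompt, or null.
  findPrefix(prompt) {
    let best = null;
    for (const [key, tokens] of this.entries) {
      const isPrefix = tokens.length <= prompt.length && tokens.every((t, i) => t === prompt[i]);
      if (isPrefix && (!best || tokens.length > best.tokens.length)) best = { key, tokens };
    }
    if (best) {
      this.entries.delete(best.key); // refresh recency on a hit
      this.entries.set(best.key, best.tokens);
    }
    return best ? best.tokens : null;
  }
}
```

On a hit, only the tokens after the matched prefix would need prompt processing, which is the saving the list describes.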

Cache location

Cache files are stored in {cache_dir}/.session-state-cache/:

  • cache-map.json - Index of cached entries
  • states/ - Binary state files
  • temp/ - Temporary files (auto-cleaned after 1 hour)

Tips

  • macOS: Use sudo sysctl iogpu.wired_limit_mb=<number> to increase the GPU memory allocation. By default, about 70% of memory is available to the GPU. For example, if the machine has 128GB of memory, you can use sudo sysctl iogpu.wired_limit_mb=137438 to raise the limit to 128GB. Run sudo sysctl iogpu.wired_limit_mb=0 to revert to the default.

Keywords

BRICKS


Package last updated on 01 Jun 2026
