@fugood/buttress-server

A high-performance RPC server for managing GGML LLM generators with configurable defaults and runtime management.

Version: 2.24.6

Version published: 2 days ago

Maintainers: 5

Created: 5 months ago

Buttress Server

Installation

npm install -g @fugood/buttress-server

Quick Start

Using CLI

# Start with config file
npx bricks-buttress --config ./config.toml

# Start without config (uses env vars and defaults)
npx bricks-buttress

Workspace Binding (`bricks buttress`)

By default, a buttress-server runs in public mode: any client on the LAN can connect, no auth required. To restrict access to a single BRICKS workspace and enable workspace-scoped JWT auth, bind the server with the bricks buttress CLI commands. Once bound, the server only accepts WebSocket / file-transfer requests carrying a valid access token signed by that workspace's issuer.

The bricks CLI performs the binding and writes the local state file. Install it first (see the bricks-cli docs), then run bricks auth login with the workspace owner's account before running the commands below.

Bind a server to a workspace

# Pair the local machine's buttress-server with the workspace of the current bricks-cli profile
bricks buttress bind

# Override the auto-detected server id, give it a friendly name, or write to a custom state dir
bricks buttress bind --server-id buttress-mac-studio --name "Studio LLM" --state-dir /etc/buttress

# For headless/remote setups: emit state.json to stdout instead of writing to disk
bricks buttress bind --print > /etc/buttress/state.json

The state file (~/.bricks-cli/buttress/state.json by default, or $BRICKS_BUTTRESS_STATE_DIR) stores:

workspace.id / workspace.name — which workspace this server belongs to
workspace.serverId — the server's stable id (defaults to buttress-<machineId>)
workspace.issuerPublicKey + workspace.kid — Ed25519 SPKI used to verify access tokens
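
As a rough illustration of how the fields above might be consumed, the sketch below summarizes an already-parsed state.json. The nesting under a top-level `workspace` key is inferred from the dotted field names, and `describeBinding` is a hypothetical helper, not part of bricks-cli or buttress-server.

```javascript
// Hypothetical helper: summarize a parsed state.json object.
// Assumption: fields are nested under `workspace`, as the dotted names suggest.
function describeBinding(state) {
  const ws = state && state.workspace
  if (!ws || !ws.id || !ws.issuerPublicKey) {
    // No usable binding found: treat the server as running in public mode.
    return { mode: 'public' }
  }
  return {
    mode: 'bound',
    workspaceId: ws.id,
    workspaceName: ws.name ?? null,
    serverId: ws.serverId ?? null,
    kid: ws.kid ?? null,
  }
}
```

A script could load the file with `JSON.parse(fs.readFileSync(path, 'utf8'))` and pass the result to this helper to decide whether a restart into bound mode is expected.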

Restart bricks-buttress after binding for the change to take effect — the state file is read once at startup.

Inspect bindings

# Show local state.json + the workspace-side bound list
bricks buttress status

# Same, JSON-formatted
bricks buttress status --json

Discover servers on the LAN

# UDP scan + HTTP /buttress/info verification (3s timeout by default)
bricks buttress scan

# UDP only (skip the /buttress/info round-trip)
bricks buttress scan --udp-only

# Machine-readable
bricks buttress scan --json

scan lists every buttress-server visible on the LAN, including unbound (public) ones, with their version, auth state (open vs JWT required + kid), bound workspace, and per-generator hardware caps (score, GPU, usable memory). Servers whose workspace matches your current bricks-cli profile are highlighted; this is purely a discovery command and does not mint any tokens.

Unbind

# Remove the binding from the workspace and delete the local state.json
bricks buttress unbind

# Keep the local state file (useful if you only want to revoke server-side)
bricks buttress unbind --keep-local

After unbinding, restart the server to return it to public mode.

Issue a long-lived access token

For headless callers (CI, ctor agents) that already hold a workspace token, mint a long-lived buttress access token instead of relying on a per-launcher session token:

# Default 30-day TTL
bricks buttress issue-token

# Custom TTL (seconds), JSON output for scripting
bricks buttress issue-token --ttl 3600 --json

The token carries the claims { k: 'ba', w_id, st: 'ws', sid, jti, exp }, and any buttress-server bound to the same workspace will accept it.
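
To inspect such a token in a script, the payload can be decoded without verifying it. The sketch below assumes the token is a compact JWT (three dot-separated base64url segments) and that `exp` follows the usual JWT convention of seconds since the Unix epoch; both helper names are hypothetical. It does not check the Ed25519 signature, so it is only suitable for debugging.

```javascript
// Decode (NOT verify) the payload segment of a compact JWT.
function decodeClaims(token) {
  const parts = token.split('.')
  if (parts.length !== 3) throw new Error('not a compact JWT')
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'))
}

// Assumption: `exp` is expressed in seconds since the Unix epoch.
function secondsUntilExpiry(claims, nowMs = Date.now()) {
  return claims.exp - Math.floor(nowMs / 1000)
}
```

A negative result from `secondsUntilExpiry` would suggest the token needs to be reissued with bricks buttress issue-token.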

Configuration

Configuration is loaded from a TOML file passed via --config / -c. Every top-level table is optional — missing sections fall back to defaults. See config/sample.toml for an end-to-end example.

Top-level sections

Section	Purpose
`[env]`	Environment variables exported into the process only if not already set
`[server]`	HTTP/RPC listener (port, log level, body limits)
`[runtime]`	Global defaults shared by every generator (most `[generators.model]` keys may live here too)
`[runtime.session_cache]`	KV-cache reuse store — see Session State Cache
`[autodiscover]`	LAN UDP / HTTP / mDNS discovery toggles
`[openai_compat]`	Enable `/oai-compat/v1/*` — see Compatibility Endpoints
`[anthropic_messages]`	Enable `/anthropic-messages` — see Compatibility Endpoints
`[[generators]]`	Array of generator instances — one entry per loaded model

`[env]`

[env]
HUGGINGFACE_TOKEN = "hf_xxx"   # ggml backends read this; HF_TOKEN is not picked up automatically
CUDA_VISIBLE_DEVICES = "0"

Values here are exported only when the variable isn't already set in the process — see Environment Variable Priority. For HuggingFace auth across all backends, [runtime] huggingface_token = "hf_xxx" works regardless of variable name.

`[server]`

Key	Type	Default	Notes
`id`	string	`buttress-<machineId>`	Stable id used for autodiscover / binding
`name`	string	`Buttress Server (<short id>)`	Display name
`port`	number	`2080`	Overridden by `--port`
`log_level`	`"debug"` / `"info"` / `"warn"` / `"error"`	unset
`max_body_size`	string | number	`"50MB"`	E.g. `"100MB"`, `"1GB"`, or raw bytes
`session_timeout`	string | number	`60000` ms	Accepts ms numbers or duration strings (`"30s"`)
`temp_file_dir`	string	`$TMPDIR/.buttress`

`[runtime]` — global generator defaults

Most ggml-llm [generators.model] keys can also live in [runtime] as defaults. Per-generator values win; otherwise the runtime default applies.

Key	Type	Notes
`cache_dir`	string	Model + metadata cache root (default `~/.buttress/models`)
`huggingface_token`	string	Falls back to `$HUGGINGFACE_TOKEN`
`http_headers`	table	Extra headers attached to HF / HTTP downloads
`context_release_delay_ms`	number	Idle time before unloading a context (default `10000`; `0` = immediate)
`prefer_variants`	string[]	Override variant probe order (ggml backends)
`n_threads`	number	CPU thread count
`n_ctx`	number	Context window (per-model value wins; auto-capped at training context)
`n_gpu_layers`	number | `"auto"`	Layers offloaded to GPU (default `"auto"`)
`n_batch` / `n_ubatch`	number	Prompt batch / micro-batch size. Note: `n_batch` has a model-level default of `512` that shadows the runtime value unless `[generators.model] n_batch` is set explicitly.
`n_parallel`	number	Parallel sequences (default `4`)
`n_cpu_moe`	number	MoE expert layers offloaded to CPU
`flash_attn_type`	`"on"` / `"off"` / `"auto"`	When a GPU backend is selected, defaults to `"auto"`; on CPU, defaults to `"off"`. Explicit `"on"` / `"off"` / `"auto"` overrides.
`cache_type_k`, `cache_type_v`	string	KV-cache dtype (`f16`, `f32`, `q8_0`, `q4_0`, …)
`kv_unified`	boolean	Use a unified KV cache across sequences
`swa_full`	boolean	Materialize full attention even for sliding-window layers
`ctx_shift`	boolean	Allow llama.cpp's rolling context shift
`use_mmap`, `use_mlock`	boolean	Memory-mapping / locking
`no_extra_bufts`	boolean	Disable extra compute buffer types
`cpu_mask`, `cpu_strict`	string / boolean	CPU affinity (advanced)
`devices`	string[]	Restrict to specific GGML devices
Speculative keys	various	`speculative`, `spec_type`, `spec_draft_n_max/n_min/p_min/p_split`
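
The precedence described in this section can be pictured as a layered merge. The following is a minimal sketch, assuming a plain shallow merge; `resolveModelOptions` and `MODEL_DEFAULTS` are hypothetical names, and the only model-level default taken from the table is `n_batch = 512`.

```javascript
// Assumed model-level defaults; only n_batch = 512 is documented above.
const MODEL_DEFAULTS = { n_batch: 512 }

// Later layers win: [runtime] < model-level defaults < explicit [generators.model].
function resolveModelOptions(runtime, model) {
  return { ...runtime, ...MODEL_DEFAULTS, ...model }
}
```

Under these assumptions, a `[runtime] n_batch` value is shadowed by the model-level default unless `[generators.model] n_batch` is set, while keys such as `n_ctx` or `n_parallel` fall through from `[runtime]` when the generator does not set them.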

`[autodiscover]`

Set autodiscover = true for defaults, false (or omit) to disable, or a table for fine control:

[autodiscover]
udp.port = 8089
udp.announcements = { enabled = true, interval = 5000 }
udp.requests       = { enabled = true, responseDelay = 100 }
http.enabled = true
http.path    = "/buttress/info"
http.cors    = true
# mdns.enabled = false   # Bonjour/Avahi advertisement (optional)
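
The three accepted forms (true, false or omitted, and a table) could be normalized along these lines. This is only a sketch: `normalizeAutodiscover` is a hypothetical helper, the default values are copied from the sample table above rather than confirmed as the server's real defaults, and merging a partial table over defaults is an assumption.

```javascript
// Assumed defaults, copied from the sample table above (not confirmed).
const ASSUMED_AUTODISCOVER_DEFAULTS = {
  udp: {
    port: 8089,
    announcements: { enabled: true, interval: 5000 },
    requests: { enabled: true, responseDelay: 100 },
  },
  http: { enabled: true, path: '/buttress/info', cors: true },
}

function normalizeAutodiscover(value) {
  // false or omitted: discovery is disabled.
  if (value === undefined || value === false) return null
  // Deep copy so callers cannot mutate the shared defaults.
  const defaults = JSON.parse(JSON.stringify(ASSUMED_AUTODISCOVER_DEFAULTS))
  if (value === true) return defaults
  // Table form: overlay the provided sections on the assumed defaults.
  return {
    udp: { ...defaults.udp, ...(value.udp ?? {}) },
    http: { ...defaults.http, ...(value.http ?? {}) },
    ...(value.mdns ? { mdns: value.mdns } : {}),
  }
}
```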

`[[generators]]`

Every generator entry has a type, an optional [generators.backend] table, and a [generators.model] table:

[[generators]]
type = "ggml-llm"        # or "ggml-stt" / "mlx-llm"

[generators.backend]
# (see per-type sections below)

[generators.model]
repo_id = "..."
# (see per-type sections below)

Common `[generators.model]` keys

Shared by all generator types:

Key	Type	Notes
`repo_id` (required)	string	HuggingFace repo (`org/repo`)
`revision`	string	Default `"main"`
`download`	boolean	Pre-download at server startup (default `false`)

Additional keys honored by ggml-llm and ggml-stt (mlx-llm gets quantization from the repo itself and does not use these):

Key	Type	Notes
`filename`	string	Pin a specific artifact in the repo
`url`	string	Direct download URL (skips manifest lookup)
`quantization`	string	Preferred quant tag — e.g. `q4_0`, `q8_0`, `mxfp4`
`preferred_quantizations`	string[]	Ordered fallback list when `quantization` doesn't match (alias: `quantizations`)
`allow_local_file`	boolean	Required to use `local_path` / `mmproj_local_path`
`local_path`	string	Use a local file as the load path. Repo metadata is still resolved from HF, so `repo_id` is still required.
`api_base`, `base_url`	string	Override HF API / blob hosts (mirrors / proxies)

`ggml-llm` (llama.cpp via `@fugood/llama.node`)

Loads a GGUF LLM. Runtime keys above can be overridden per-generator under [generators.model]; [generators.backend] only controls backend selection and resource planning.

[generators.backend]

Key	Type	Default	Notes
`variant`	string	auto	Force `cuda` / `vulkan` / `snapdragon` / `default`
`variant_preference`	string[]	`["cuda","vulkan","snapdragon","default"]`	Probe order when `variant` is unset
`gpu_memory_fraction`	number	`0.85`	Max GPU fraction the hardware guardrails may plan against
`cpu_memory_fraction`	number	`0.5`	Max RAM fraction for CPU-side buffers
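
The interaction between `variant` and `variant_preference` can be sketched as follows. `pickVariant` is a hypothetical helper, and `available` stands in for whatever the server learns when it probes which backend variants can load on the machine; the real probing logic is not shown in this README.

```javascript
// `available` is a hypothetical list of variants that can load on this machine.
function pickVariant(backend, available) {
  // A forced variant skips probing entirely.
  if (backend.variant) return backend.variant
  const order = backend.variant_preference ?? ['cuda', 'vulkan', 'snapdragon', 'default']
  // Otherwise take the first preferred variant that is available.
  return order.find((v) => available.includes(v)) ?? null
}
```

For example, with the default preference order and only `vulkan` and `default` available, this sketch selects `vulkan`.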

[generators.model] — in addition to the common keys above:

Key	Type	Notes
`n_ctx`	number	Context window. Auto-capped at the model's training context.
`n_gpu_layers`	number | `"auto"`	Layers offloaded to GPU (default `"auto"`)
`n_batch`	number	Prompt batch size (default `512`)
`n_ubatch`, `n_threads`, `n_parallel`, `n_cpu_moe`	number	Same semantics as the `[runtime]` defaults
`flash_attn_type`, `cache_type_k`, `cache_type_v`, `kv_unified`, `swa_full`, `ctx_shift`, `use_mmap`, `use_mlock`, `no_extra_bufts`, `cpu_mask`, `cpu_strict`, `devices`	various	Per-model overrides for the `[runtime]` defaults

Multimodal (mtmd) — auto-downloads the matching mmproj-*.gguf from the same repo and calls initMultimodal:

Key	Type	Notes
`enable_mtmd`	boolean	Default `false`
`mmproj_filename`	string	Pin a specific projector file
`mmproj_url`	string	Direct URL override
`mmproj_local_path`	string	Local projector (requires `allow_local_file = true`)
`mmproj_use_gpu`	boolean	`null` = auto (true when `n_gpu_layers > 0`)
`mmproj_image_min_tokens`	number	Min visual tokens (dynamic-resolution models; `-1` = unset)
`mmproj_image_max_tokens`	number	Max visual tokens (`-1` = unset)

Speculative decoding

Key	Type	Notes
`speculative`	string	Draft model identifier
`spec_type`	string	Strategy (backend-defined)
`spec_draft_n_max`	int	Max drafted tokens per step
`spec_draft_n_min`	int	Min drafted tokens
`spec_draft_p_min`	number	Min acceptance probability
`spec_draft_p_split`	number	Split threshold

Example

[[generators]]
type = "ggml-llm"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
gpu_memory_fraction = 0.95
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800
download = true

`ggml-stt` (whisper.cpp via `@fugood/whisper.node`)

Loads a Whisper GGML model for speech-to-text.

[generators.backend]

Key	Type	Default	Notes
`variant`	string	auto	`cuda` / `vulkan` / `default`
`variant_preference`	string[]	`["cuda","vulkan","default"]`	Probe order
`gpu_memory_fraction`	number	`0.85`
`cpu_memory_fraction`	number	`0.5`

[generators.model] — common keys plus:

Key	Type	Default	Notes
`repo_id`	string	`"BricksDisplay/whisper-ggml"`	Defaulted (unlike ggml-llm)
`preferred_quantizations`	string[]	`["q8_0", <no-quant>, "q5_1"]`	Default fallback chain
`use_gpu`	boolean	`true`	Force-disable GPU even when available
`use_flash_attn`	`"on"` / `"off"` / `"auto"` / boolean	`"auto"`	`"auto"` enables flash-attn when GPU is in use. `true`/`false` are accepted as shortcuts for `"on"`/`"off"`.
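
The accepted `use_flash_attn` values can be reduced to a single on/off decision roughly as below. Both function names are hypothetical, and treating an unset value as `"auto"` follows the default listed in the table.

```javascript
// Map boolean shortcuts to their string forms; unset falls back to "auto".
function normalizeFlashAttn(value = 'auto') {
  if (value === true) return 'on'
  if (value === false) return 'off'
  if (['on', 'off', 'auto'].includes(value)) return value
  throw new TypeError(`unsupported use_flash_attn value: ${value}`)
}

// "auto" enables flash attention only when the GPU is in use.
function flashAttnEnabled(value, gpuInUse) {
  const mode = normalizeFlashAttn(value)
  return mode === 'auto' ? gpuInUse : mode === 'on'
}
```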

Runtime extras — under [runtime] for ggml-stt only:

Key	Type	Notes
`max_threads`	number	Caps the whisper.cpp thread count

Example

[[generators]]
type = "ggml-stt"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-large-v3-turbo-q8_0.bin"
use_gpu = true
use_flash_attn = "on"
download = true

`mlx-llm` (Apple Silicon, Python `mlx-lm` / `mlx-vlm` bridge)

Loads an MLX-format model on Apple Silicon. On first use, the backend creates a virtualenv at {cache_dir}/mlx-env and installs mlx_lm_package, mlx_vlm_package, plus torch and torchvision (required by some VLM processors). If an existing venv already has mlx_vlm and torch importable, the install step is skipped. There is no [generators.backend] section.

[generators.model] — common repo_id / revision / download plus:

Key	Type	Default	Notes
`adapter_path`	string	—	Local LoRA adapter directory
`vlm`	`"auto"` / boolean	`"auto"`	Force VLM (`true`) vs text-only (`false`); `"auto"` infers from the repo
`tokenizer_config`	table	—	Forwarded to `mlx_lm.load(..., tokenizer_config=...)`
`model_config`	table	—	Forwarded to `mlx_lm.load(..., model_config=...)`

quantization, filename, and preferred_quantizations are not used — the MLX repo itself determines the quantization.

Runtime extras — under [runtime] for mlx-llm:

Key	Type	Default	Notes
`mlx_env_dir`	string	`{cache_dir}/mlx-env`	Location of the auto-managed Python venv
`mlx_lm_package`	string	`"mlx-lm==0.31.1"`	pip spec used when provisioning the venv
`mlx_vlm_package`	string	`"mlx-vlm==0.4.0"`	pip spec used when provisioning the venv
`session_cache.*`	table	enabled, `5GB`, 100 entries	Separate cache from ggml-llm (lives in `{cache_dir}/mlx-session-cache`)

Example

[[generators]]
type = "mlx-llm"
[generators.model]
repo_id = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
vlm = true
download = true

Programmatic Usage

import { startServer } from '@fugood/buttress-server'

startServer({
  port: 3000,
  defaultConfig: {
    runtime: {
      cache_dir: './.buttress-cache'
    },
    generators: [
      {
        type: 'ggml-llm',
        model: {
          repo_id: 'ggml-org/gemma-3-270m-qat-GGUF',
          quantization: 'mxfp4',
        }
      }
    ]
  }
})
  .then(({ port }) => {
    console.log(`Server running on port ${port}`)
  })
  .catch(console.error)

Environment Variable Priority

Environment variables can be set in the [env] section of the TOML config. These values will only be applied if the environment variable is not already set in the system. This allows:

Default values in config file
System environment variables to override config values
Command-line exports to have highest priority

Example:

# Config has: [env] HUGGINGFACE_TOKEN = "default_token"

# This will use the system env variable (highest priority)
HUGGINGFACE_TOKEN=my_token npx bricks-buttress

# This will use the config value
npx bricks-buttress
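
The "only if not already set" rule amounts to a guarded copy. The sketch below is one way to express it; `applyEnvDefaults` is a hypothetical helper, and it assumes that a variable set to an empty string still counts as set.

```javascript
// Copy [env] entries into `env` only where the variable is not already set.
// Assumption: an empty string counts as "set" and is left untouched.
function applyEnvDefaults(envTable, env = process.env) {
  for (const [name, value] of Object.entries(envTable)) {
    if (env[name] === undefined) env[name] = String(value)
  }
  return env
}
```

Passing a plain object as `env` makes the behavior easy to check without touching the real process environment.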

Port Priority

Port can be configured via multiple sources (highest priority first):

Command-line flag: --port 3000
Config file: [server] port = 2080
Default: 2080
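
Expressed as code, the priority order is a simple fallback chain. `resolvePort` is a hypothetical helper used only to illustrate the order listed above.

```javascript
// Highest priority first: --port flag, then [server] port, then the default.
function resolvePort({ cliPort, configPort } = {}) {
  return cliPort ?? configPort ?? 2080
}
```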

CLI Reference

bricks-buttress v2.23.0-beta.22

Buttress server for remote inference with GGML backends.

Usage:
  bricks-buttress [options]

Options:
  -h, --help                    Show this help message
  -v, --version                 Show version number
  -p, --port <port>             Port to listen on (default: 2080)
  -c, --config <path|toml>      Path to TOML config file or inline TOML string

Testing Options:
  --test-caps <backend>         Test model capabilities (ggml-llm or ggml-stt)
  --test-caps-model-id <id>     Model ID to test (used with --test-caps)
  --test-models <ids>           Comma-separated list of model IDs to test
  --test-models-default         Test default set of models

  Note: --test-models and --test-models-default output a markdown report
        file (e.g., ggml-llm-model-capabilities-YYYY-MM-DD.md)

Environment Variables:
  NODE_ENV                      Set to 'development' for dev mode

Examples:
  bricks-buttress
  bricks-buttress --port 3000
  bricks-buttress --config ./config.toml
  bricks-buttress --test-caps ggml-llm --test-models-default
  bricks-buttress --test-caps ggml-stt --test-caps-model-id BricksDisplay/whisper-ggml:ggml-small.bin

Compatibility Endpoints (Experimental)

The server can expose OpenAI- and Anthropic-compatible HTTP endpoints in addition to the native RPC. Each endpoint is opt-in via the TOML config:

[openai_compat]
enabled = true
# cors_allowed_origins = "*"          # Or a list of origins; defaults to disabled

[anthropic_messages]
enabled = true
# cors_allowed_origins = ["http://localhost:3000"]

Endpoint	Config flag
`/oai-compat/v1/*`	`[openai_compat] enabled = true`
`/anthropic-messages`	`[anthropic_messages] enabled = true`
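
A client request against the OpenAI-compatible prefix might be assembled as shown below. This is a sketch under assumptions: the README only documents the `/oai-compat/v1/*` prefix, so the `chat/completions` route, the request body shape (borrowed from the OpenAI Chat Completions convention), and the `buildChatRequest` helper are all illustrative, and any auth headers a bound server may require are left out.

```javascript
// Build (but do not send) a chat request for the OpenAI-compatible prefix.
// Assumption: the server serves `chat/completions` under /oai-compat/v1/.
function buildChatRequest(baseUrl, model, userText) {
  return {
    url: new URL('/oai-compat/v1/chat/completions', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: userText }],
      }),
    },
  }
}

// Usage (needs a running server with [openai_compat] enabled = true):
//   const { url, init } = buildChatRequest('http://localhost:2080', 'my-model', 'Hello')
//   const res = await fetch(url, init)
```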

Session State Cache

The server supports session state caching for ggml-llm generators, which saves KV cache state to disk after completions. This enables:

Prompt reuse: Same or similar prompts can reuse cached state, skipping prompt processing
Multi-turn conversations: Conversation history state is preserved across requests

Configuration

[runtime.session_cache]
enabled = true                  # Enable/disable session caching (default: true)
max_size_bytes = "10GB"         # Supports string (e.g., "10GB", "500MB") or number (default: 10GB)
max_entries = 1000              # Max number of cached entries (default: 1000)
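
Size values such as `max_size_bytes` accept either a number or a string with a unit. The sketch below shows one plausible way to parse them; `parseSize` is a hypothetical helper, and it assumes binary multiples (1 MB = 1024 × 1024 bytes), which may differ from how the server interprets the units.

```javascript
// Assumption: binary multiples. The server may use decimal units instead.
const SIZE_UNITS = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3 }

function parseSize(value) {
  // Raw numbers are already bytes.
  if (typeof value === 'number') return value
  const m = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)$/i.exec(String(value).trim())
  if (!m) throw new Error(`unrecognized size: ${value}`)
  return Math.round(Number(m[1]) * SIZE_UNITS[m[2].toUpperCase()])
}
```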

How it works

After a successful completion, the KV cache state is saved to disk
On new completions, the server checks if any cached state matches the prompt prefix
If a match is found, the cached state is loaded, skipping redundant prompt processing
LRU eviction removes oldest entries when limits are exceeded
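
The steps above combine prefix matching with LRU eviction. The following in-memory sketch illustrates that combination only: `PrefixCache` is a hypothetical class, it stores token arrays rather than KV state on disk, it enforces an entry limit but no byte limit, and it matches only when a whole cached sequence is a prefix of the new prompt.

```javascript
class PrefixCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries
    // Map iteration order doubles as recency order (oldest first).
    this.entries = new Map()
  }

  // Store a token sequence, evicting the least recently used entries if needed.
  save(tokens) {
    const key = tokens.join(',')
    this.entries.delete(key) // re-inserting moves the key to the newest position
    this.entries.set(key, tokens)
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value
      this.entries.delete(oldest)
    }
  }

  // Return the longest cached sequence that is a prefix of `prompt`, or null.
  match(prompt) {
    let best = null
    for (const tokens of this.entries.values()) {
      const isPrefix =
        tokens.length <= prompt.length && tokens.every((t, i) => t === prompt[i])
      if (isPrefix && (best === null || tokens.length > best.length)) best = tokens
    }
    if (best) this.save(best) // a hit refreshes the entry's recency
    return best
  }
}
```

On a hit, only the tokens after the matched prefix would still need prompt processing, which is where the saving comes from.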

Cache location

Cache files are stored in {cache_dir}/.session-state-cache/:

cache-map.json - Index of cached entries
states/ - Binary state files
temp/ - Temporary files (auto-cleaned after 1 hour)

Tips

macOS: Use sudo sysctl iogpu.wired_limit_mb=<number> to raise the GPU memory limit. By default, the GPU can use about 70% of system memory. For example, on a machine with 128GB of memory, sudo sysctl iogpu.wired_limit_mb=137438 raises the limit to 128GB. Run sudo sysctl iogpu.wired_limit_mb=0 to return to the default.

Keywords

BRICKS

buttress

server

Package last updated on 01 Jun 2026
