
@fugood/buttress-server
A high-performance RPC server for managing GGML LLM generators with configurable defaults and runtime management.
npm install -g @fugood/buttress-server
# Start with config file
npx bricks-buttress --config ./config.toml
# Start without config (uses env vars and defaults)
npx bricks-buttress
By default, a buttress-server runs in public mode: any client on the LAN can connect, and no authentication is required. To restrict access to a single BRICKS workspace and enable workspace-scoped JWT auth, bind the server with the bricks buttress CLI commands. Once bound, the server only accepts WebSocket / file-transfer requests carrying a valid access token signed by that workspace's issuer.
The bricks CLI is the tool that performs the binding and writes the local state file. Install it first (see the bricks-cli docs), then run bricks auth login with the workspace owner's account before running the commands below.
# Pair the local machine's buttress-server with the workspace of the current bricks-cli profile
bricks buttress bind
# Override the auto-detected server id, give it a friendly name, or write to a custom state dir
bricks buttress bind --server-id buttress-mac-studio --name "Studio LLM" --state-dir /etc/buttress
# For headless/remote setups: emit state.json to stdout instead of writing to disk
bricks buttress bind --print > /etc/buttress/state.json
The state file (~/.bricks-cli/buttress/state.json by default, or $BRICKS_BUTTRESS_STATE_DIR) stores:
- workspace.id / workspace.name: which workspace this server belongs to
- workspace.serverId: the server's stable id (defaults to buttress-<machineId>)
- workspace.issuerPublicKey + workspace.kid: Ed25519 public key (SPKI) and key id used to verify access tokens

Restart bricks-buttress after binding for the change to take effect; the state file is read once at startup.
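As a rough illustration of the state file described above, the sketch below resolves where state.json would live and summarizes a parsed state object. The helper names resolveStatePath and describeBinding are hypothetical, and the nested JSON shape (a workspace object with id, name, and serverId) is an assumption inferred from the dotted field names; the real file may differ.

```javascript
// Hypothetical helper: locate state.json, assuming BRICKS_BUTTRESS_STATE_DIR
// names a directory and the default is ~/.bricks-cli/buttress.
function resolveStatePath(env, homeDir) {
  const dir = env.BRICKS_BUTTRESS_STATE_DIR || `${homeDir}/.bricks-cli/buttress`
  // Drop trailing slashes so the joined path has exactly one separator.
  return `${dir.replace(/\/+$/, '')}/state.json`
}

// Hypothetical helper: summarize a parsed state object.
// Assumes the shape { workspace: { id, name, serverId } }.
function describeBinding(state) {
  const ws = state && state.workspace
  if (!ws || !ws.id) return 'public (unbound)'
  return `bound to ${ws.name || ws.id} as ${ws.serverId}`
}
```

For example, resolveStatePath({ BRICKS_BUTTRESS_STATE_DIR: '/etc/buttress' }, '/home/me') yields /etc/buttress/state.json, matching the --state-dir example above.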
# Show local state.json + the workspace-side bound list
bricks buttress status
# Same, JSON-formatted
bricks buttress status --json
# UDP scan + HTTP /buttress/info verification (3s timeout by default)
bricks buttress scan
# UDP only (skip the /buttress/info round-trip)
bricks buttress scan --udp-only
# Machine-readable
bricks buttress scan --json
scan lists every buttress-server visible on the LAN, including unbound (public) ones, with their version, auth state (open vs JWT required + kid), bound workspace, and per-generator hardware caps (score, GPU, usable memory). Servers whose workspace matches your current bricks-cli profile are highlighted; this is purely a discovery command and does not mint any tokens.
# Remove the binding from the workspace and delete the local state.json
bricks buttress unbind
# Keep the local state file (useful if you only want to revoke server-side)
bricks buttress unbind --keep-local
After unbinding, restart the server to return it to public mode.
For headless callers (CI, ctor agents) that already hold a workspace token, mint a long-lived buttress access token instead of relying on a per-launcher session token:
# Default 30-day TTL
bricks buttress issue-token
# Custom TTL (seconds), JSON output for scripting
bricks buttress issue-token --ttl 3600 --json
The token carries the claims { k: 'ba', w_id, st: 'ws', sid, jti, exp }, and any buttress-server bound to the same workspace will accept it.
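To inspect such a token while scripting, its payload can be decoded locally. The following is a minimal sketch, assuming the token uses the standard three-part JWT layout and that exp is a Unix timestamp in seconds (as in the JWT specification). It does not verify the Ed25519 signature, and the function names are hypothetical.

```javascript
// Decode the payload of a JWT-shaped token WITHOUT verifying its signature.
// Useful for debugging only; never treat the result as trusted.
function decodeClaims(token) {
  const parts = token.split('.')
  if (parts.length !== 3) throw new Error('expected header.payload.signature')
  const json = Buffer.from(parts[1], 'base64url').toString('utf8')
  return JSON.parse(json)
}

// Check that the claims match the documented shape and have not expired.
// Only the literal values given in the text (k = 'ba', st = 'ws') are tested.
function looksLikeButtressToken(claims, nowSeconds = Math.floor(Date.now() / 1000)) {
  return (
    claims.k === 'ba' &&
    claims.st === 'ws' &&
    claims.w_id !== undefined &&
    typeof claims.exp === 'number' &&
    claims.exp > nowSeconds
  )
}
```

A script could pipe the output of bricks buttress issue-token into decodeClaims to confirm the expiry before storing the token as a CI secret.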
Configuration is loaded from a TOML file passed via --config / -c. Every top-level table is optional — missing sections fall back to defaults. See config/sample.toml for an end-to-end example.
| Section | Purpose |
|---|---|
| [env] | Environment variables exported into the process only if not already set |
| [server] | HTTP/RPC listener (port, log level, body limits) |
| [runtime] | Global defaults shared by every generator (most [generators.model] keys may live here too) |
| [runtime.session_cache] | KV-cache reuse store (see Session State Cache) |
| [autodiscover] | LAN UDP / HTTP / mDNS discovery toggles |
| [openai_compat] | Enable /oai-compat/v1/* (see Compatibility Endpoints) |
| [anthropic_messages] | Enable /anthropic-messages (see Compatibility Endpoints) |
| [[generators]] | Array of generator instances, one entry per loaded model |
[env]
[env]
HUGGINGFACE_TOKEN = "hf_xxx" # ggml backends read this; HF_TOKEN is not picked up automatically
CUDA_VISIBLE_DEVICES = "0"
Values here are exported only when the variable isn't already set in the process — see Environment Variable Priority. For HuggingFace auth across all backends, [runtime] huggingface_token = "hf_xxx" works regardless of variable name.
[server]
| Key | Type | Default |
|---|---|---|
| id | string | buttress-<machineId>; stable id used for autodiscover / binding |
| name | string | Buttress Server (<short id>); display name |
| port | number | 2080 (overridden by --port) |
| log_level | "debug" / "info" / "warn" / "error" | unset |
| max_body_size | string or number | "50MB"; e.g. "100MB", "1GB", or raw bytes |
| session_timeout | string or number | 60000 ms; accepts ms numbers or duration strings ("30s") |
| temp_file_dir | string | $TMPDIR/.buttress |
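The max_body_size and session_timeout keys accept either a raw number or a unit string. The sketch below shows one way such values could be normalized. It is not the server's actual parser: the helper names are hypothetical, the supported units are a guess, and binary multiples (1 MB = 1024 * 1024 bytes) are an assumption.

```javascript
// Assumed unit tables; the server's real parser may accept other units
// or use decimal multiples instead.
const SIZE_UNITS = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3 }
const DURATION_UNITS = { ms: 1, s: 1000, m: 60000, h: 3600000 }

// Convert "50MB" (or a raw byte count) into bytes.
function parseSize(value) {
  if (typeof value === 'number') return value
  const match = /^(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)$/i.exec(String(value).trim())
  if (!match) throw new Error(`Unrecognized size: ${value}`)
  return Math.round(Number(match[1]) * SIZE_UNITS[match[2].toUpperCase()])
}

// Convert "30s" (or a raw millisecond count) into milliseconds.
function parseDuration(value) {
  if (typeof value === 'number') return value
  const match = /^(\d+(?:\.\d+)?)\s*(ms|s|m|h)$/i.exec(String(value).trim())
  if (!match) throw new Error(`Unrecognized duration: ${value}`)
  return Math.round(Number(match[1]) * DURATION_UNITS[match[2].toLowerCase()])
}
```

Under these assumptions, parseDuration('30s') gives 30000 and parseDuration(60000) passes the number through unchanged.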
[runtime] (global generator defaults)
Most ggml-llm [generators.model] keys can also live in [runtime] as defaults. Per-generator values win; otherwise the runtime default applies.
| Key | Type | Notes |
|---|---|---|
| cache_dir | string | Model + metadata cache root (default ~/.buttress/models) |
| huggingface_token | string | Falls back to $HUGGINGFACE_TOKEN |
| http_headers | table | Extra headers attached to HF / HTTP downloads |
| context_release_delay_ms | number | Idle time before unloading a context (default 10000; 0 = immediate) |
| prefer_variants | string[] | Override variant probe order (ggml backends) |
| n_threads | number | CPU thread count |
| n_ctx | number | Context window (per-model value wins; auto-capped at training context) |
| n_gpu_layers | number or "auto" | Layers offloaded to GPU (default "auto") |
| n_batch / n_ubatch | number | Prompt batch / micro-batch size. Note: n_batch has a model-level default of 512 that shadows the runtime value unless [generators.model] n_batch is set explicitly. |
| n_parallel | number | Parallel sequences (default 4) |
| n_cpu_moe | number | MoE expert layers offloaded to CPU |
| flash_attn_type | "on" / "off" / "auto" | When a GPU backend is selected, defaults to "auto"; on CPU, defaults to "off". Explicit "on" / "off" / "auto" overrides. |
| cache_type_k, cache_type_v | string | KV-cache dtype (f16, f32, q8_0, q4_0, …) |
| kv_unified | boolean | Use a unified KV cache across sequences |
| swa_full | boolean | Materialize full attention even for sliding-window layers |
| ctx_shift | boolean | Allow llama.cpp's rolling context shift |
| use_mmap, use_mlock | boolean | Memory-mapping / locking |
| no_extra_bufts | boolean | Disable extra compute buffer types |
| cpu_mask, cpu_strict | string / boolean | CPU affinity (advanced) |
| devices | string[] | Restrict to specific GGML devices |
| Speculative keys | various | speculative, spec_type, spec_draft_n_max, spec_draft_n_min, spec_draft_p_min, spec_draft_p_split |
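The precedence between [runtime] and [generators.model], including the n_batch caveat, can be sketched as a small lookup. This is an illustration of the documented behavior, not the server's code; resolveModelKey is a hypothetical name, and only the n_batch model-level default of 512 is taken from the text.

```javascript
// Model-level defaults that shadow [runtime]; only n_batch is documented.
const MODEL_LEVEL_DEFAULTS = { n_batch: 512 }

// Resolve one key: the per-generator value wins, then any model-level
// default, then the [runtime] value (which may be undefined).
function resolveModelKey(key, runtime, generatorModel) {
  if (generatorModel[key] !== undefined) return generatorModel[key]
  if (MODEL_LEVEL_DEFAULTS[key] !== undefined) return MODEL_LEVEL_DEFAULTS[key]
  return runtime[key]
}
```

With runtime = { n_batch: 2048, n_ctx: 8192 } and an empty generator model, n_ctx resolves to 8192 but n_batch resolves to 512. This is why n_batch needs to be set under [generators.model] to take effect.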
[autodiscover]
Set autodiscover = true for defaults, false (or omit) to disable, or a table for fine control:
[autodiscover]
udp.port = 8089
udp.announcements = { enabled = true, interval = 5000 }
udp.requests = { enabled = true, responseDelay = 100 }
http.enabled = true
http.path = "/buttress/info"
http.cors = true
# mdns.enabled = false # Bonjour/Avahi advertisement (optional)
[[generators]]
Every generator entry has a type, an optional [generators.backend] table, and a [generators.model] table:
[[generators]]
type = "ggml-llm" # or "ggml-stt" / "mlx-llm"
[generators.backend]
# (see per-type sections below)
[generators.model]
repo_id = "..."
# (see per-type sections below)
[generators.model] keys
Shared by all generator types:
| Key | Type | Notes |
|---|---|---|
| repo_id (required) | string | HuggingFace repo (org/repo) |
| revision | string | Default "main" |
| download | boolean | Pre-download at server startup (default false) |
Additional keys honored by ggml-llm and ggml-stt (mlx-llm gets quantization from the repo itself and does not use these):
| Key | Type | Notes |
|---|---|---|
| filename | string | Pin a specific artifact in the repo |
| url | string | Direct download URL (skips manifest lookup) |
| quantization | string | Preferred quant tag, e.g. q4_0, q8_0, mxfp4 |
| preferred_quantizations | string[] | Ordered fallback list when quantization doesn't match (alias: quantizations) |
| allow_local_file | boolean | Required to use local_path / mmproj_local_path |
| local_path | string | Use a local file as the load path. Repo metadata is still resolved from HF, so repo_id is still required. |
| api_base, base_url | string | Override HF API / blob hosts (mirrors / proxies) |
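The interplay of quantization and preferred_quantizations can be pictured as an ordered search over the files in a repo. The sketch below is a guess at the general idea, not the server's resolver: pickArtifact is a hypothetical name, and matching a tag by case-insensitive substring of the filename is an assumption made for illustration.

```javascript
// Hypothetical selection: try `quantization` first, then each entry of
// `preferred_quantizations` in order. Returns undefined when nothing matches.
function pickArtifact(filenames, { quantization, preferred_quantizations = [] }) {
  const candidates = [quantization, ...preferred_quantizations].filter(Boolean)
  for (const tag of candidates) {
    const wanted = tag.toLowerCase()
    const hit = filenames.find((name) => name.toLowerCase().includes(wanted))
    if (hit) return hit
  }
  return undefined
}
```

For a repo that only ships q8_0 and q4_0 files, a config with quantization = "mxfp4" and preferred_quantizations = ["q4_0", "q8_0"] would fall through to the q4_0 file under this scheme.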
ggml-llm (llama.cpp via @fugood/llama.node)
Loads a GGUF LLM. Runtime keys above can be overridden per-generator under [generators.model]; [generators.backend] only controls backend selection and resource planning.
[generators.backend]
| Key | Type | Default | Notes |
|---|---|---|---|
| variant | string | auto | Force cuda / vulkan / snapdragon / default |
| variant_preference | string[] | ["cuda","vulkan","snapdragon","default"] | Probe order when variant is unset |
| gpu_memory_fraction | number | 0.85 | Max GPU fraction the hardware guardrails may plan against |
| cpu_memory_fraction | number | 0.5 | Max RAM fraction for CPU-side buffers |
[generators.model] — in addition to the common keys above:
| Key | Type | Notes |
|---|---|---|
| n_ctx | number | Context window. Auto-capped at the model's training context. |
| n_gpu_layers | number or "auto" | Layers offloaded to GPU (default "auto") |
| n_batch | number | Prompt batch size (default 512) |
| n_ubatch, n_threads, n_parallel, n_cpu_moe | number | Same semantics as the [runtime] defaults |
| flash_attn_type, cache_type_k, cache_type_v, kv_unified, swa_full, ctx_shift, use_mmap, use_mlock, no_extra_bufts, cpu_mask, cpu_strict, devices | various | Per-model overrides for the [runtime] defaults |
Multimodal (mtmd) — auto-downloads the matching mmproj-*.gguf from the same repo and calls initMultimodal:
| Key | Type | Notes |
|---|---|---|
| enable_mtmd | boolean | Default false |
| mmproj_filename | string | Pin a specific projector file |
| mmproj_url | string | Direct URL override |
| mmproj_local_path | string | Local projector (requires allow_local_file = true) |
| mmproj_use_gpu | boolean | null = auto (true when n_gpu_layers > 0) |
| mmproj_image_min_tokens | number | Min visual tokens (dynamic-resolution models; -1 = unset) |
| mmproj_image_max_tokens | number | Max visual tokens (-1 = unset) |
Speculative decoding
| Key | Type | Notes |
|---|---|---|
| speculative | string | Draft model identifier |
| spec_type | string | Strategy (backend-defined) |
| spec_draft_n_max | int | Max drafted tokens per step |
| spec_draft_n_min | int | Min drafted tokens |
| spec_draft_p_min | number | Min acceptance probability |
| spec_draft_p_split | number | Split threshold |
Example
[[generators]]
type = "ggml-llm"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
gpu_memory_fraction = 0.95
[generators.model]
repo_id = "ggml-org/gpt-oss-20b-GGUF"
quantization = "mxfp4"
n_ctx = 12800
download = true
ggml-stt (whisper.cpp via @fugood/whisper.node)
Loads a Whisper GGML model for speech-to-text.
[generators.backend]
| Key | Type | Default | Notes |
|---|---|---|---|
| variant | string | auto | cuda / vulkan / default |
| variant_preference | string[] | ["cuda","vulkan","default"] | Probe order |
| gpu_memory_fraction | number | 0.85 | |
| cpu_memory_fraction | number | 0.5 | |
[generators.model] — common keys plus:
| Key | Type | Default | Notes |
|---|---|---|---|
| repo_id | string | "BricksDisplay/whisper-ggml" | Has a default (unlike ggml-llm, where repo_id is required) |
| preferred_quantizations | string[] | ["q8_0", <no-quant>, "q5_1"] | Default fallback chain |
| use_gpu | boolean | true | Set to false to force-disable the GPU even when one is available |
| use_flash_attn | "on" / "off" / "auto" / boolean | "auto" | "auto" enables flash-attn when GPU is in use. true/false are accepted as shortcuts for "on"/"off". |
Runtime extras — under [runtime] for ggml-stt only:
| Key | Type | Notes |
|---|---|---|
| max_threads | number | Caps the whisper.cpp thread count |
Example
[[generators]]
type = "ggml-stt"
[generators.backend]
variant_preference = ["cuda", "vulkan", "default"]
[generators.model]
repo_id = "BricksDisplay/whisper-ggml"
filename = "ggml-large-v3-turbo-q8_0.bin"
use_gpu = true
use_flash_attn = "on"
download = true
mlx-llm (Apple Silicon, Python mlx-lm / mlx-vlm bridge)
Loads an MLX-format model on Apple Silicon. On first use, the backend creates a virtualenv at {cache_dir}/mlx-env and installs the packages named by mlx_lm_package and mlx_vlm_package, plus torch and torchvision (required by some VLM processors). If an existing venv already has mlx_vlm and torch importable, the install step is skipped. There is no [generators.backend] section.
[generators.model] — common repo_id / revision / download plus:
| Key | Type | Default | Notes |
|---|---|---|---|
| adapter_path | string | (none) | Local LoRA adapter directory |
| vlm | "auto" / boolean | "auto" | Force VLM (true) vs text-only (false); "auto" infers from the repo |
| tokenizer_config | table | (none) | Forwarded to mlx_lm.load(..., tokenizer_config=...) |
| model_config | table | (none) | Forwarded to mlx_lm.load(..., model_config=...) |
quantization, filename, and preferred_quantizations are not used — the MLX repo itself determines the quantization.
Runtime extras — under [runtime] for mlx-llm:
| Key | Type | Default | Notes |
|---|---|---|---|
| mlx_env_dir | string | {cache_dir}/mlx-env | Location of the auto-managed Python venv |
| mlx_lm_package | string | "mlx-lm==0.31.1" | pip spec used when provisioning the venv |
| mlx_vlm_package | string | "mlx-vlm==0.4.0" | pip spec used when provisioning the venv |
| session_cache.* | table | enabled, 5GB, 100 entries | Separate cache from ggml-llm (lives in {cache_dir}/mlx-session-cache) |
Example
[[generators]]
type = "mlx-llm"
[generators.model]
repo_id = "mlx-community/Qwen2.5-VL-3B-Instruct-4bit"
vlm = true
download = true
import { startServer } from '@fugood/buttress-server'

startServer({
  port: 3000,
  defaultConfig: {
    runtime: {
      cache_dir: './.buttress-cache'
    },
    generators: [
      {
        type: 'ggml-llm',
        model: {
          repo_id: 'ggml-org/gemma-3-270m-qat-GGUF',
          quantization: 'mxfp4',
        }
      }
    ]
  }
})
  .then(({ port }) => {
    console.log(`Server running on port ${port}`)
  })
  .catch(console.error)
Environment variables can be set in the [env] section of the TOML config. These values are applied only if the environment variable is not already set in the system, so a variable set in the shell always takes priority over the config value.
Example:
# Config has: [env] HF_TOKEN = "default_token"
# This will use the system env variable (highest priority)
HF_TOKEN=my_token npx bricks-buttress
# This will use the config value
npx bricks-buttress
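The "only if not already set" rule amounts to a guarded assignment. The following sketch illustrates it; applyEnvDefaults is a hypothetical name, and treating only undefined (not an empty string) as "not set" is an assumption.

```javascript
// Copy [env] values into the target environment, skipping any variable
// that already has a value. Returns the names that were actually applied.
function applyEnvDefaults(configEnv, targetEnv = process.env) {
  const applied = []
  for (const [name, value] of Object.entries(configEnv)) {
    if (targetEnv[name] === undefined) {
      targetEnv[name] = String(value)
      applied.push(name)
    }
  }
  return applied
}
```

Calling applyEnvDefaults({ HF_TOKEN: 'default_token' }, { HF_TOKEN: 'my_token' }) leaves my_token in place and returns an empty list, mirroring the first shell example above.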
Port can be configured via multiple sources (highest priority first):
1. CLI flag: --port 3000
2. Config file: [server] port = 2080
3. Default: 2080

bricks-buttress v2.23.0-beta.22
Buttress server for remote inference with GGML backends.
Usage:
bricks-buttress [options]
Options:
-h, --help Show this help message
-v, --version Show version number
-p, --port <port> Port to listen on (default: 2080)
-c, --config <path|toml> Path to TOML config file or inline TOML string
Testing Options:
--test-caps <backend> Test model capabilities (ggml-llm or ggml-stt)
--test-caps-model-id <id> Model ID to test (used with --test-caps)
--test-models <ids> Comma-separated list of model IDs to test
--test-models-default Test default set of models
Note: --test-models and --test-models-default output a markdown report
file (e.g., ggml-llm-model-capabilities-YYYY-MM-DD.md)
Environment Variables:
NODE_ENV Set to 'development' for dev mode
Examples:
bricks-buttress
bricks-buttress --port 3000
bricks-buttress --config ./config.toml
bricks-buttress --test-caps ggml-llm --test-models-default
bricks-buttress --test-caps ggml-stt --test-caps-model-id BricksDisplay/whisper-ggml:ggml-small.bin
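The port precedence documented earlier (the --port flag, then [server] port in the config, then the default of 2080) can be sketched as a short fallback chain. resolvePort and its argument shape are hypothetical and are not part of the package's API.

```javascript
const DEFAULT_PORT = 2080

// Resolve the listening port: CLI flag first, then [server] port from the
// parsed config, then the built-in default.
function resolvePort({ cliPort, config = {} } = {}) {
  if (cliPort !== undefined) return Number(cliPort)
  if (config.server && config.server.port !== undefined) return Number(config.server.port)
  return DEFAULT_PORT
}
```

Under this sketch, bricks-buttress --port 3000 with a config that sets port = 2080 corresponds to resolvePort({ cliPort: '3000', config: { server: { port: 2080 } } }), which returns 3000.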
The server can expose OpenAI- and Anthropic-compatible HTTP endpoints in addition to the native RPC. Each endpoint is opt-in via the TOML config:
[openai_compat]
enabled = true
# cors_allowed_origins = "*" # Or a list of origins; defaults to disabled
[anthropic_messages]
enabled = true
# cors_allowed_origins = ["http://localhost:3000"]
| Endpoint | Config flag |
|---|---|
| /oai-compat/v1/* | [openai_compat] enabled = true |
| /anthropic-messages | [anthropic_messages] enabled = true |
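Once [openai_compat] is enabled, a client can target the /oai-compat/v1 prefix. The sketch below only builds the request and does not send it. It assumes that the wildcard covers the usual chat/completions route and that the request body follows the OpenAI chat format; neither is confirmed here, buildChatRequest is a hypothetical helper, and the model id is a placeholder.

```javascript
// Build (but do not send) a chat request for the OpenAI-compatible prefix.
// The chat/completions route and the body shape are assumptions.
function buildChatRequest(baseUrl, model, userText) {
  return {
    url: `${baseUrl.replace(/\/+$/, '')}/oai-compat/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: userText }],
      }),
    },
  }
}
```

The result can be passed to fetch, for example: const { url, init } = buildChatRequest('http://localhost:2080', 'my-model', 'Hello'), followed by fetch(url, init). A bound server may additionally require an access token; how that token is attached to HTTP requests is not covered here.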
The server supports session state caching for ggml-llm generators: KV cache state is saved to disk after completions so that it can be reused by later requests.
[runtime.session_cache]
enabled = true # Enable/disable session caching (default: true)
max_size_bytes = "10GB" # Supports string (e.g., "10GB", "500MB") or number (default: 10GB)
max_entries = 1000 # Max number of cached entries (default: 1000)
Cache files are stored in {cache_dir}/.session-state-cache/:
- cache-map.json: index of cached entries
- states/: binary state files
- temp/: temporary files (auto-cleaned after 1 hour)

On macOS, use sudo sysctl iogpu.wired_limit_mb=<number> to increase GPU memory allocation. By default, about 70% of system memory is available to the GPU. For example, if the hardware has 128GB of memory, you can use sudo sysctl iogpu.wired_limit_mb=137438 to raise the limit to 128GB. Run sudo sysctl iogpu.wired_limit_mb=0 to return to the default.
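The 137438 figure in the example is consistent with converting 128 GiB to bytes and dividing by 10^6. The helper below reproduces that arithmetic; it is a sketch with a hypothetical name. Whether the kernel interprets the unit as 10^6 or 2^20 bytes is not stated here, so treat the result as an approximation.

```javascript
// Convert a GiB amount into the decimal-megabyte figure used in the example:
// bytes = gib * 1024^3, then divide by 10^6 and round down.
function wiredLimitMb(gib) {
  return Math.floor((gib * 1024 ** 3) / 1e6)
}

console.log(wiredLimitMb(128)) // 137438
```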