
termollama
A Linux command-line utility for Ollama, a user-friendly llama.cpp wrapper. It displays info about GPU VRAM usage and loaded models, and adds the features described below.
The nvidia-smi command must be available on the system in order to display GPU info.
Install:
npm i -g termollama
# to update:
npm i -g termollama@latest
Or just run it with npx:
npx termollama
The olm command is now available.
Run the olm command without any argument to display memory stats. Output:

Note the action bar at the bottom with quick-action shortcuts: it stays on screen for 5 seconds and then disappears. Available actions:
m → Show a memory chart
l → Load models
u → Unload models

To monitor the activity in real time:
olm -w
-m, --max-model-bars <number>: Set the maximum number of model bars to display. Defaults to OLLAMA_MAX_LOADED_MODELS if set, otherwise 3 × the number of GPUs.

TERMOLLAMA_TEMPS: Set temperature thresholds as comma-separated values (low, mid, high) for color-coding.

export TERMOLLAMA_TEMPS="30,55,75"

TERMOLLAMA_POWER: Set the power usage threshold percentage for color-coding.

export TERMOLLAMA_POWER="20"
To list all the available models:
olm models
# or
olm m
To search for a model with filters:
olm m stral
+--------------------------------------------------------+--------+---------+----------+
| Model | Params | Quant | Size |
+--------------------------------------------------------+--------+---------+----------+
| devstral:24b-small-2505-q8_0 | 23.6B | Q8_0 | 23.3 GiB |
+--------------------------------------------------------+--------+---------+----------+
| devstral32k:latest | 23.6B | Q8_0 | 23.3 GiB |
+--------------------------------------------------------+--------+---------+----------+
| hf.co/unsloth/Devstral-Small-2507-GGUF:Q8_K_XL | 23.6B | unknown | 27.8 GiB |
+--------------------------------------------------------+--------+---------+----------+
| mistral-nemo:latest | 12.2B | Q4_0 | 6.6 GiB |
+--------------------------------------------------------+--------+---------+----------+
| mistral-small:latest | 23.6B | Q4_K_M | 13.3 GiB |
+--------------------------------------------------------+--------+---------+----------+
| mistral-small3.1:24b | 24.0B | Q4_K_M | 14.4 GiB |
+--------------------------------------------------------+--------+---------+----------+
| mistral-small3.2:latest | 24.0B | Q4_K_M | 14.1 GiB |
+--------------------------------------------------------+--------+---------+----------+
List all the models and select some to load:
olm load
# or
olm l
You can specify optional parameters when loading:
--ctx or -c: Set the context window (e.g., 2k, 4k, 8192).
--keep-alive or -k: Set the keep-alive timeout (e.g., 5m, 2h).
--ngl or -n: Number of GPU layers to load.

Basic load with search:
olm l qw
This searches for models containing "qw" and lets you select from the filtered list. Example output:

Load with context and keep alive:
olm load --ctx 8k --keep-alive 1h mistral
Searches for "mistral" models and loads the selection with an 8k context window and a 1-hour keep-alive time.
Specify GPU layers:
olm l --ngl 40 qwen3:30b
Loads the qwen3:30b model with 40 GPU layers; the remaining layers go to system RAM.
Filters can be combined (e.g., olm l qwen3 4b finds models matching both terms). The selected models are loaded into memory, with interactive prompts for any parameters not specified via flags.
To unload models:
olm unload
# or
olm u
Pick the models to unload from the list.
A serve command is available, equivalent to ollama serve but with flag options.
olm serve
# or
olm s
Serve command options map directly to environment variables (the variables are set within the olm process only):
| Option Flag | Environment Variable |
|---|---|
| --flash-attention | OLLAMA_FLASH_ATTENTION |
| --kv-4 | OLLAMA_KV_CACHE_TYPE=q4_0 |
| --kv-8 | OLLAMA_KV_CACHE_TYPE=q8_0 |
| --keep-alive | OLLAMA_KEEP_ALIVE |
| --ctx | OLLAMA_CONTEXT_LENGTH |
| --max-loaded-models | OLLAMA_MAX_LOADED_MODELS |
Options of olm serve:
--flash-attention or -f: enable flash attention
--kv-4 or -4: use q4_0 KV cache quantization, low memory (note: this flag turns flash attention on)
--kv-8 or -8: use q8_0 KV cache quantization, balanced (note: this flag turns flash attention on)
--cpu: run on CPU only
--gpu 0 1 or -g 0 1: use specific GPUs (e.g., GPUs 0 and 1)
--keep-alive 1h or -k 1h: keep-alive timeout
--ctx 8192 or -c 8192: default context length
--max-loaded-models 4 or -m 4: maximum number of loaded models
--max-queue 50 or -q 50: maximum request queue size
--num-parallel 2 or -n 2: number of parallel requests
--port 11485 or -p 11485: port (default 11434)
--host 192.168.1.8 or -h 0.0.0.0: host address
--registry ~/some/path/ollama_models or -r ~/some/path/ollama_models: models registry directory

Examples:
olm s -fg 0
Runs with flash attention enabled, on GPU 0 only.
olm s -c 8192 --cpu
Runs with a default context window of 8192 using only the CPU.
olm s -8k 10m -m 4
Uses the q8_0 KV cache (flash attention will be enabled), keeps models loaded for ten minutes, and allows a maximum of 4 models loaded at the same time.
olm s -p 11385 -r ~/some/path/ollama_models
Runs on localhost:11385 with a custom models registry directory; use an empty directory to create a new registry.
To show the environment variables used by Ollama:
olm env
# or
olm e
| Variable | Description |
|---|---|
| OLLAMA_FLASH_ATTENTION | Enable flash attention (1 to enable) |
| OLLAMA_KV_CACHE_TYPE | Set KV cache quantization (e.g. q4_0, q8_0) |
| OLLAMA_KEEP_ALIVE | Default keep alive timeout (e.g. 5m, 2h) |
| OLLAMA_CONTEXT_LENGTH | Default context window length (e.g. 4096) |
| OLLAMA_MAX_LOADED_MODELS | Maximum number of models to load simultaneously |
| OLLAMA_MAX_QUEUE | Maximum request queue size |
| OLLAMA_NUM_PARALLEL | Number of parallel requests allowed |
| OLLAMA_HOST | Server host address (default localhost) |
| OLLAMA_MODELS | Custom models registry directory |
| CUDA_VISIBLE_DEVICES | GPU selection (use -1 to force CPU mode) |
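Since olm serve only sets these variables inside its own process, the same configuration can also be applied to a plain ollama serve by exporting them in the shell first. A sketch of the environment equivalent to olm s -8 -k 10m -m 4, with values taken from the tables above:

```shell
# Environment equivalent of `olm s -8 -k 10m -m 4`
export OLLAMA_FLASH_ATTENTION=1      # -8 implies flash attention
export OLLAMA_KV_CACHE_TYPE=q8_0     # --kv-8 / -8
export OLLAMA_KEEP_ALIVE=10m         # --keep-alive / -k
export OLLAMA_MAX_LOADED_MODELS=4    # --max-loaded-models / -m
# then run: ollama serve
```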
To use a different instance than the default localhost:11434:
-u, --use-instance <hostdomain>: Use a specific Ollama instance as the source. Example:
olm models -u 192.168.1.8:11434
This command will list the models from the Ollama instance running at 192.168.1.8 on port 11434.
-s, --use-https: Use HTTPS protocol to reach the Ollama instance.
To show information about gguf models located in the Ollama internal registries:
olm gguf
# or
olm g
This will display information about models from the Ollama model storage registries. Output:
--------- Registry hf.co/bartowski ---------
hf.co/bartowski
NousResearch_DeepHermes-3-Llama-3-8B-Preview-GGUF (1 model)
- Q6_K_L
--------- Registry ollama.com ---------
ollama.com
deepseek-coder-v2 (1 model)
- 16b-lite-instruct-q8_0
--------- Registry registry.ollama.ai ---------
registry.ollama.ai
gemma3 (3 models)
- 12b
- 27b
- 4b-it-q8_0
...
To show information about a specific model:
olm gguf -m qwen3:0.6b
Output:
Model qwen3:0.6b found in registry registry.ollama.ai
size: 498.4 MiB
quant: Q4_K_M
blob: /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx
link: ln -s /home/me/.ollama/blobs/sha256-7f4030143c1c477224c5434f8272c662a8b042079a0a584f0a27a1684fe2e1fx qwen3_0.6b_Q4_K_M.gguf
The link line prints an ln -s command that creates a symlink with a regular .gguf file name pointing at the blob, so the model can be used with llama.cpp and friends.
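The mechanism can be sketched with placeholder paths (no Ollama install needed; the blob file and names below are stand-ins, not real model data):

```shell
# Demo of the blob-to-.gguf symlink pattern with placeholder paths
mkdir -p /tmp/olm-demo
: > /tmp/olm-demo/sha256-demo-blob    # stand-in for a model blob
ln -sf /tmp/olm-demo/sha256-demo-blob /tmp/olm-demo/qwen3_0.6b_Q4_K_M.gguf
readlink /tmp/olm-demo/qwen3_0.6b_Q4_K_M.gguf   # shows the blob the link points to
```

A llama.cpp build can then load the .gguf path directly (e.g. via its model-path flag) while the data stays in Ollama's blob store.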
To show a model's template:
olm gguf -t qwen3:0.6b
To exfiltrate a model blob to a gguf file:
olm gguf -x qwen3:0.6b /path/to/destination
This command will copy the model data from its original location to the specified destination, rename it to a .gguf file, and replace the original blob with a symlink pointing to the new file. Use case: moving the model to another storage location. Use at your own risk.
To only copy a model blob without replacing the original:
olm gguf -c qwen3:0.6b /path/to/destination
This command will perform the same steps as the exfiltrate command but will not replace the original blob with a symlink.