Models for 16 GB VRAM?

Which models are currently good for coding tasks? I just ran Qwen3-14B-Q6_K.gguf with llama.cpp on my card with 16 GB of VRAM (plus 32 GB of DDR4 system RAM), but a single short conversation comes very close to filling the entire VRAM, so I am looking for some smaller alternatives to test.

I might add an OpenCode container to the mix next, in case that is relevant.

 
    
podman run --rm --replace --pull=newer \
  --name llama \
  -p 8080:8080 \
  -v ./llama_models:/models:Z \
  --device /dev/dri/card1:/dev/dri/card1 \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  ghcr.io/ggml-org/llama.cpp:full-vulkan \
  --server \
  -m /models/Qwen3-14B-Q6_K.gguf \
  -ngl 99 \
  -fa on \
  -c 16384 \
  --temp 0.6 \
  --top-k 20 \
  --top-p 0.95 \
  --jinja \
  --host 0.0.0.0 --port 8080
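
For context on why VRAM fills up so quickly, here is a rough back-of-the-envelope estimate of the KV cache that the command above reserves for `-c 16384`. It is only a sketch: the layer count, KV head count, and head dimension are my assumptions about Qwen3-14B and should be checked against the GGUF metadata that llama.cpp prints at load time. It also assumes the KV cache is stored as f16.

```shell
#!/usr/bin/env bash
# Rough KV cache estimate for the llama.cpp command above.
# The model figures below are ASSUMPTIONS (believed to match Qwen3-14B);
# verify them against the metadata llama.cpp logs when loading the GGUF.
n_layers=40        # assumed number of transformer layers
n_kv_heads=8       # assumed number of KV heads (grouped-query attention)
head_dim=128       # assumed per-head dimension
bytes_per_elem=2   # assumes an f16 KV cache
n_ctx=16384        # taken from "-c 16384" in the command

# Each token stores one K and one V vector per layer,
# each of size n_kv_heads * head_dim elements.
kv_bytes_per_token=$(( 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem ))
kv_cache_mib=$(( kv_bytes_per_token * n_ctx / 1024 / 1024 ))

echo "KV cache per token: ${kv_bytes_per_token} bytes"
echo "KV cache at ${n_ctx} tokens: ${kv_cache_mib} MiB"
```

Under those assumptions the cache alone takes about 2.5 GiB, on top of the model file itself (check its size with `ls -lh ./llama_models`) and the compute buffers. Halving `-c` halves the cache, which may matter as much as picking a smaller quant.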


  

Comments

8
