ClawKit Reliability Toolkit

Fix: Ollama Local Model Stuck Thinking (CPU / No GPU)

Permanent Thinking: No Error, No Response

OpenClaw shows a spinning "thinking" indicator and never delivers a response. There's no error message: the request reaches Ollama, but Ollama is too slow to respond before the timeout fires.

CPU inference is orders of magnitude slower than GPU. A 7B model that responds in 3 seconds on a GPU can take 5–10 minutes on a modern CPU, long past OpenClaw's default timeout. The fix is either a smaller model or a longer timeout (or both).

Next Step

Fix now, then reduce repeat incidents

If this issue keeps coming back, validate your setup in Doctor first, then harden your config.

What You See

OpenClaw chat: thinking… (spins forever, no error)
openclaw logs: [llm] request sent to ollama — no response yet (60s)
ollama logs: llama_model_load: loading model… (still loading)

Fix A: Switch to a Smaller Model

This is the most impactful change. On CPU-only systems, model size directly controls whether the tool is usable at all:

| Model | Download size | Typical CPU response | Verdict |
| --- | --- | --- | --- |
| llama3.2:1b | ~800 MB | 5–30 s | Good |
| qwen2.5:1.5b | ~1 GB | 10–45 s | Good |
| phi3:mini | ~2.3 GB | 20–90 s | Acceptable |
| llama3.2:3b | ~2 GB | 30–120 s | Slow |
| llama3.1:8b | ~4.7 GB | 5–15 min | Impractical |

Pull a smaller model and update your config:

Pull a CPU-friendly model
ollama pull llama3.2:1b
# or
ollama pull qwen2.5:1.5b
openclaw.json — Use Smaller Model
{
  "llm": {
    "provider": "ollama",
    "baseUrl": "http://localhost:11434",
    "model": "llama3.2:1b"
  }
}

Fix B: Increase the Timeout

If you want to keep using a larger model and accept the slower speed, increase requestTimeout to give Ollama enough time to respond:

openclaw.json — Long Timeout for CPU
{
  "llm": {
    "provider": "ollama",
    "baseUrl": "http://localhost:11434",
    "model": "llama3.1:8b",
    "requestTimeout": 600000
  }
}

requestTimeout is in milliseconds, so 600000 is 10 minutes. Treat this as a last resort: the UX will be very slow. Combining a smaller model with a 120-second timeout is a much better user experience.
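As a sketch, the combined approach (Fix A's smaller model plus a moderate timeout) would look like this; the 120000 ms value is illustrative, so tune it to your hardware:

```json
{
  "llm": {
    "provider": "ollama",
    "baseUrl": "http://localhost:11434",
    "model": "llama3.2:1b",
    "requestTimeout": 120000
  }
}
```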

Fix C: Verify Ollama Is Actually Running

Before tuning timeouts, confirm Ollama is responsive:

Check Ollama status
# Is Ollama running?
curl http://localhost:11434/api/tags

# Test a real inference (watch the timing)
time curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2:1b","prompt":"say hi","stream":false}'
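A healthy /api/tags response is JSON with a models array. As a sketch of how to pull out just the model names without extra tooling, the snippet below parses a sample payload with grep and cut (the payload here is illustrative, not real Ollama output, which includes more fields per model):

```shell
# Sample /api/tags-style payload (illustrative)
resp='{"models":[{"name":"llama3.2:1b"},{"name":"qwen2.5:1.5b"}]}'

# grep -o prints each "name":"..." match on its own line;
# cut takes the 4th double-quote-delimited field, i.e. the name value
printf '%s\n' "$resp" | grep -o '"name":"[^"]*"' | cut -d'"' -f4
```

In practice you would pipe the output of `curl http://localhost:11434/api/tags` into the same grep/cut chain.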

What to verify:

- /api/tags returns a JSON list. If this fails, Ollama is not running; start it with ollama serve.
- /api/generate responds, even if slowly. This confirms model loading and inference work end-to-end.
- The model you configured is in the list. Run ollama list; the exact model name must match openclaw.json.

Fix D: Model Name Must Be Exact

If the model name in openclaw.json doesn't match what's pulled in Ollama, the request silently fails. Check the exact name:

List downloaded models
ollama list

Use the exact name from the NAME column, including the tag (e.g. llama3.2:1b, not just llama3.2). If the tag is latest, you can omit it or include it; either works.
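The exact-match requirement can be checked mechanically. A minimal sketch, assuming the pulled model names are already in a variable (in practice you would feed in the NAME column of `ollama list` output):

```shell
# Names as they would appear in `ollama list` (illustrative list)
pulled="llama3.2:1b
qwen2.5:1.5b"

configured="llama3.2:1b"   # value from openclaw.json, tag included

# grep -x demands a whole-line match, so a bare "llama3.2" would NOT match
if printf '%s\n' "$pulled" | grep -qx "$configured"; then
  echo "ok: $configured is pulled"
else
  echo "missing: run ollama pull $configured"
fi
```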

CPU Performance Expectations

CPU inference is not fast: set realistic expectations

A 2024 AMD Ryzen 9 CPU can do roughly 10–15 tokens/sec on a 1B model. A typical response is 100–300 tokens, so expect 10–30 seconds per reply. This is fine for occasional use, not great for interactive chat.
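The arithmetic behind that estimate is simply seconds per reply ≈ response tokens ÷ tokens per second. A quick sketch with mid-range numbers from above:

```shell
# 200-token reply at 12 tokens/sec (mid-range of the figures above)
tokens=200
tps=12
echo "$((tokens / tps)) seconds per reply"   # integer division: 16 seconds
```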

Run the Doctor

npx clawkit-doctor@latest

Checks Ollama service status, model availability, and response time.
