🚀 Local LLM Benchmarking Results

Intel i7-1185G7 with Iris Xe Graphics • 32GB RAM • Dell Latitude 5420

📈 Charts (titles and subtitles; images not reproduced here)

- GPU vs CPU Performance: tokens per second by model size (higher is better)
- Framework Performance: OpenVINO vs Ollama across different models
- Overall Model Performance Ranking: all tested configurations sorted by tokens/second
- GPU Speedup vs CPU: performance multiplier (1.0x = equal performance)
- Performance by Parameter Count: how model size affects inference speed
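The speedup multiplier in the charts is just the ratio of measured throughputs. A minimal sketch of how these metrics are derived, using illustrative numbers rather than the actual benchmark data:

```python
def tokens_per_second(token_count: int, elapsed_s: float) -> float:
    """Throughput: generated tokens divided by wall-clock generation time."""
    return token_count / elapsed_s

def gpu_speedup(gpu_tps: float, cpu_tps: float) -> float:
    """Multiplier relative to CPU; 1.0x means equal performance."""
    return gpu_tps / cpu_tps

# Illustrative values only, not the measured results above.
cpu_tps = tokens_per_second(256, 64.0)   # 4.0 tok/s
gpu_tps = tokens_per_second(256, 49.2)
print(f"speedup: {gpu_speedup(gpu_tps, cpu_tps):.2f}x")
```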

📊 Key Insights

⚡ GPU Advantage: Minimal

For 7B models, the GPU provides only a 1.3x speedup over the CPU. The setup complexity often outweighs the marginal performance gain.

🏆 CPU Competitiveness

CPU tied or beat GPU for models ≤4B parameters. Modern Intel CPUs are highly capable for smaller models.

🔧 Framework Matters

OpenVINO CPU was 1.6x faster than Ollama for Mistral 7B, but Ollama outperformed OpenVINO by 32% for Llama 3.1 8B.

🎯 Best Performer

Qwen3-VL 8B via Ollama achieved 5.14 tok/s—multimodal capabilities with zero GPU complexity.
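The tok/s figures above come from timing token generation. With Ollama, the final `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds), from which throughput can be derived. A sketch, assuming those documented field names:

```python
def ollama_tps(response: dict) -> float:
    """Tokens per second from an Ollama /api/generate final response.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds, so we scale by 1e9 to get seconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9

# Synthetic response for illustration, not a real benchmark run.
resp = {"eval_count": 514, "eval_duration": 100_000_000_000}  # 100 s
print(f"{ollama_tps(resp):.2f} tok/s")  # 5.14
```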

💾 Memory Ceiling

32GB of RAM maxed out at 8B models in this setup. Larger models, which might benefit more from the GPU, would require more memory.
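A rough way to see why RAM caps model size: weight memory is approximately parameter count times bits per weight, plus overhead for the KV cache and activations. A back-of-the-envelope sketch (the 1.2x overhead factor is an assumption, not a measurement):

```python
def approx_model_ram_gb(params_billions: float, bits_per_weight: int,
                        overhead: float = 1.2) -> float:
    """Rough RAM estimate: weight bytes at the given quantization,
    inflated by an assumed overhead factor for KV cache, activations, etc."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# e.g. an 8B model at 4-bit quantization:
print(f"{approx_model_ram_gb(8, 4):.1f} GB")  # ~4.8 GB
```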

💡 Recommendation

For most use cases: use Ollama on CPU. Skip the GPU setup unless you are running sustained high-throughput workloads.