The cheapest way to look at images with an LLM: $0.15 flat per 1M tokens, both directions.
Pixtral 12B answers one question cheaply: "what's in this image?" At a flat $0.15 per 1M tokens — the same rate in and out, the only flat-priced model on our table — it makes million-image pipelines affordable in a way frontier vision pricing never will.
It's a 12-billion-parameter open-weights model, which sets expectations correctly: this is a tool for volume vision tasks, not a frontier brain that happens to see. The open release also means you can self-host it on a single GPU when API economics stop making sense.
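A rough way to sanity-check the single-GPU claim is to multiply the parameter count by the bytes stored per parameter. The sketch below does only that. It uses the 12-billion figure quoted above, the helper name `weight_memory_gb` is made up for illustration, and it deliberately ignores activations, KV cache, and runtime overhead, all of which add to the real footprint.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


PARAMS = 12e9  # the 12-billion-parameter figure from the text

# Storage widths are common choices, not a statement about any official release format.
for label, width in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, width):.0f} GB of weights")
```

At 16-bit precision that comes to roughly 24 GB for the weights, so the headroom on a given card depends on how much memory is left over for context and batching.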
What it does well: solid image understanding (captioning, OCR-ish reading, chart and screenshot description, content tagging), plus ordinary text chat in the same call. The 128K window fits dozens of images per request, which is useful for batch processing.
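How many images actually fit depends on how many tokens each image consumes, which varies with resolution. The sketch below is a budgeting aid under stated assumptions: the window is treated as exactly 128,000 tokens shared by input and output, and the per-image, prompt, and reply token counts are placeholder guesses to replace with numbers measured on your own data. The function names are hypothetical.

```python
from typing import Iterator, List, Sequence

CONTEXT_WINDOW = 128_000  # "128K" taken as 128,000 tokens (assumption)


def images_per_request(tokens_per_image: int,
                       prompt_tokens: int = 500,
                       reply_tokens_per_image: int = 100) -> int:
    """Images that fit in one request, reserving room for the prompt and replies.

    All three token counts are assumptions; measure them on real traffic.
    """
    per_image = tokens_per_image + reply_tokens_per_image
    return max((CONTEXT_WINDOW - prompt_tokens) // per_image, 0)


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split a list of image references into request-sized chunks."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Example: assume about 4,096 tokens per image.
per_request = images_per_request(tokens_per_image=4096)
print(per_request)
```

With those placeholder numbers the budget works out to 30 images per request, which is where "dozens" comes from. Smaller images stretch that further.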
The honest weakness: detail and reasoning. Fine-grained chart analysis, dense document layouts, and visual reasoning chains belong to Opus 4.8, Gemini 3.1 Pro, or GPT-5.5 — at 30–300× the price.
| Model | Input / 1M | Output / 1M | Context |
|---|---|---|---|
| Pixtral 12B | $0.15 | $0.15 | 128K |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | 1M |
| GPT-5.4 mini | $0.25 | $2 | 272K |
| Claude Haiku 4.5 | $1 | $5 | 200K |
| Mistral Small 3.1 | $0.20 | $0.60 | 128K |
Every rival in the table charges more, and the gap is widest on output: Gemini 3.1 Flash-Lite costs 10× as much per output token, GPT-5.4 mini about 13×. For pure "describe/tag/filter this image" volume, Pixtral is the price floor. The moment the task becomes "reason about this image", spend up.
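To turn the table into a budget, multiply token volume by each rate. The sketch below copies the prices from the table and prices a hypothetical job. The workload (one million images, 1,000 input and 100 output tokens per image) is an illustrative assumption, and it applies the same token counts to every model, which is a simplification because each model tokenizes images differently.

```python
# (input $/1M tokens, output $/1M tokens), copied from the table above
PRICES = {
    "Pixtral 12B": (0.15, 0.15),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "GPT-5.4 mini": (0.25, 2.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Mistral Small 3.1": (0.20, 0.60),
}


def job_cost(model: str, n_images: int,
             in_tokens_per_image: int, out_tokens_per_image: int) -> float:
    """Dollar cost of a job; the per-image token counts are caller-supplied assumptions."""
    in_rate, out_rate = PRICES[model]
    total_in = n_images * in_tokens_per_image
    total_out = n_images * out_tokens_per_image
    return (total_in * in_rate + total_out * out_rate) / 1_000_000


def output_multiple(model: str, baseline: str = "Pixtral 12B") -> float:
    """How many times more the model charges per output token than the baseline."""
    return PRICES[model][1] / PRICES[baseline][1]


# Hypothetical workload: 1M images, 1,000 tokens in and 100 tokens out per image.
for name in PRICES:
    cost = job_cost(name, 1_000_000, 1_000, 100)
    print(f"{name}: ${cost:,.0f} ({output_multiple(name):.1f}x output rate)")
```

Under those assumptions the job costs about $165 on Pixtral 12B, $260 on Mistral Small 3.1, $400 on Gemini 3.1 Flash-Lite, $450 on GPT-5.4 mini, and $1,500 on Claude Haiku 4.5. The longer the outputs, the wider the gap gets.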