How to Get Faster Responses from LLMs
First you make it work. Then you make it fast. If you're building with LLMs, chances are you've already got the first part down. Your app works, the outputs are solid, but every response takes just a little too long. And users don't wait around. We live in a world where everything is instant. A 3-second loading spinner feels like an eternity. Here's what actually moves the needle.
1. Send Less In
Every token you send gets processed before the model even starts replying. So trim the fat.
- Only include what's relevant. Don't dump a whole document when you need one paragraph analyzed.
- Use compact formats. If you're sending structured data, check out TOON — it cuts ~40% of tokens compared to JSON by dropping repeated keys and unnecessary punctuation. One thing though: TOON works best with flat, tabular data. For deeply nested stuff, JSON might still be better. Test with your actual data.
- Cut boilerplate from your prompts. Every filler word costs time.
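To make the compact-format point concrete, here's a minimal hand-rolled sketch of the idea behind formats like TOON: state the keys once as a header, then send only values. (This is an illustration of the principle, not the actual TOON library or its syntax.)

```python
import json

def to_compact_table(records):
    """Render a list of flat dicts as one header row plus value rows,
    instead of repeating every key in every object like JSON does."""
    keys = list(records[0])
    lines = [",".join(keys)]
    for r in records:
        lines.append(",".join(str(r[k]) for k in keys))
    return "\n".join(lines)

users = [
    {"id": 1, "name": "Ada", "plan": "pro"},
    {"id": 2, "name": "Bob", "plan": "free"},
]
compact = to_compact_table(users)
verbose = json.dumps(users)
# The compact form drops repeated keys, braces, and quotes,
# so it's noticeably shorter than the JSON equivalent.
print(len(compact), len(verbose))
```

The savings grow with row count, since the header cost is paid once. As noted above, this only works for flat, tabular data.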
2. Ask for Less Out
Token generation is the slowest part. The less you ask for, the faster you get a reply.
- Set max_tokens to what you actually need.
- Be explicit: "answer in 2-3 sentences" or "under 100 words."
- One question per request. Don't bundle five things into one prompt.
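Both levers can live in one request builder: cap the output hard with max_tokens and ask for brevity in the prompt itself. Field names here follow OpenAI-style chat APIs, and the model name is just a placeholder; adjust for your provider.

```python
def build_request(question: str, max_tokens: int = 150) -> dict:
    """Cap completion length two ways: a hard max_tokens limit,
    plus an explicit brevity instruction in the system prompt."""
    return {
        "model": "gpt-4o-mini",      # placeholder: any small chat model
        "max_tokens": max_tokens,    # hard cap on generated tokens
        "messages": [
            {"role": "system", "content": "Answer in 2-3 sentences."},
            {"role": "user", "content": question},
        ],
    }

req = build_request("What does speculative decoding do?")
```

The instruction matters as much as the cap: max_tokens truncates mid-sentence if the model rambles, while the prompt shapes a short answer from the start.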
3. Use the Right Model, Not Just the Biggest One
A good prompt on a small model often beats a lazy prompt on a big one. Faster and cheaper too.
- Match the model to the task. Classification doesn't need a state-of-the-art reasoning model.
- Spend time on prompt engineering. Clear instructions, examples, and structured output go a long way.
- Benchmark. Small models surprise people all the time.
Take it further with smart routing. Don't pick one model for everything. Build a router that checks how complex a request is and sends it to the right place. Simple stuff (classification, extraction, formatting) hits a small fast model. Hard stuff (reasoning, creative work) goes to the big one. ~80% of requests are simple. Route them to a model that replies in 100-300ms instead of 2-5 seconds. Tools like OpenRouter, LangGraph, and Martian support this.
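A router can start as a few lines of heuristics before you reach for a framework. This is a deliberately crude sketch: the keyword markers and length threshold are illustrative assumptions you'd tune against your own traffic (or replace with a small classifier model).

```python
def route(prompt: str) -> str:
    """Crude complexity router: long prompts or reasoning-flavored
    phrasing go to the big model; everything else stays fast and cheap.
    Markers and threshold are illustrative - tune on real traffic."""
    hard_markers = ("explain why", "step by step", "write a", "analyze", "prove")
    if len(prompt) > 500 or any(m in prompt.lower() for m in hard_markers):
        return "large-model"   # slow, capable: reasoning, creative work
    return "small-model"       # fast: classification, extraction, formatting

print(route("Is this email spam?"))                           # → small-model
print(route("Explain why this proof fails, step by step."))   # → large-model
```

Even a rough router like this captures most of the win, because the easy majority of traffic never touches the slow model.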
4. Run Calls in Parallel
If your workflow has multiple independent LLM calls, don't run them one after another. Run them at the same time.
- Analyzing three documents? Fire all three calls at once.
- Need a guardrail check and a generation? Do both simultaneously.
- Building an agent? Map out which steps are actually sequential vs. which can fan out.
This won't speed up any single call. But it can cut your total wait time by 2-5x. LangGraph and most async frameworks handle this natively.
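The fan-out pattern is a one-liner with asyncio. The analyze function below is a stand-in that simulates latency with a sleep; in a real app you'd swap in your async client call.

```python
import asyncio

async def analyze(doc: str) -> str:
    """Stand-in for one LLM call; replace with your real async client."""
    await asyncio.sleep(0.1)   # simulates network + generation latency
    return f"summary of {doc}"

async def main():
    docs = ["report.pdf", "notes.txt", "email.eml"]
    # gather() fires all three calls concurrently, so total wait is
    # roughly one call's latency instead of the sum of all three.
    return await asyncio.gather(*(analyze(d) for d in docs))

results = asyncio.run(main())
```

Run sequentially, those three calls would take ~0.3s; gathered, ~0.1s. The same shape works for guardrail-plus-generation or any independent agent steps.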
5. Pick Faster Infrastructure
Not all providers are equal. Some are built specifically for speed.
- Groq - custom LPU chips. Very fast inference.
- Cerebras - wafer-scale hardware for high throughput.
- Together AI, Fireworks AI - optimized serving for open-weight models.
Always benchmark for your specific use case. Latency varies a lot by model, input size, and provider.
If you self-host, hardware upgrades matter. Going from an A100 to an H100 can cut latency in half just from better memory bandwidth.
6. Use API Features Built for Speed
Most APIs have features you're probably not using.
- Streaming - tokens appear as they're generated. No waiting for the full response.
- Prompt caching - the provider skips reprocessing your system prompt on every call. Huge for repeated contexts.
- Batch API - if latency per request doesn't matter, batching improves overall throughput.
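Prompt caching typically matches on a stable prefix, so the practical move is structural: keep your long static instructions byte-identical at the front of every request and put the changing content last. A sketch (exact caching behavior and message format vary by provider):

```python
# Long, static instructions: keep this byte-identical across calls
# so the provider's prompt cache can match it as a prefix.
SYSTEM_PROMPT = "You are a support assistant for Acme Corp. Follow the policy below..."

def build_messages(context: str, user_query: str) -> list[dict]:
    """Stable prefix first, dynamic tail last: only the changing part
    gets reprocessed on a cache hit. (Provider-dependent behavior.)"""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},               # cacheable prefix
        {"role": "user", "content": f"{context}\n\n{user_query}"},  # dynamic tail
    ]

m1 = build_messages("Order #123 details...", "Where's my order?")
m2 = build_messages("Order #456 details...", "Can I get a refund?")
```

The common mistake that defeats caching is interpolating anything dynamic (timestamps, user names) into the system prompt itself.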
7. Cache Responses
This is different from prompt caching. This is you storing LLM responses so you don't have to make the same call twice.
- Exact-match caching - same input, same output. Simple and effective.
- Semantic caching - uses embeddings to match similar queries to cached answers, even if the words are different. Teams typically see 30-50% hit rates. Cached responses come back in microseconds.
If your users ask similar questions a lot (support bots, FAQ-style queries, repeated lookups) this can eliminate LLM calls entirely for a big chunk of traffic. Set expiration policies so stale answers don't stick around.
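Exact-match caching with expiration fits in a small class. This in-memory sketch keys on a hash of model plus prompt; in production you'd back it with Redis or similar, but the logic is the same.

```python
import hashlib
import time

class ResponseCache:
    """Exact-match response cache with a TTL so stale answers
    don't stick around. Swap the dict for Redis in production."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        entry = self._store.get(self._key(model, prompt))
        if entry is None:
            return None
        response, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            return None   # expired: force a fresh LLM call
        return response

    def put(self, model: str, prompt: str, response: str):
        self._store[self._key(model, prompt)] = (response, time.monotonic())

cache = ResponseCache(ttl_seconds=600)
cache.put("small-model", "What are your hours?", "We're open 9-5, Mon-Fri.")
```

Check the cache before every call, write through after every miss. Semantic caching follows the same shape but swaps the hash lookup for an embedding similarity search.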
8. Optimize Inference (Self-Hosted)
Running your own models? These techniques go beyond just picking a smaller one.
- Quantization - shrink weights from 16-bit to 8-bit or 4-bit. Get 2-4x faster inference with barely any quality loss. Use GPTQ, AWQ, or bitsandbytes.
- Speculative decoding - a small "draft" model guesses several tokens ahead, then the big model verifies them all at once. Up to 3x speedup with zero quality loss. Bad guesses just get thrown away. vLLM and SGLang support this out of the box.
- KV caching - stores the attention keys and values so the model doesn't redo work for tokens it already processed. Big win for long sequences.
- Continuous batching - process requests as they finish instead of waiting for the whole batch. Better throughput without the latency hit of static batching.
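The speculative-decoding accept/reject loop is easy to see with toy stand-ins for the two models. Everything here is illustrative: the "models" below just emit letters of the alphabet, and a real target model verifies the whole drafted batch in one forward pass rather than token by token.

```python
def draft_model(prefix: list[str]) -> list[str]:
    """Toy draft model: cheaply guesses the next four tokens.
    (Here it just continues the alphabet; a real draft is a small LLM.)"""
    return [chr(ord("a") + len(prefix) + i) for i in range(4)]

def target_next(prefix: list[str]) -> str:
    """Toy target model: the 'correct' next token for a given prefix."""
    return chr(ord("a") + len(prefix))

def speculative_step(prefix: list[str]) -> list[str]:
    """One speculative step: accept drafted tokens while they match
    what the target model would have produced; discard the rest."""
    accepted = []
    for tok in draft_model(prefix):
        if tok == target_next(prefix + accepted):
            accepted.append(tok)   # verified "for free"
        else:
            break                  # bad guess: throw the tail away
    return accepted

print(speculative_step(["a", "b"]))  # → ['c', 'd', 'e', 'f']
```

The output is always exactly what the target model alone would produce, which is why the speedup comes with zero quality loss: rejected guesses cost a little wasted draft work, accepted ones skip full decode steps.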
9. Make It Feel Faster
Perceived speed matters as much as real speed.
- Stream intermediate steps. Don't just show the final answer. Show what the model is doing: retrieval, planning, thinking.
- Run agents in the background. If a task is triggered by an event (email, webhook), the user doesn't need to watch it happen. Just notify them when it's done.
- Optimize time-to-first-token. The gap before the first word appears is what people feel as "waiting." Keep it under 200ms and the whole thing feels snappy.
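Time-to-first-token is easy to measure yourself by timing the first chunk of a streamed response. The stream below is a fake generator standing in for a real streaming API (e.g. stream=True in OpenAI-style clients).

```python
import time

def fake_stream():
    """Stand-in for a streaming API response: yields tokens as generated."""
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)   # simulates per-token generation latency
        yield tok

def consume(stream):
    """Render tokens as they arrive and record time-to-first-token,
    the delay users actually perceive as 'waiting'."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for tok in stream:
        if ttft is None:
            ttft = time.monotonic() - start   # first token landed
        chunks.append(tok)   # in a UI you'd flush this to the screen
    return "".join(chunks), ttft

text, ttft = consume(fake_stream())
```

Track TTFT as its own metric, separate from total latency: a response that starts in 150ms and streams for 3 seconds feels faster than one that appears whole after 1 second.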
10. Time Your Requests
Minor but real: API speeds vary by time of day and day of week. Shared infrastructure gets busy during peak hours.
If you're running batch jobs or background tasks, shifting them to off-peak hours can shave meaningful time off. Not always possible, but worth knowing.
TL;DR: Send less, ask for less, route to the right model, parallelize what you can, cache aggressively, and pick fast infrastructure. If you self-host, quantize and try speculative decoding. Don't sleep on perceived speed either: streaming and showing progress make everything feel faster. Most people leave easy wins on the table.