
WebSockets Made Agentic Coding Loops 40% Faster

17 Mar 2026 · 2 min read · 378 words

The protocol bottleneck

I’ve been building and profiling AI coding agents for a while. One thing that stood out is how much time they waste just talking to the API. Every turn, whether it reads a file, runs a test, or writes a fix, involves sending the entire conversation history again. The system prompt, the tool schema, and every past message go back over the wire each time. The per-turn payload grows linearly with the number of turns, and the latency piles up just as fast.

At first I figured this was some unavoidable part of how GPU clusters route stateless jobs. It wasn’t. The problem was simpler: HTTP itself.


HTTP doesn’t remember anything between calls, so the client has to resend the full context on every request. That’s fine for a chatbot, but terrible for agents that loop through dozens of tool calls.

Switching to a WebSocket-based protocol fixed it. The connection stays open, the server keeps the last response in memory, and I only send what changed. Instead of pushing the entire history, I just send the update and a pointer to the prior response.

// Full HTTP payload (every turn): the whole history goes back over the wire
await fetch("/api", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [
      { role: "system", content: "You are a coding agent..." },
      { role: "user", content: "Read index.ts" },
      { role: "assistant", content: "OK, here’s the code..." },
      { role: "user", content: "Fix the type error" }
    ]
  })
});

// Incremental WebSocket payload: just the delta plus a pointer
// (socket is an already-open WebSocket to the same API)
socket.send(JSON.stringify({
  previous_response_id: "resp_abc123",
  messages: [{ role: "user", content: "Fix the type error" }]
}));
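
The server side is just a map from response IDs to accumulated history. Here’s a minimal sketch of what that could look like with Node’s ws library; contextStore and runAgentTurn are illustrative stand-ins, not the real implementation:

import { WebSocketServer } from "ws";
import { randomUUID } from "node:crypto";

// In-memory map: response ID -> full message history so far
const contextStore = new Map();

// Stand-in for the actual model call
async function runAgentTurn(messages) {
  // ...call the inference API with the full context here...
  return { role: "assistant", content: "(model output)" };
}

const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket) => {
  socket.on("message", async (raw) => {
    const { previous_response_id, messages } = JSON.parse(raw.toString());
    // Rehydrate prior context from memory instead of the wire
    const history = contextStore.get(previous_response_id) ?? [];
    const fullContext = [...history, ...messages];
    const reply = await runAgentTurn(fullContext);
    // Store the updated history under a fresh ID for the next delta
    const responseId = randomUUID();
    contextStore.set(responseId, [...fullContext, reply]);
    socket.send(JSON.stringify({ response_id: responseId, reply }));
  });
});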

The change cut loop time by about 40%. For an agent running dozens of short tool calls, that’s the difference between snappy and sluggish.

Where the latency really is

Most people assume the model runtime, the “thinking” time, is the main delay. In practice, a lot of the wall-clock time disappears into protocol overhead: serializing context, shipping it across the network, and unpacking it again just to process a small delta. It’s wasted motion.
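
A quick back-of-envelope shows the scale. If each message is roughly the same size, resending the full history makes cumulative wire traffic grow quadratically with turn count, while deltas keep it linear. Toy numbers, not measurements:

// Toy comparison: cumulative bytes on the wire over 50 turns,
// assuming ~2 KB per message. Numbers are illustrative, not measured.
const MSG_BYTES = 2_000;
const TURNS = 50;

let fullResend = 0;
let deltaOnly = 0;
for (let n = 1; n <= TURNS; n++) {
  fullResend += n * MSG_BYTES; // entire history, every turn
  deltaOnly += MSG_BYTES;      // only the newest message
}

console.log({ fullResend, deltaOnly });
// { fullResend: 2550000, deltaOnly: 100000 } => ~25x less traffic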

That’s a distributed systems problem, not an “AI problem.” Stateless protocols are simple and reliable; stateful ones are faster but more fragile. It’s the same tradeoff we’ve been dealing with in backend engineering for decades. AI systems are just the newest place it shows up.
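
The fragility has a cheap escape hatch, though: if the socket drops or the server evicts its cached state, the client falls back to one full-history resend, which is just the old HTTP behavior. A sketch of that recovery path (the unknown_response_id error shape is illustrative):

// Client-side recovery: keep a local copy of the history so a lost
// server cache costs one full resend, not a broken session.
const localHistory = []; // push every message here as the loop runs

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(String(event.data));
  if (msg.error === "unknown_response_id") {
    // Server state is gone (restart, eviction, reconnect):
    // fall back to the stateless path once, then resume deltas.
    socket.send(JSON.stringify({ messages: localHistory }));
  }
});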

What’s next

There’s low-hanging fruit everywhere: connection pooling, batching, delta compression, caching. None of this is new — just web infrastructure applied to a different kind of workload.
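
As one example, even plain HTTP gets faster if you stop paying the TCP and TLS handshake on every call. A sketch with Node’s undici, which exposes pooling and keep-alive directly (the endpoint is a placeholder):

// Connection pooling with keep-alive: reuse TCP/TLS across turns.
// api.example.com is a placeholder endpoint.
import { Agent, fetch } from "undici";

const agent = new Agent({
  connections: 4,           // small pool per origin
  keepAliveTimeout: 30_000, // keep idle sockets warm for 30 s
});

await fetch("https://api.example.com/api", {
  dispatcher: agent,
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: [] }),
});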

The model gets most of the credit, but the system around it sets the real limits. Faster inference helps, but faster I/O, fewer round-trips, and better plumbing matter just as much. That’s the part I’ve been focusing on lately — and it’s where most of the interesting work still is.