Why this is today’s story
On March 20, 2026 (UTC), vLLM shipped v0.18.0, a notably product-facing release that moves beyond raw throughput claims and into deployment architecture: transport choices (gRPC), workload separation (GPU-less render serving), and lower-cost context handling (KV offloading + FlexKV).
What changed in the last 24-72 hours
1) vLLM added first-class gRPC serving
vLLM introduced a new --grpc serving mode alongside HTTP/REST, giving teams a lower-overhead RPC option for internal service-to-service traffic.
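As a sketch of what that looks like operationally (only the --grpc mode itself is named in the release notes; the model name and ports below are illustrative):

```shell
# HTTP/REST serving (existing default, OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# gRPC serving (new in v0.18.0) for internal service-to-service traffic;
# port choice is illustrative
vllm serve meta-llama/Llama-3.1-8B-Instruct --grpc --port 50051
```

The practical appeal: gRPC's binary framing and persistent HTTP/2 streams cut per-call overhead versus JSON-over-REST when agent and tool-call traffic is chatty.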
2) Multimodal preprocessing can be split off GPUs
The new vllm launch render path enables GPU-less preprocessing/rendering, so expensive accelerators can stay focused on inference while CPU nodes handle input preparation.
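A hedged sketch of the split topology (the release names the vllm launch render path; the wiring flags below are hypothetical and shown only to illustrate the shape of the deployment):

```shell
# CPU-only node: multimodal preprocessing/rendering, no GPU required
vllm launch render --model Qwen/Qwen2.5-VL-7B-Instruct --port 7000   # flags illustrative

# GPU node: generation only, pointed at the render tier
# (--render-endpoint is a hypothetical flag for this wiring)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --render-endpoint http://cpu-render-node:7000
```

The design point stands regardless of exact flag names: CPU render nodes scale horizontally on cheap instances while the GPU pool stays saturated with generation work.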
3) KV cache handling got more operationally practical
v0.18.0 highlights smarter CPU KV offloading (frequently reused blocks are preferentially kept in the CPU store), plus FlexKV support and multi-KV-group support in offloading specs. Together these help teams manage long-context costs without fully overprovisioning GPU memory.
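A hedged sketch of enabling CPU KV offloading via vLLM's KV-transfer config (the connector and field names follow pre-0.18 vLLM conventions and may differ in this release; the block count is illustrative):

```shell
# Offload KV cache blocks to CPU memory; num_cpu_blocks sizes the CPU-side pool
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector": "OffloadingConnector",
                         "kv_role": "kv_both",
                         "kv_connector_extra_config": {"num_cpu_blocks": 10000}}'
```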
4) Supporting signal: orchestration layers are hardening defaults
Also on March 20, Microsoft Semantic Kernel .NET 1.74.0 shipped breaking security hardening for plugins (deny-by-default directory behavior and tighter plugin defaults). That reinforces a broader product trend: production AI stacks are emphasizing safer defaults and operational controls, not only model capability.
Why it matters for product teams
- Architecture is becoming the differentiator: Teams that separate preprocessing, inference, and tool orchestration can scale more cheaply than monolithic “single giant inference box” deployments.
- Latency budgets are now protocol-level decisions: gRPC vs REST is no longer an infra footnote when agent traffic and tool calls are high volume.
- Reliability work is shifting left: Security/default hardening in orchestration frameworks suggests vendors expect more AI features in regulated or enterprise settings.
Practical takeaway this week
If you run self-hosted or hybrid inference, this is a good week to benchmark a split pipeline:
- Preprocess on CPU nodes.
- Reserve GPU pools for generation.
- Compare REST vs gRPC p95 latency and infra cost at identical traffic.
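For the latency comparison, a minimal p95 helper can be sketched in shell (the endpoint URL is hypothetical; any OpenAI-compatible route works for the REST side):

```shell
# p95: read one latency per line on stdin, print the 95th-percentile value
# (nearest-rank method: sort ascending, take the ceil(0.95 * N)-th sample)
p95() { sort -n | awk '{a[NR]=$1} END {print a[int(NR*0.95 + 0.999)]}'; }

# Usage against a live endpoint (requires a running server; URL hypothetical):
#   for i in $(seq 1 200); do
#     curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8000/v1/models
#   done | p95

# Demo with synthetic latencies so the helper is verifiable offline:
printf '%s\n' 0.10 0.12 0.11 0.30 0.09 0.15 0.13 0.14 0.12 0.11 | p95   # prints 0.30
```

Run the same loop against the REST and gRPC endpoints at identical traffic and compare the two p95 figures alongside per-node infra cost.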
Primary sources
- vLLM release v0.18.0 (published 2026-03-20): https://github.com/vllm-project/vllm/releases/tag/v0.18.0
- Semantic Kernel .NET 1.74.0 (published 2026-03-20): https://github.com/microsoft/semantic-kernel/releases/tag/dotnet-1.74.0