AI Serving

Thothy Research Desk | May 25, 2026 | 5 min read

PagedAttention Makes LLM Serving a Memory Scheduling Problem

For Thothy, the practical question is not whether an LLM can answer one prompt, but whether the serving layer can keep many generation and analysis jobs moving without wasting GPU memory.

Proof stack

Evidence Chain

KV cache

Core Bottleneck

The PagedAttention paper identifies dynamic, large KV cache memory as a central obstacle to high-throughput LLM serving.

High throughput

Serving Goal

vLLM describes itself as a high-throughput and memory-efficient inference and serving engine.

PagedAttention

Mechanism

vLLM documents PagedAttention as its attention-kernel design for managing attention computation over paged KV cache blocks.

Memory management

Research Direction

Recent LLM-serving research continues to focus on GPU memory management because larger batches reduce cost pressure during inference.

1. LLM throughput depends on batching enough concurrent requests.

2. KV cache memory grows dynamically and can limit batching.

3. PagedAttention virtualizes KV cache storage so serving becomes a scheduling problem.

Thesis

The serving problem starts after the model works

A local GPU-backed AI server is only useful to Thothy if it can handle bursts of analysis, generation, translation, and report-support work predictably. The technical issue is not just model quality; it is whether the serving system can admit enough simultaneous requests to keep the GPU productive.^[1]

The original vLLM research frames high-throughput LLM serving as a batching problem constrained by memory. Large language model inference needs many requests in flight, but each request carries a key-value cache whose size changes during decoding.^[4]

Mechanism

The KV cache is the hidden growth constraint

During autoregressive decoding, the server stores prior key and value tensors so the model does not recompute attention over the full prefix at every token. That cache is useful, but it becomes a scheduling liability because each request can consume a different and changing amount of memory.^[4]
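To make the growth concrete, here is a minimal sketch of the standard per-token KV cache arithmetic: one key and one value tensor per layer, each sized by the number of KV heads and the head dimension. The `ModelShape` values are illustrative assumptions, not figures from the sources cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen only for illustration."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int  # 2 for fp16 or bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_element)


def kv_bytes_for_request(shape: ModelShape, prompt_tokens: int,
                         generated_tokens: int) -> int:
    # The cache covers the prompt plus every token generated so far,
    # so it grows by one token's worth of memory per decoding step.
    return kv_bytes_per_token(shape) * (prompt_tokens + generated_tokens)


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16.
shape = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128,
                   bytes_per_element=2)
per_token = kv_bytes_per_token(shape)                 # 524288 bytes (512 KiB)
per_request = kv_bytes_for_request(shape, 1000, 24)   # 536870912 bytes (512 MiB)
```

Under these assumed dimensions, a single request holding 1,024 tokens already occupies half a gibibyte, which is why the cache, not the model weights, tends to decide how many requests can run together.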

For Thothy, this maps directly to workload variance: a short product-summary job, a long trend-analysis job, and a translation pass should not force the serving layer into wasteful static allocation. The infrastructure needs memory behavior that tolerates mixed prompt lengths and mixed output lengths.^[4]
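A rough way to see the cost of static allocation is to compare reserved token slots under two schemes: reserving the maximum sequence length for every request, versus rounding each request up to a whole number of fixed-size blocks. The job lengths, `max_len`, and `block_size` below are hypothetical values picked for the example.

```python
import math


def static_reserved_tokens(lengths: list[int], max_len: int) -> int:
    # Static allocation: every request reserves the maximum length up front.
    return len(lengths) * max_len


def paged_reserved_tokens(lengths: list[int], block_size: int) -> int:
    # Block allocation: each request rounds up to whole blocks only.
    return sum(math.ceil(n / block_size) * block_size for n in lengths)


# Hypothetical token counts: summary, trend analysis, translation pass.
lengths = [120, 2000, 450]
used = sum(lengths)                                   # 2570

static_waste = static_reserved_tokens(lengths, max_len=2048) - used   # 3574
paged_waste = paged_reserved_tokens(lengths, block_size=16) - used    # 22
```

With block-granular allocation, the waste is bounded by less than one block per request, regardless of how uneven the job mix is.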

Design

PagedAttention borrows the right abstraction

PagedAttention is the core abstraction behind vLLM's memory-efficiency claim. Instead of treating every request's KV cache as one contiguous allocation, the serving system can manage cache storage in blocks, which makes dynamic request growth easier to schedule.^[4]^[5]

The vLLM documentation presents PagedAttention through the attention kernel and its block-oriented handling of cached keys and values. The important product implication is simple: the serving layer can spend less effort fighting fragmentation and more effort keeping useful work batched on the GPU.^[5]
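The block-oriented idea can be illustrated with a toy allocator that maps each request to a list of fixed-size blocks and grows that list only when a request crosses a block boundary. This is a simplified sketch of the concept, not vLLM's implementation; the class name, method names, and sizes are all hypothetical.

```python
class PagedKVAllocator:
    """Toy block allocator: hands out fixed-size KV blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        if num_blocks <= 0 or block_size <= 0:
            raise ValueError("num_blocks and block_size must be positive")
        self.block_size = block_size
        self.free_blocks: list[int] = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}   # request -> block ids
        self.token_counts: dict[str, int] = {}         # request -> tokens held

    def append_tokens(self, request_id: str, num_tokens: int) -> bool:
        """Grow a request's cache; return False if memory is exhausted."""
        table = self.block_tables.get(request_id, [])
        total = self.token_counts.get(request_id, 0) + num_tokens
        blocks_required = -(-total // self.block_size)  # ceiling division
        extra = blocks_required - len(table)
        if extra > len(self.free_blocks):
            return False  # the scheduler must wait, preempt, or reject
        for _ in range(extra):
            # Blocks need not be adjacent; the table records the mapping.
            table.append(self.free_blocks.pop())
        self.block_tables[request_id] = table
        self.token_counts[request_id] = total
        return True

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
allocator.append_tokens("short-summary", 20)   # takes 2 blocks
allocator.append_tokens("long-analysis", 70)   # takes 5 blocks
```

Because any free block can serve any request, a finished short job immediately frees capacity that a still-growing long job can use, which is the scheduling flexibility a contiguous layout lacks.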

Throughput

Batching becomes an operating discipline

vLLM is positioned as a high-throughput, memory-efficient inference and serving system, not merely as a model wrapper. That distinction matters because Thothy's growth pipeline creates queues of work: trend extraction, hook generation, report drafting, summarization, and translation all compete for the same GPU-backed service.^[1]

Recent memory-management research keeps returning to the same operational pressure: lowering inference cost depends on making larger effective batches possible through better GPU memory management. In that framing, throughput is not a bonus metric; it is the serving system's economic control surface.^[2]
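The link between memory management and batch size is simple division. The sketch below estimates how many requests fit in a fixed KV cache budget when each request reserves a given number of token slots. The 16 GiB budget, the 512 KiB per-token footprint, and both reservation sizes are assumptions for illustration only.

```python
def max_concurrent_requests(kv_budget_bytes: int, bytes_per_token: int,
                            reserved_tokens_per_request: int) -> int:
    """Upper bound on requests that fit in the KV cache budget."""
    if bytes_per_token <= 0 or reserved_tokens_per_request <= 0:
        raise ValueError("sizes must be positive")
    return kv_budget_bytes // (bytes_per_token * reserved_tokens_per_request)


GIB = 1024 ** 3
kv_budget = 16 * GIB             # assumed memory left for KV cache
bytes_per_token = 512 * 1024     # assumed per-token KV footprint

# Reserving a 2048-token maximum for every request:
static_batch = max_concurrent_requests(kv_budget, bytes_per_token, 2048)  # 16

# Reserving only what an assumed typical 512-token request actually holds:
paged_batch = max_concurrent_requests(kv_budget, bytes_per_token, 512)    # 64
```

Under these assumptions the same GPU sustains four times as many concurrent requests, which is the sense in which memory efficiency sets the achievable batch size.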

For acquisition, higher serving throughput means fresher trend-backed content can be generated before the topic cools.^[1]^[2]
For retention, predictable serving reduces missed publishing windows and makes recurring analysis assets more reliable.^[4]

Frontier

Long-context serving keeps pressure on the same layer

The pressure does not disappear as models improve and context windows grow. Work on combining PagedAttention with FlexAttention frames long-context inference as a memory-efficiency problem caused by conventional KV cache handling, which suggests the serving bottleneck is structural, not incidental.^[3]

This is the lesson for Thothy's AI server architecture: content intelligence workloads should be designed around explicit serving constraints. Prompt length, output length, batch size, and queue policy are product variables because they decide how quickly findings can become publishable assets.^[3]^[2]

Serving layer concern	Why it matters for Thothy
KV cache growth	Mixed generation and analysis jobs consume memory unevenly.
Batch admission	More useful concurrent work improves publishing throughput.
Memory fragmentation	Wasted GPU memory can reduce the batch size the system can sustain.
Long-context workloads	Trend intelligence and report generation push cache management harder.
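To show how prompt length, output length, and queue policy interact, the sketch below admits queued jobs against a pool of free KV blocks under two orderings. It conservatively reserves blocks for the prompt plus the maximum output up front, which keeps the example small; real serving engines may allocate incrementally instead. The `Job` type, policy names, and workload numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    prompt_tokens: int
    max_output_tokens: int


def blocks_needed(job: Job, block_size: int) -> int:
    # Worst-case footprint: full prompt plus the output token cap.
    return math.ceil((job.prompt_tokens + job.max_output_tokens) / block_size)


def admit(queue: list[Job], free_blocks: int, block_size: int,
          policy: str = "fifo") -> list[str]:
    """Return names of jobs admitted to the next batch."""
    if policy == "fifo":
        ordered = list(queue)
    elif policy == "shortest_first":
        ordered = sorted(queue, key=lambda j: blocks_needed(j, block_size))
    else:
        raise ValueError(f"unknown policy: {policy}")

    admitted = []
    for job in ordered:
        need = blocks_needed(job, block_size)
        if need > free_blocks:
            # FIFO: the head-of-line job blocks everything behind it.
            # Shortest-first: every later job is at least as large.
            break
        free_blocks -= need
        admitted.append(job.name)
    return admitted


queue = [
    Job("trend-report", prompt_tokens=3000, max_output_tokens=1000),
    Job("hook", prompt_tokens=200, max_output_tokens=100),
    Job("translation", prompt_tokens=600, max_output_tokens=600),
]
fifo_batch = admit(queue, free_blocks=200, block_size=16, policy="fifo")
short_batch = admit(queue, free_blocks=200, block_size=16,
                    policy="shortest_first")
```

With 200 free blocks, strict FIFO admits nothing because the long report at the head of the queue needs 250 blocks, while shortest-first admits the hook and translation jobs. Neither policy is correct in general; the point is that the choice changes which assets get published first.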

Implication

The right metric is completed useful work

A Thothy deployment should treat vLLM as part of the growth loop, not as isolated infrastructure. The unit of value is completed useful work: analyzed videos, generated hooks, drafted reports, translated assets, and publishable fragments that move acquisition or retention.^[1]

PagedAttention matters because it changes the question the operator asks. Instead of asking whether the model can respond, the operator can ask whether memory management, batching, and queue policy are allowing the GPU to produce enough finished work for the publishing cadence Thothy needs.^[4]^[5]

Recommendation

Measure serving as a publishing dependency

Instrument the AI server around request mix, batch behavior, queue delay, output tokens, and completed content artifacts. For Thothy, vLLM is most valuable when PagedAttention-backed memory efficiency converts GPU time into reliable generation and analysis throughput.
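One possible shape for that instrumentation is a per-request record plus a summary function covering request mix, queue delay, output tokens, and completed artifacts. The field names and sample values are hypothetical; they are not part of vLLM or any existing Thothy schema.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RequestRecord:
    job_type: str
    enqueued_at: float      # seconds since an arbitrary start
    started_at: float
    finished_at: float
    output_tokens: int
    artifact_completed: bool  # did the output become a usable asset?


def summarize(records: list[RequestRecord]) -> dict:
    if not records:
        raise ValueError("no records to summarize")
    window = (max(r.finished_at for r in records)
              - min(r.enqueued_at for r in records))
    total_tokens = sum(r.output_tokens for r in records)
    return {
        "request_mix": dict(Counter(r.job_type for r in records)),
        "mean_queue_delay_s": mean(r.started_at - r.enqueued_at
                                   for r in records),
        "output_tokens_per_s": total_tokens / window if window > 0 else 0.0,
        "completed_artifacts": sum(r.artifact_completed for r in records),
    }


records = [
    RequestRecord("summary", 0.0, 0.5, 2.5, 200, True),
    RequestRecord("report", 0.0, 1.5, 10.0, 1500, True),
    RequestRecord("translation", 1.0, 2.0, 6.0, 300, False),
]
summary = summarize(records)
```

Tracking `completed_artifacts` alongside token throughput keeps the dashboard tied to finished work, so a rise in tokens per second that does not produce more usable assets is visible as such.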

Sources

[1] github.com

GitHub - vllm-project/vllm: A high-throughput and memory-efficient ...

[2] arXiv:2503.18292

[2503.18292] Jenga: Effective Memory Management for Serving LLM with ...

Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, maximizing the request batch size by managing GPU memory efficiently is crucial. While PagedAttention has recently been proposed to impro

[3] arXiv:2506.07311

[2506.07311] Paged Attention Meets FlexAttention: Unlocking Long ...

Large Language Models (LLMs) encounter severe memory inefficiencies during long-context inference due to conventional handling of key-value (KV) caches. In this work, we introduce a novel integration of PagedAttention with PyTorch's FlexAttention, addressing i

[4] arXiv:2309.06180

Efficient Memory Management for Large Language Model Serving with ...

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When ma

[5] docs.vllm.ai

Paged Attention - vLLM

The purpose of this document is to provide a high-level explanation of the kernel implementation step by step, aiding those who wish to learn about the vLLM multi-head query attention kernel. After going through this document, users will likely have a better u

lectecy.github.io

vllm | A high-throughput and memory-efficient inference and serving ...

We are excited to announce the last in-person vLLM meetup of the year! Join the vLLM developers and engineers from Snowflake AI Research to chat about the latest LLM inference optimizations and your 2025 vLLM wishlist!

learnvllm.com

vLLM: The Modern Inference Guide

An interactive guide to modern LLM inference with vLLM: PagedAttention, continuous batching, disaggregated prefill, scheduling, tuning, and benchmarking.

nm-vllm.readthedocs.io

Welcome to vLLM! — vLLM

vLLM announcing blog post (intro to PagedAttention ) vLLM paper (SOSP 2023) How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al. vLLM Meetups. Documentation # Getting Started Installation Installation
