# Dual Batch Overlap

## Motivation

The core motivation of the DBO system in vLLM is to overlap the sparse all-to-all communication in the MoE layer with the surrounding computation. This system currently only targets DP+EP deployments.
## Introduction

The Dual Batch Overlap system works by splitting the batch in the model runner, creating two worker threads, and then running the model on each of these worker threads. When DBO is enabled, yield points within the `FusedMoEModularKernel` allow the two CPU worker threads (also called UBatch threads) to ping-pong between each other, so that while one is running compute, the other is waiting on communication. Throughout the code, "ubatch" may be used as a short form of "microbatch"; this is an ASCII-friendly version of the short form µ-batch.
The DBO system includes modifications to `GpuModelRunner` and `ModularKernel`, and defines two utility classes: `UBatchWrapper` and `UBatchContext`. `UBatchWrapper` manages thread lifecycle and CUDA graph execution of the model. `UBatchContext` wraps `ForwardContext` to coordinate synchronization between the two UBatch threads.
Below is the overlap schedule that is currently implemented in vLLM.

```text
# Schedule notation legend:
# S  = Shared expert
# A0 = MLA qkv proj
# A1 = Core attn + out proj + MoE gate
# D  = Dispatch
# C  = Combine
#
# Comp:  |-A0₀-A1₀-||-MLP₁-||-S₁-MLP₀-||-S₀-A0₁-A1₁-|
# Comm:  |----D₁---||--D₀--||----C₁---||-----C₀-----|
#
# Order: D₁ send, A0₀, A1₀, D₁ recv, D₀ send, MLP₁, D₀ recv,
#        C₁ send, S₁, MLP₀, C₁ recv, C₀ send, S₀, A0₁, A1₁, C₀ recv.
#
# MLP_SHARED_OVERLAP = "mlp_shared_overlap"
```
## Running with DBO

To enable the DBO system, pass the `--enable-dbo` argument to your `vllm serve` command. It must be run in conjunction with `--data-parallel-size N`, where `N` is greater than 1, and `--enable-expert-parallel`. Additionally, there are two configuration knobs:

- `--dbo-decode-token-threshold`: the minimum number of tokens in a decode-only batch required to enable DBO for that batch
- `--dbo-prefill-token-threshold`: the minimum number of tokens in a batch containing at least one prefill required to enable DBO for that batch
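The gating behavior of these two thresholds can be sketched as follows. This is an illustrative helper, not vLLM's actual code; the function name and signature are assumptions based on the flag descriptions above.

```python
def dbo_enabled_for_batch(
    num_tokens: int,
    has_prefill: bool,
    decode_threshold: int,
    prefill_threshold: int,
) -> bool:
    """Decide whether DBO should be enabled for a single batch.

    A batch containing at least one prefill is gated by the prefill
    threshold; a decode-only batch is gated by the decode threshold.
    (Hypothetical helper illustrating the two CLI knobs.)
    """
    threshold = prefill_threshold if has_prefill else decode_threshold
    return num_tokens >= threshold
```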
Currently, DBO is only supported with DeepEP, so DeepEP must be installed and the `VLLM_ALL2ALL_BACKEND` environment variable must be set to `deepep_low_latency` if your workload is primarily decode requests, or `deepep_high_throughput` if your workload is primarily prefill requests.

Below is an example command that spins up a two-DP-rank server with expert parallelism and DBO enabled:

```bash
VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve \
    --model="deepseek-ai/DeepSeek-V2-Lite" \
    --trust-remote-code \
    --data-parallel-size 2 \
    --enable-expert-parallel \
    --enable-dbo
```

Note that there must be at least two GPUs visible in `CUDA_VISIBLE_DEVICES`.
## DBO Components

- `GPUModelRunner`
- `UBatchWrapper`
- `UBatchContext`
### GPU Model Runner

The batch is split into microbatches by the `GPUModelRunner` class. This is accomplished in two steps. First, all DP ranks coordinate to determine whether microbatching will be applied; microbatching must be uniform across all DP ranks, so if it is not feasible for any rank, it is disabled for all ranks. If all DP ranks are going to microbatch, the total number of tokens is padded up to the maximum number of tokens among all ranks. If any rank would end up with an empty second microbatch after the padding is applied, microbatching is aborted and no rank will microbatch. Once microbatching has been initiated by all ranks, the second step is performed: the `CommonAttentionMetadata` is sliced in half by the `GPUModelRunner` so that there is one attention metadata per microbatch.
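The two-step decision above can be sketched in plain Python. This is a hedged illustration of the logic described in this section, assuming hypothetical names (`decide_microbatch`, `split_batch`); it is not vLLM's actual implementation, which coordinates across ranks via collective communication rather than a local list.

```python
def decide_microbatch(tokens_per_rank: list[int], threshold: int) -> tuple[bool, int]:
    """Return (should_microbatch, padded_num_tokens).

    Microbatching must be uniform across DP ranks: if any rank is below
    the token threshold, it is disabled for all ranks. Otherwise every
    rank pads up to the max token count across ranks.
    """
    # Step 1: every rank must independently be able to microbatch.
    if any(n < threshold for n in tokens_per_rank):
        return False, max(tokens_per_rank)
    padded = max(tokens_per_rank)
    # Abort if the padded batch cannot be split into two non-empty halves
    # (i.e. the second microbatch would be empty).
    if padded < 2:
        return False, padded
    return True, padded


def split_batch(num_tokens: int) -> tuple[slice, slice]:
    """Slice a padded batch into two microbatches: first and second half."""
    mid = num_tokens // 2
    return slice(0, mid), slice(mid, num_tokens)
```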
### UBatchWrapper

`gpu_ubatch_wrapper`

The `UBatchWrapper` class is a model wrapper that is responsible for all of the thread, `UBatchContext`, and CUDA graph management for DBO. It is designed to be relatively transparent to the GPU Model Runner.

The implementation runs the model twice, once for each microbatch. Each model invocation occurs within a UBatch thread. These threads are launched in parallel and are synchronized using the `UBatchContext`. Each thread is provided with a sliced version of the attention metadata that it uses to run its half of the batch.

CUDA graphs for DBO are entirely managed by the `UBatchWrapper`. Because of this, DBO only supports running with full CUDA graphs. However, once a DBO CUDA graph has been captured, it can be replayed without any multithreading or CPU synchronization.
#### Interfaces

The `__init__` method takes in the model, `VllmConfig`, `CUDAGraphMode`, and device.

The `forward` method exclusively takes in model arguments. It runs the model with DBO if a `ubatch_slices` object is present in the `forward_context`; otherwise, the model is run without DBO.
### UBatchContext

`ubatch_context`

The `UBatchContext` class is a `ForwardContext` wrapper that is used by the `UBatchWrapper` class to synchronize the two UBatch threads. It should only be instantiated using `make_ubatch_contexts`.

When one of the UBatch threads reaches a `dbo_yield` call, it pauses and starts the other thread, which runs until it reaches the same `dbo_yield` call. This "ping-pong" dynamic continues, with the threads swapping at each `dbo_yield` call, until the model's execution is complete.
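The ping-pong dynamic can be sketched with two CPU threads handing a "turn" back and forth. This is a minimal, self-contained illustration using `threading.Event`; the class and phase names are hypothetical, and vLLM's actual `UBatchContext` additionally coordinates CUDA streams and events.

```python
import threading

class UBatchPingPong:
    """Two threads alternate: each dbo_yield() wakes the peer and sleeps."""

    def __init__(self) -> None:
        self.turn = [threading.Event(), threading.Event()]
        self.turn[0].set()  # thread 0 gets the first turn

    def dbo_yield(self, my_id: int) -> None:
        peer = 1 - my_id
        self.turn[my_id].clear()  # pause this thread...
        self.turn[peer].set()     # ...wake the other UBatch thread...
        self.turn[my_id].wait()   # ...and sleep until woken back

def run_ubatch(pp: UBatchPingPong, my_id: int, phases: list[str], log: list) -> None:
    pp.turn[my_id].wait()  # wait for this thread's first turn
    for phase in phases:
        log.append((my_id, phase))  # e.g. launch compute or wait on comm here
        pp.dbo_yield(my_id)
    pp.turn[1 - my_id].set()  # done: release the peer so it can finish

log: list = []
pp = UBatchPingPong()
threads = [
    threading.Thread(target=run_ubatch, args=(pp, i, ["dispatch", "mlp", "combine"], log))
    for i in (0, 1)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# The log shows the phases strictly alternating between the two threads.
```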
The current implementation has all `dbo_yield` and `dbo_maybe_run_recv_hook` calls in the `FusedMoEModularKernel.forward` method.
#### Interfaces

The `make_ubatch_contexts` function initializes two `UBatchContext`s, one for each UBatch thread. It takes two CUDA streams, the preexisting `ForwardContext`s, and a CPU thread barrier. This function should be used exclusively to instantiate `UBatchContext`s; it handles all of the event initialization.
The `dbo_register_recv_hook` method registers a callback that can be returned by the `FusedMoEPrepareAndFinalize` class in the other UBatch thread's `UBatchContext`. The callback is run when the other thread calls `dbo_maybe_run_recv_hook`. This is typically used to wait on an all-to-all kernel.

The `dbo_maybe_run_recv_hook` method runs the callback set by `dbo_register_recv_hook`, if one exists.
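The register/maybe-run pair amounts to a single-slot callback handoff between the two threads. The sketch below is a hypothetical simplification (class name and single-slot design are assumptions); in vLLM the hook typically wraps a wait on an in-flight all-to-all kernel.

```python
class RecvHookSlot:
    """Holds a callback registered by one thread and run later by the other.

    Hypothetical sketch of the dbo_register_recv_hook /
    dbo_maybe_run_recv_hook pattern described above.
    """

    def __init__(self) -> None:
        self._hook = None

    def dbo_register_recv_hook(self, hook) -> None:
        # e.g. hook = lambda: all_to_all_handle.wait()
        self._hook = hook

    def dbo_maybe_run_recv_hook(self) -> None:
        # Run the hook at most once, if one was registered.
        if self._hook is not None:
            hook, self._hook = self._hook, None
            hook()
```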
The `dbo_yield` method puts the current thread to sleep and wakes up the other UBatch thread.