vllm.forward_context ¶
batchsize_logging_interval module-attribute ¶
batchsize_logging_interval: float = (
VLLM_LOG_BATCHSIZE_INTERVAL
)
BatchDescriptor ¶
Bases: NamedTuple
Batch descriptor for cudagraph dispatching. We should keep the number of items as small as possible so that it properly and uniquely describes the padded batch for cudagraph.
Source code in vllm/forward_context.py
non_uniform property ¶
non_uniform: BatchDescriptor
Return a non-uniform version of the current batch descriptor.
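The source is not reproduced here, but the dispatch-key idea can be illustrated with a small stand-alone sketch. The `num_tokens` and `uniform_decode` fields below are assumptions for illustration and may not match the real field names:

```python
from typing import NamedTuple


class MyBatchDescriptor(NamedTuple):
    """Hypothetical stand-in for BatchDescriptor with two illustrative fields."""
    num_tokens: int
    uniform_decode: bool = False

    @property
    def non_uniform(self) -> "MyBatchDescriptor":
        # Drop the uniform-decode flag so the descriptor matches the
        # generic (non-uniform) CUDA graph captured for this padded size.
        return MyBatchDescriptor(self.num_tokens, uniform_decode=False)


# Because NamedTuples hash by value, descriptors can key a cache of captured
# CUDA graphs: identical padded batches dispatch to the same graph.
graph_cache: dict[MyBatchDescriptor, object] = {}
desc = MyBatchDescriptor(num_tokens=256, uniform_decode=True)
graph_cache.setdefault(desc, object())  # capture once per distinct descriptor
assert desc.non_uniform == MyBatchDescriptor(256)
```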
DPMetadata dataclass ¶
Source code in vllm/forward_context.py
__init__ ¶
__init__(
max_tokens_across_dp_cpu: Tensor,
num_tokens_across_dp_cpu: Tensor,
local_sizes: Optional[list[int]] = None,
) -> None
chunked_sizes ¶
Context manager to compute and temporarily set the per-rank local token sizes for a specific chunk during chunked forward execution.
This is necessary to ensure each DP (data parallel) rank processes its designated portion of tokens in lockstep with others, even when the token counts are uneven or some ranks have completed their input early.
For chunked execution, we break up the total tokens on each rank into multiple chunks (of at most max_chunk_size_per_rank), and for a given chunk_idx, this context manager sets self.local_sizes to the number of tokens to process in that chunk on each rank. self.local_sizes is only valid inside the context. (See the illustrative sketch below the parameter table.)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sequence_parallel_size | int | When Attn is TP and MoE layers are EP, we use SP between the layers to avoid redundant ops. We need this value to compute the chunked sizes. | required |
max_chunk_size_per_rank | int | The max number of tokens each rank is allowed to process in this chunk. | required |
chunk_idx | int | The index of the chunk to compute sizes for. | required |
Source code in vllm/forward_context.py
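An illustrative sketch of the per-chunk size computation this context manager relies on. `chunk_sizes_for` is a hypothetical helper, not the module's implementation, and it ignores sequence_parallel_size for simplicity:

```python
import torch


def chunk_sizes_for(num_tokens_across_dp: torch.Tensor,
                    max_chunk_size_per_rank: int,
                    chunk_idx: int) -> list[int]:
    """Hypothetical helper: tokens each DP rank processes in chunk `chunk_idx`.

    Ranks that have already consumed all their tokens contribute 0 but still
    participate, so collectives stay in lockstep across ranks.
    """
    start = chunk_idx * max_chunk_size_per_rank
    remaining = (num_tokens_across_dp - start).clamp(min=0)
    return remaining.clamp(max=max_chunk_size_per_rank).tolist()


# Two DP ranks with uneven token counts, chunks of at most 4 tokens:
counts = torch.tensor([10, 3])
assert chunk_sizes_for(counts, 4, 0) == [4, 3]
assert chunk_sizes_for(counts, 4, 1) == [4, 0]  # rank 1 is done, still joins
assert chunk_sizes_for(counts, 4, 2) == [2, 0]
```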
get_chunk_sizes_across_dp_rank ¶
make staticmethod ¶
make(
parallel_config: ParallelConfig,
num_tokens: int,
num_tokens_across_dp_cpu: Tensor,
) -> DPMetadata
Source code in vllm/forward_context.py
sp_local_sizes ¶
sp_local_sizes(sequence_parallel_size: int)
Context manager for setting self.local_sizes. Same as self.chunked_sizes but without any chunking.
Source code in vllm/forward_context.py
ForwardContext dataclass ¶
Source code in vllm/forward_context.py
attn_metadata instance-attribute ¶
attn_metadata: Union[
AttentionMetadata,
dict[str, AttentionMetadata],
list[dict[str, AttentionMetadata]],
]
batch_descriptor class-attribute instance-attribute ¶
batch_descriptor: Optional[BatchDescriptor] = None
cudagraph_runtime_mode class-attribute instance-attribute ¶
cudagraph_runtime_mode: CUDAGraphMode = NONE
no_compile_layers instance-attribute ¶
Type AttentionMetadata for v0; type Dict[str, AttentionMetadata] for v1, a map from the layer_name of each attention layer to its attention metadata; type List[Dict[str, AttentionMetadata]] for DBO, a list of size two, one for each microbatch. Set dynamically for each forward pass.
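The attn_metadata field can therefore take three shapes. A hypothetical sketch with placeholder layer names and metadata objects (the SimpleNamespace fields are illustrative, not the real AttentionMetadata API):

```python
from types import SimpleNamespace
from typing import Any

# Placeholder standing in for an AttentionMetadata object.
meta: Any = SimpleNamespace(num_decode_tokens=8)

# v0: a single metadata object shared by every attention layer.
shared: Any = meta

# v1: one metadata object per attention layer, keyed by layer_name.
per_layer: dict[str, Any] = {
    "model.layers.0.self_attn.attn": meta,
    "model.layers.1.self_attn.attn": meta,
}

# DBO: a list of two per-layer dicts, one per microbatch.
per_microbatch: list[dict[str, Any]] = [per_layer, per_layer]
```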
__init__ ¶
__init__(
no_compile_layers: dict[str, Any],
attn_metadata: Union[
AttentionMetadata,
dict[str, AttentionMetadata],
list[dict[str, AttentionMetadata]],
],
virtual_engine: int,
dp_metadata: Optional[DPMetadata] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
ubatch_slices: Optional[UBatchSlices] = None,
) -> None
_compute_chunked_local_num_tokens ¶
_compute_chunked_local_num_tokens(
num_tokens_across_dp_cpu: Tensor,
sequence_parallel_size: int,
max_num_tokens: int,
chunk_idx: int,
) -> list[int]
Source code in vllm/forward_context.py
_compute_sp_num_tokens ¶
_compute_sp_num_tokens(
num_tokens_across_dp_cpu: Tensor,
sequence_parallel_size: int,
) -> list[int]
Source code in vllm/forward_context.py
create_forward_context ¶
create_forward_context(
attn_metadata: Any,
vllm_config: VllmConfig,
virtual_engine: int = 0,
dp_metadata: Optional[DPMetadata] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
ubatch_slices: Optional[UBatchSlices] = None,
)
Source code in vllm/forward_context.py
get_forward_context ¶
get_forward_context() -> ForwardContext
Get the current forward context.
Source code in vllm/forward_context.py
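A sketch of the usual access pattern from inside a layer's forward pass. `my_layer_forward` and the per-layer dict lookup are illustrative, and this assumes the model runner has already set a forward context:

```python
from vllm.forward_context import get_forward_context


def my_layer_forward(layer_name: str, hidden_states):
    # Retrieve the context set by the model runner for this forward pass.
    ctx = get_forward_context()
    attn_metadata = ctx.attn_metadata
    # In v1 the metadata is a per-layer dict keyed by layer_name.
    if isinstance(attn_metadata, dict):
        attn_metadata = attn_metadata.get(layer_name)
    # ... use attn_metadata to drive the attention kernel ...
    return hidden_states
```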
override_forward_context ¶
override_forward_context(
forward_context: Optional[ForwardContext],
)
A context manager that overrides the current forward context. This is used to override the forward context for a specific forward pass.
Source code in vllm/forward_context.py
set_forward_context ¶
set_forward_context(
attn_metadata: Any,
vllm_config: VllmConfig,
virtual_engine: int = 0,
num_tokens: Optional[int] = None,
num_tokens_across_dp: Optional[Tensor] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
ubatch_slices: Optional[UBatchSlices] = None,
)
A context manager that stores the current forward context, which can include attention metadata, etc. Here we can inject common logic for every model forward pass.
Source code in vllm/forward_context.py
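A minimal usage sketch, assuming the caller already has a VllmConfig, a model, and attention metadata in hand. `run_forward` is a hypothetical wrapper, not part of this module:

```python
import torch

from vllm.config import VllmConfig
from vllm.forward_context import set_forward_context


def run_forward(model, input_ids: torch.Tensor, positions: torch.Tensor,
                attn_metadata, vllm_config: VllmConfig) -> torch.Tensor:
    """Hypothetical runner step: make per-pass state visible to the layers."""
    with set_forward_context(attn_metadata, vllm_config,
                             num_tokens=input_ids.numel()):
        # While the context is active, layers may call get_forward_context()
        # to read attn_metadata, dp_metadata, batch_descriptor, etc.
        return model(input_ids, positions)
```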