vllm.distributed.device_communicators.all_reduce_utils
CUSTOM_ALL_REDUCE_MAX_SIZES module-attribute
CUSTOM_ALL_REDUCE_MAX_SIZES = {
    "9.0": {
        2: 64 * MiB,
        4: 32 * MiB,
        6: MiB // 2,
        8: MiB // 4,
    },
    "10.0": {
        2: 2 * MiB,
        4: 2 * MiB,
        6: 1 * MiB,
        8: 1 * MiB,
    },
}
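The table above can be read as a two-level lookup: an outer string key, then a world size. The sketch below shows one way to query it. It assumes that `MiB` is `1024 * 1024` bytes and that the string keys are device compute capabilities; neither is stated on this page. The helper `max_custom_all_reduce_size` is hypothetical and not part of the module.

```python
from typing import Optional

MiB = 1024 * 1024  # assumption: the module defines MiB as 2**20 bytes

CUSTOM_ALL_REDUCE_MAX_SIZES = {
    "9.0": {2: 64 * MiB, 4: 32 * MiB, 6: MiB // 2, 8: MiB // 4},
    "10.0": {2: 2 * MiB, 4: 2 * MiB, 6: 1 * MiB, 8: 1 * MiB},
}


def max_custom_all_reduce_size(capability: str, world_size: int) -> Optional[int]:
    """Hypothetical helper: return the size limit in bytes, or None if
    the table has no entry for this capability / world size."""
    return CUSTOM_ALL_REDUCE_MAX_SIZES.get(capability, {}).get(world_size)


limit = max_custom_all_reduce_size("9.0", 8)
print(limit)  # 262144, i.e. MiB // 4
```

Returning `None` for unknown keys leaves the fallback policy to the caller, which avoids guessing a default the module may not use.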
NCCL_SYMM_MEM_ALL_REDUCE_CONFIG module-attribute
NCCL_SYMM_MEM_ALL_REDUCE_CONFIG: dict[str, Any] = {
    "min_world_size": 4,
    "thresholds": {4: 2 * MiB, 8: 1 * MiB},
    "always_use_above_world_size": 8,
}
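This page does not describe how the configuration is consumed, so the following is only one plausible reading of the key names: skip the path below `min_world_size`, use it when the message reaches the per-world-size threshold, and always use it above `always_use_above_world_size`. The function `should_use_nccl_symm_mem`, the `>=` comparison, and `MiB = 1024 * 1024` are all assumptions made for illustration.

```python
from typing import Any

MiB = 1024 * 1024  # assumption: MiB is 2**20 bytes

NCCL_SYMM_MEM_ALL_REDUCE_CONFIG: dict[str, Any] = {
    "min_world_size": 4,
    "thresholds": {4: 2 * MiB, 8: 1 * MiB},
    "always_use_above_world_size": 8,
}


def should_use_nccl_symm_mem(world_size: int, nbytes: int) -> bool:
    """Hypothetical decision rule inferred from the key names only."""
    cfg = NCCL_SYMM_MEM_ALL_REDUCE_CONFIG
    if world_size < cfg["min_world_size"]:
        return False  # too few ranks
    threshold = cfg["thresholds"].get(world_size)
    if threshold is not None and nbytes >= threshold:
        return True  # message is large enough for this world size
    # No threshold matched: fall back to the world-size cutoff.
    return world_size > cfg["always_use_above_world_size"]
```

Under this reading, a 1 MiB tensor on 4 ranks would not take the path, while the same tensor on 8 ranks would.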
SYMM_MEM_ALL_REDUCE_MAX_SIZES module-attribute
SYMM_MEM_ALL_REDUCE_MAX_SIZES = {
    "9.0": {
        2: 64 * MiB,
        4: 32 * MiB,
        6: 64 * MiB,
        8: 64 * MiB,
    },
    "10.0": {
        2: 8 * MiB,
        4: 32 * MiB,
        6: 128 * MiB,
        8: 128 * MiB,
    },
}
can_actually_p2p
Usually, whether P2P access is enabled can be checked with torch.cuda.can_device_access_peer(src, tgt). However, the driver is sometimes broken, and torch.cuda.can_device_access_peer(src, tgt) returns True even when P2P access is not actually possible. See https://github.com/vllm-project/vllm/issues/2728 and https://forums.developer.nvidia.com/t/direct-gpu-gpu-communication-does-not-seem-to-work-properly/283264/10. Therefore, we have to perform a real P2P access to check whether it actually works.
Note on P2P and CUDA IPC: usually, one process uses one GPU:
GPU src --> cuda context src --> tensor src --> process src
We need to combine P2P and CUDA IPC, so that:
GPU src --> cuda context src --> tensor src --> process src
                                  |shared|
GPU tgt --> cuda context tgt --> tensor tgt --> process tgt
That is to say, process src creates a tensor in GPU src and passes an IPC handle to process tgt, and process tgt accesses the tensor in GPU tgt. Any operation on the tensor in process tgt is reflected in the tensor in process src, because they are the same memory segment. It is important to note that process tgt accesses the tensor in GPU tgt, not GPU src. That is why we need P2P access.
The most time-consuming part is process creation. To avoid creating processes for every pair of GPUs, we use batched testing: we create only two processes and use them to test all pairs of GPUs in one batch. The trick is to reset the device after each test, an operation that is not available in PyTorch.
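The batched handshake described above can be sketched without any GPU. This is a simulation under stated assumptions, not the module's implementation: threads stand in for the two processes, a `bytearray` stands in for the device allocation and its IPC handle, and `link_works` is a hypothetical predicate that models a broken P2P link by sending the consumer's writes to a private copy. The names `sim_producer`, `sim_consumer`, and `check_all_pairs` are illustrative only.

```python
import queue
import threading
from itertools import product

BUF_SIZE = 16  # bytes in the simulated allocation


def sim_producer(batch_src, producer_queue, consumer_queue, result_queue):
    for src in batch_src:
        buf = bytearray([1] * BUF_SIZE)    # "allocate" and fill with 1s
        producer_queue.put((src, buf))     # stand-in for sending an IPC handle
        opened = consumer_queue.get()      # could the consumer open it?
        if opened:
            producer_queue.put(0)          # the two queues act as a barrier
            consumer_queue.get()
            opened = all(b == 2 for b in buf)  # did the remote write land?
        result_queue.put(opened)


def sim_consumer(batch_tgt, producer_queue, consumer_queue, result_queue,
                 link_works):
    for tgt in batch_tgt:
        src, buf = producer_queue.get()
        if not link_works(src, tgt):
            buf = bytearray(buf)           # broken link: write to a private copy
        consumer_queue.put(True)           # opening "succeeds" either way
        buf[:] = bytes([2] * BUF_SIZE)     # overwrite the buffer with 2s
        producer_queue.get()               # barrier with the producer
        consumer_queue.put(0)
        result_queue.put(all(b == 2 for b in buf))


def check_all_pairs(num_gpus, link_works):
    """Test every (src, tgt) pair with only two workers."""
    pairs = list(product(range(num_gpus), repeat=2))
    pq, cq, rq = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=sim_producer,
                         args=([s for s, _ in pairs], pq, cq, rq)),
        threading.Thread(target=sim_consumer,
                         args=([t for _, t in pairs], pq, cq, rq, link_works)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Each pair yields two results (one per side); both must be True.
    result = {}
    for pair in pairs:
        first, second = rq.get(), rq.get()
        result[pair] = first and second
    return result
```

For example, `check_all_pairs(2, lambda s, t: (s, t) != (0, 1))` marks only the pair `(0, 1)` as failing: the consumer believes its write succeeded, but the producer still sees the original bytes, which is the situation a capability query alone cannot detect.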
Source code in vllm/distributed/device_communicators/all_reduce_utils.py
consumer
consumer(
    batch_tgt: Sequence[int],
    producer_queue,
    consumer_queue,
    result_queue,
    cuda_visible_devices: Optional[str] = None,
)
gpu_p2p_access_check
Check if GPU src can access GPU tgt.
producer
producer(
    batch_src: Sequence[int],
    producer_queue,
    consumer_queue,
    result_queue,
    cuda_visible_devices: Optional[str] = None,
)