vis4d.common.distributed

This module contains utilities for multiprocess parallelism.

Functions

all_gather_object_cpu(data[, tmpdir, ...])

Share arbitrary picklable data via file system caching.

all_gather_object_gpu(data[, ...])

Run pl_module.all_gather on arbitrary picklable data.

all_reduce_dict(py_dict[, reduce_op, to_float])

Apply an all-reduce operation to a Python dict object.

broadcast(obj[, src])

Broadcast an object from a source to all processes.

create_tmpdir(rank[, tmpdir, use_system_tmp])

Create and distribute a temporary directory across all processes.

distributed_available()

Check if torch.distributed is available.

get_local_rank()

Get the local rank of the current process in torch.distributed.

get_rank()

Get the global rank of the current process in torch.distributed.

get_world_size()

Get the world size (number of processes) of torch.distributed.

is_module_wrapper(module)

Checks recursively if a module is wrapped.

obj2tensor(pyobj[, device])

Serialize picklable python object to tensor.

pad_to_largest_tensor(tensor)

Pad a tensor to the largest size among the tensors across all processes.

rank_zero_only(func)

Allows the decorated function to be called only on global rank 0.

reduce_mean(tensor)

Obtain the mean of a tensor across GPUs.

serialize_to_tensor(data)

Serialize arbitrary picklable data to a Tensor.

synchronize()

Sync (barrier) among all processes when using distributed training.

tensor2obj(tensor)

Deserialize tensor to picklable python object.

get_world_size()[source]

Get the world size (number of processes) of torch.distributed.

Returns:

The world size.

Return type:

int

get_rank()[source]

Get the global rank of the current process in torch.distributed.

Returns:

The global rank.

Return type:

int

get_local_rank()[source]

Get the local rank of the current process in torch.distributed.

Returns:

The local rank.

Return type:

int
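As a rough illustration of how the global rank, local rank, and world size relate, the sketch below reads them from the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that launchers such as torchrun set. This is an assumption made for illustration: the functions documented above may query torch.distributed directly instead. The helper name read_dist_env and the example values are hypothetical.

```python
import os


def read_dist_env(env=None):
    """Return (rank, local_rank, world_size) from launcher-style variables.

    Hypothetical helper. Falls back to a single-process setup (0, 0, 1)
    when a variable is unset.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


# Made-up values: second process on the second of two 4-GPU nodes.
example_env = {"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"}
rank, local_rank, world_size = read_dist_env(example_env)
print(rank, local_rank, world_size)
```

The global rank is unique across all nodes, while the local rank only distinguishes processes on the same node, which is why it is usually the value used to pick a GPU.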

distributed_available()[source]

Check if torch.distributed is available.

Returns:

Whether torch.distributed is available.

Return type:

bool

synchronize()[source]

Sync (barrier) among all processes when using distributed training.

Return type:

None

broadcast(obj, src=0)[source]

Broadcast an object from a source to all processes.

Return type:

Any

serialize_to_tensor(data)[source]

Serialize arbitrary picklable data to a Tensor.

Parameters:

data (Any) – The data to serialize.

Returns:

The serialized data as a Tensor.

Return type:

Tensor

Raises:

AssertionError – If the backend of torch.distributed is not gloo or nccl.

rank_zero_only(func)[source]

Allows the decorated function to be called only on global rank 0.

Parameters:

func (GenericFunc) – The function to decorate.

Returns:

The decorated function.

Return type:

GenericFunc
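The decorator pattern described here can be sketched without torch. The version below is a stand-alone approximation, not the library's implementation: it assumes the rank is available in the RANK environment variable, and the name rank_zero_only_sketch is hypothetical.

```python
import functools
import os


def rank_zero_only_sketch(func):
    """Call func only when the RANK environment variable is 0 (or unset)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if int(os.environ.get("RANK", 0)) == 0:
            return func(*args, **kwargs)
        return None  # every other rank skips the call

    return wrapper


@rank_zero_only_sketch
def log_message(msg):
    print(msg)
    return msg


os.environ["RANK"] = "0"
on_rank_zero = log_message("saving checkpoint")  # runs and prints

os.environ["RANK"] = "1"
on_rank_one = log_message("saving checkpoint")  # skipped, returns None
```

Because non-zero ranks receive None, a function wrapped this way is best suited to side effects such as logging or writing checkpoints rather than to values every rank needs.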

pad_to_largest_tensor(tensor)[source]

Pad a tensor to the largest size among the tensors across all processes.

Parameters:

tensor (Tensor) – tensor to be padded.

Returns:

A tuple of the size of the tensor on each rank (list[int]) and the padded tensor that has the max size (Tensor).

Return type:

tuple[list[int], Tensor]
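Padding is useful because gather-style collectives typically expect equally sized buffers on every rank. The sketch below shows the idea with plain byte strings standing in for tensors; pad_to_largest is a hypothetical name, and the real function operates on torch tensors across processes rather than on a local list.

```python
def pad_to_largest(buffers):
    """Pad each byte buffer with zeros up to the longest length.

    Returns the original sizes (one per simulated rank) and the padded
    buffers.
    """
    sizes = [len(buf) for buf in buffers]
    max_size = max(sizes)
    padded = [buf + bytes(max_size - len(buf)) for buf in buffers]
    return sizes, padded


# One buffer per simulated rank.
per_rank = [b"ab", b"abcde", b"a"]
sizes, padded = pad_to_largest(per_rank)

# After gathering, each buffer is trimmed back to its recorded size.
restored = [buf[:size] for buf, size in zip(padded, sizes)]
```

Keeping the per-rank sizes alongside the padded data is what allows the receiver to strip the padding again before deserializing.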

all_gather_object_gpu(data, rank_zero_return_only=True)[source]

Run pl_module.all_gather on arbitrary picklable data.

Parameters:
  • data (Any) – any picklable object

  • rank_zero_return_only (bool) – whether results should be returned only on rank 0

Returns:

list of data gathered from each process

Return type:

list[Any]

create_tmpdir(rank, tmpdir=None, use_system_tmp=True)[source]

Create and distribute a temporary directory across all processes.

Return type:

str

all_gather_object_cpu(data, tmpdir=None, rank_zero_return_only=True, use_system_tmp=False)[source]

Share arbitrary picklable data via file system caching.

Parameters:
  • data (Any) – any picklable object.

  • tmpdir (Optional[str]) – Save path for temporary files. If None, safely create tmpdir.

  • rank_zero_return_only (bool) – Whether results should be returned only on rank 0.

  • use_system_tmp (bool) – Whether to use the system tmpdir.

Returns:

list of data gathered from each process.

Return type:

list[Any]
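The file-system approach can be pictured as each rank pickling its data into a shared directory, after which the parts are read back in rank order. The following is a single-process simulation under that assumption; the helper names and the part_{rank}.pkl file naming are hypothetical and not taken from the library.

```python
import os
import pickle
import tempfile


def dump_part(data, rank, tmpdir):
    """Write this rank's data to its own file in the shared directory."""
    with open(os.path.join(tmpdir, f"part_{rank}.pkl"), "wb") as f:
        pickle.dump(data, f)


def load_parts(world_size, tmpdir):
    """Read every rank's file back in rank order."""
    parts = []
    for rank in range(world_size):
        with open(os.path.join(tmpdir, f"part_{rank}.pkl"), "rb") as f:
            parts.append(pickle.load(f))
    return parts


world_size = 3
with tempfile.TemporaryDirectory() as tmpdir:
    # Simulate three ranks in one process. Real ranks would each call
    # dump_part once and wait at a barrier before anyone reads.
    for rank in range(world_size):
        dump_part({"rank": rank, "preds": [rank] * 2}, rank, tmpdir)
    gathered = load_parts(world_size, tmpdir)
```

In a real multi-process run the directory must be visible to every rank, which is presumably why the choice between a user-supplied tmpdir and the system temporary directory is exposed as a parameter.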

reduce_mean(tensor)[source]

Obtain the mean of a tensor across GPUs.

Return type:

Tensor

obj2tensor(pyobj, device=device(type='cuda'))[source]

Serialize picklable python object to tensor.

Parameters:
  • pyobj (Any) – Any picklable python object.

  • device (torch.device) – Device to put on. Defaults to “cuda”.

Return type:

Tensor

tensor2obj(tensor)[source]

Deserialize tensor to picklable python object.

Parameters:

tensor (Tensor) – Tensor to be deserialized.

Return type:

Any
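The obj2tensor and tensor2obj pair forms a round trip. A minimal way to picture it, assuming the object is pickled into a byte buffer, is shown below with a plain list of byte values standing in for a uint8 tensor so the example runs without torch. The function names are hypothetical.

```python
import pickle


def obj_to_bytes(pyobj):
    """Pickle an object and expose it as a list of uint8 values."""
    return list(pickle.dumps(pyobj))


def bytes_to_obj(values):
    """Rebuild the object from the list of uint8 values."""
    return pickle.loads(bytes(values))


payload = {"boxes": [[0.0, 1.0, 2.0, 3.0]], "label": "car"}
encoded = obj_to_bytes(payload)  # what a uint8 tensor would hold
decoded = bytes_to_obj(encoded)  # equal to the original payload
```

As with any pickle-based scheme, the receiving side should only deserialize data from trusted processes.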

all_reduce_dict(py_dict, reduce_op='sum', to_float=True)[source]

Apply an all-reduce operation to a Python dict object.

The code is modified from https://github.com/Megvii-BaseDetection/YOLOX/blob/main/yolox/utils/allreduce_norm.py.

NOTE: Make sure that py_dict has the same keys on every rank and that the corresponding values have the same shape. Currently, only the NCCL backend is supported.

Parameters:
  • py_dict (DictStrAny) – Dict to apply the all-reduce op to.

  • reduce_op (str) – Operator, could be ‘sum’ or ‘mean’. Default: ‘sum’.

  • to_float (bool) – Whether to convert all values of dict to float. Default: True.

Returns:

reduced python dict object.

Return type:

DictStrAny
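To make the 'sum' and 'mean' semantics concrete, the sketch below reduces a list of dicts, one per simulated rank, in a single process. It only illustrates the arithmetic and the same-keys requirement; reduce_dicts is a hypothetical name, scalar values stand in for tensors, and no actual communication takes place.

```python
def reduce_dicts(per_rank_dicts, reduce_op="sum"):
    """Combine dicts with identical keys, one dict per simulated rank."""
    if reduce_op not in ("sum", "mean"):
        raise ValueError(f"unsupported reduce_op: {reduce_op}")
    keys = sorted(per_rank_dicts[0])
    for d in per_rank_dicts:
        if sorted(d) != keys:
            raise ValueError("all ranks must provide the same keys")
    world_size = len(per_rank_dicts)
    reduced = {}
    for key in keys:
        # Values are converted to float, mirroring to_float=True.
        total = sum(float(d[key]) for d in per_rank_dicts)
        reduced[key] = total / world_size if reduce_op == "mean" else total
    return reduced


per_rank = [{"loss": 1.0, "acc": 0.5}, {"loss": 3.0, "acc": 1.0}]
summed = reduce_dicts(per_rank, "sum")
averaged = reduce_dicts(per_rank, "mean")
```

With two ranks, 'mean' is simply the 'sum' result divided by the world size of 2.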

is_module_wrapper(module)[source]

Checks recursively if a module is wrapped.

Two module types are regarded as wrappers: DataParallel and DistributedDataParallel.

Parameters:

module (nn.Module) – The module to be checked.

Returns:

True if the input module is a module wrapper.

Return type:

bool