vis4d.op.layer.positional_encoding

Positional encodings for transformers.

Modified from mmdetection (https://github.com/open-mmlab/mmdetection).

Classes

`LearnedPositionalEncoding`(num_feats[, ...])	Position embedding with learnable embedding weights.
`SinePositionalEncoding`(num_feats[, ...])	Position encoding with sine and cosine functions.

class SinePositionalEncoding(num_feats, temperature=10000, normalize=False, scale=6.283185307179586, eps=1e-06, offset=0.0)

Position encoding with sine and cosine functions.

See the DETR paper, "End-to-End Object Detection with Transformers", for details.

__init__(num_feats, temperature=10000, normalize=False, scale=6.283185307179586, eps=1e-06, offset=0.0)

Initialization for SinePositionalEncoding.

Parameters:

num_feats (int) – The feature dimension for each position along the x-axis or y-axis. Note that the final returned dimension for each position is twice this value.
temperature (int, optional) – The temperature used for scaling the position embedding. Defaults to 10000.
normalize (bool, optional) – Whether to normalize the position embedding. Defaults to False.
scale (float, optional) – A scale factor applied to the position embedding. It is only used when normalize is True. Defaults to 2*pi.
eps (float, optional) – A value added to the denominator for numerical stability. Defaults to 1e-6.
offset (float, optional) – An offset added to the position embedding before normalization. It is only used when normalize is True. Defaults to 0.
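The parameters above interact as follows. This is a sketch assuming the formulation of DETR and mmdetection, from which this module is adapted, not a transcription of the vis4d source. Here d = num_feats (assumed even), T = temperature, s = scale, \epsilon = eps, o = offset, y is the 1-based count of valid pixels from the top of the image down to the current row, and y_last is that count at the last row; PE_x is defined analogously along the width axis.

```latex
\begin{aligned}
\hat{y} &= s \cdot \frac{y + o}{y_{\text{last}} + \epsilon}
  \quad \text{if normalize is True, else } \hat{y} = y, \\
\omega_{2k} &= \omega_{2k+1} = T^{2k/d}, \qquad k = 0, \dots, d/2 - 1, \\
\mathrm{PE}_y[2k] &= \sin\left(\hat{y} / \omega_{2k}\right), \qquad
\mathrm{PE}_y[2k+1] = \cos\left(\hat{y} / \omega_{2k+1}\right), \\
\mathrm{pos} &= \left[\, \mathrm{PE}_y \,;\, \mathrm{PE}_x \,\right] \in \mathbb{R}^{2d}.
\end{aligned}
```

Concatenating the y and x halves is why each position ends up with 2 * num_feats channels.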

forward(mask, inputs=None)

Forward function for SinePositionalEncoding.

Parameters:

mask (Tensor | None) – ByteTensor mask. Non-zero values represent ignored (padded) positions, while zero values represent valid positions in the image. Shape [bs, h, w]. If None, the input is treated as a single image or as a batch with no padding.
inputs (Tensor | None) – The input tensor. If mask is None, this tensor is required to infer the shape of the input image.
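The mask convention (non-zero means ignored) is easy to get backwards. A minimal sketch of building such a mask, assuming the common DETR-style setup in which each image is pasted at the top-left corner of a zero-padded batch canvas; `make_padding_mask` is a hypothetical helper, not part of vis4d:

```python
import torch


def make_padding_mask(image_sizes, padded_hw):
    """Build a [bs, h, w] ByteTensor padding mask (hypothetical helper).

    image_sizes: list of (height, width) of each image's valid region,
    assumed to sit at the top-left of a canvas of size padded_hw.
    Padded (ignored) positions are 1, valid positions are 0.
    """
    pad_h, pad_w = padded_hw
    # Start with everything marked as padding ...
    mask = torch.ones(len(image_sizes), pad_h, pad_w, dtype=torch.uint8)
    for i, (h, w) in enumerate(image_sizes):
        # ... then clear the valid top-left region of each image.
        mask[i, :h, :w] = 0
    return mask


# Image 0 fills the 3x4 canvas; image 1 only covers a 2x2 top-left block.
mask = make_padding_mask([(3, 4), (2, 2)], padded_hw=(3, 4))
print(mask[1])
```

For a batch in which no image is padded, every entry is zero, which is the situation that passing mask=None describes.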

Returns:

Position embedding of shape [bs, num_feats*2, h, w].

Return type:

pos (Tensor)
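To make the forward computation concrete, here is a standalone sketch that follows the mmdetection recipe this module says it was adapted from. It is not vis4d's code: `sine_positional_encoding` is a hypothetical name, the inputs=None path is omitted (the sketch always expects a mask), and num_feats is assumed to be even.

```python
import math

import torch


def sine_positional_encoding(
    mask: torch.Tensor,
    num_feats: int,
    temperature: float = 10000,
    normalize: bool = False,
    scale: float = 2 * math.pi,
    eps: float = 1e-6,
    offset: float = 0.0,
) -> torch.Tensor:
    """DETR-style sine/cosine encoding (hypothetical sketch).

    mask: [bs, h, w], non-zero = ignored, zero = valid.
    Returns a tensor of shape [bs, 2 * num_feats, h, w].
    """
    not_mask = (mask == 0).to(torch.float32)
    # Cumulative counts of valid pixels act as 1-based row/column indices.
    y_embed = not_mask.cumsum(1)
    x_embed = not_mask.cumsum(2)
    if normalize:
        # Divide by the last cumulative value so indices land in (0, scale].
        y_embed = (y_embed + offset) / (y_embed[:, -1:, :] + eps) * scale
        x_embed = (x_embed + offset) / (x_embed[:, :, -1:] + eps) * scale
    # Channels 2k and 2k+1 share the frequency temperature ** (2k / num_feats).
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=mask.device)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_feats)
    pos_x = x_embed[..., None] / dim_t  # [bs, h, w, num_feats]
    pos_y = y_embed[..., None] / dim_t
    bs, h, w = mask.shape
    # Interleave sin on even channels with cos on odd channels.
    pos_x = torch.stack(
        (pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=4
    ).view(bs, h, w, -1)
    pos_y = torch.stack(
        (pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=4
    ).view(bs, h, w, -1)
    # y channels first, then x channels; move channels to dim 1.
    return torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)


mask = torch.zeros(2, 4, 5, dtype=torch.uint8)  # two unpadded 4x5 maps
pos = sine_positional_encoding(mask, num_feats=8, normalize=True)
print(pos.shape)  # torch.Size([2, 16, 4, 5])
```

Deriving positions from a cumulative sum over the mask, rather than from a plain arange, means padded regions do not shift the coordinates of valid pixels.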

class LearnedPositionalEncoding(num_feats, row_num_embed=50, col_num_embed=50)

Position embedding with learnable embedding weights.

__init__(num_feats, row_num_embed=50, col_num_embed=50)

Initialization for LearnedPositionalEncoding.

Parameters:

num_feats (int) – The feature dimension for each position along the x-axis or y-axis. The final returned dimension for each position is twice this value.
row_num_embed (int, optional) – The number of entries in the row embedding table, i.e. the maximum supported feature-map height. Defaults to 50.
col_num_embed (int, optional) – The number of entries in the column embedding table, i.e. the maximum supported feature-map width. Defaults to 50.
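Because the row and column embeddings are lookup tables indexed by position, a feature map taller than row_num_embed (or wider than col_num_embed) has no embedding to look up. A minimal illustration with plain torch.nn.Embedding, assuming, as in the mmdetection original, that rows are indexed by arange(h):

```python
import torch
from torch import nn

row_num_embed, num_feats = 50, 128
# One learned vector per row index 0 .. row_num_embed - 1.
row_embed = nn.Embedding(row_num_embed, num_feats)

ok = row_embed(torch.arange(row_num_embed))  # h = 50 fits the table
print(ok.shape)  # torch.Size([50, 128])

failed = False
try:
    # h = 51 would need row index 50, which the table does not have.
    row_embed(torch.arange(row_num_embed + 1))
except (IndexError, RuntimeError):
    failed = True
print(failed)
```

The defaults of 50 therefore cap the feature map at 50x50 positions; larger inputs need larger tables.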

init_weights()

Initialize the weights of position embedding.

Return type:

None

forward(mask)

Forward function for LearnedPositionalEncoding.

Parameters:

mask (Tensor) – ByteTensor mask. Non-zero values represent ignored (padded) positions, while zero values represent valid positions in the image. Shape [bs, h, w].

Returns:

Position embedding of shape [bs, num_feats*2, h, w].

Return type:

pos (Tensor)
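A sketch of how such a learned encoding can be assembled, modeled on the mmdetection original. `LearnedPosEncSketch` is a hypothetical class, not vis4d's implementation, and the uniform initialization is an assumption borrowed from mmdetection's default. In this sketch only the mask's shape is used, not its values.

```python
import torch
from torch import nn


class LearnedPosEncSketch(nn.Module):
    """Hypothetical learned 2D position encoding (illustration only)."""

    def __init__(self, num_feats: int, row_num_embed: int = 50, col_num_embed: int = 50) -> None:
        super().__init__()
        self.row_embed = nn.Embedding(row_num_embed, num_feats)
        self.col_embed = nn.Embedding(col_num_embed, num_feats)
        # Assumption: uniform init on [0, 1), as in mmdetection's default.
        nn.init.uniform_(self.row_embed.weight)
        nn.init.uniform_(self.col_embed.weight)

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        bs, h, w = mask.shape
        x_embed = self.col_embed(torch.arange(w, device=mask.device))  # [w, num_feats]
        y_embed = self.row_embed(torch.arange(h, device=mask.device))  # [h, num_feats]
        # Build an [h, w, 2 * num_feats] grid: column features, then row features.
        pos = torch.cat(
            (x_embed.unsqueeze(0).repeat(h, 1, 1), y_embed.unsqueeze(1).repeat(1, w, 1)),
            dim=-1,
        )
        # Channels first, then repeat the same grid for every batch element.
        return pos.permute(2, 0, 1).unsqueeze(0).repeat(bs, 1, 1, 1)


enc = LearnedPosEncSketch(num_feats=16)
mask = torch.zeros(2, 7, 9, dtype=torch.uint8)
pos = enc(mask)
print(pos.shape)  # torch.Size([2, 32, 7, 9])
```

Unlike the sine encoding, every position gets a trainable vector, at the cost of a fixed maximum height and width.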