vis4d.op.layer.ms_deform_attn

Multi-Scale Deformable Attention Module.

Modified from Deformable DETR (https://github.com/fundamentalvision/Deformable-DETR/blob/main/models/ops/modules/ms_deform_attn.py)

Functions

`is_power_of_2`(number)	Check if a number is a power of 2.
`ms_deformable_attention_cpu`(value, ...)	CPU version of multi-scale deformable attention.

Classes

`MSDeformAttention`([d_model, n_levels, ...])	Multi-Scale Deformable Attention Module.
`MSDeformAttentionFunction`(*args, **kwargs)	Multi-Scale Deformable Attention Function module.
`MultiScaleDeformableAttention`([embed_dims, ...])	A wrapper for `MSDeformAttention`.

class MSDeformAttentionFunction(*args, **kwargs)[source]

Multi-Scale Deformable Attention Function module.

static forward(ctx, value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, im2col_step)[source]

Forward pass.

Return type: Tensor

static backward(ctx, grad_output)[source]

Backward pass.

Return type: tuple[Tensor, None, None, Tensor, Tensor, None]

ms_deformable_attention_cpu(value, value_spatial_shapes, sampling_locations, attention_weights)[source]

CPU version of multi-scale deformable attention.

Parameters:

value (Tensor) – The value tensor, has shape (bs, num_keys, num_heads, embed_dims // num_heads).
value_spatial_shapes (Tensor) – Spatial shape of each feature map, has shape (num_levels, 2); the last dimension holds (h, w).
sampling_locations (Tensor) – The locations of the sampling points, has shape (bs, num_queries, num_heads, num_levels, num_points, 2); the last dimension holds (x, y).
attention_weights (Tensor) – The weights of the sampling points used when calculating the attention, has shape (bs, num_queries, num_heads, num_levels, num_points).

Returns:

The attention output, has shape (bs, num_queries, embed_dims).

Return type:

Tensor
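As a rough illustration of what the parameters above describe, the sketch below computes the attention output for a single query, a single head, and a single channel in plain Python. It assumes bilinear sampling with zero padding and pixel centers at (i + 0.5) / size, which is the convention of the Deformable DETR reference code; whether this module uses exactly that convention is an assumption. The names `bilinear_sample` and `deform_attn_single_query` are hypothetical and not part of vis4d.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample a 2D map (list of rows) at a normalized location (x, y) in [0, 1].

    Assumes pixel centers at (i + 0.5) / size; neighbors outside the map
    contribute zero (zero padding).
    """
    height, width = len(feature), len(feature[0])
    # Convert the normalized location to continuous pixel coordinates.
    px = x * width - 0.5
    py = y * height - 0.5
    x0, y0 = math.floor(px), math.floor(py)
    fx, fy = px - x0, py - y0
    total = 0.0
    for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= yy < height and 0 <= xx < width:
                total += wy * wx * feature[yy][xx]
    return total


def deform_attn_single_query(levels, locations, weights):
    """Weighted sum of sampled values over all levels and sampling points.

    levels[l] is a 2D map, locations[l][p] is an (x, y) pair, and
    weights[l][p] is the attention weight of that sampling point.
    """
    output = 0.0
    for feature, level_locs, level_weights in zip(levels, locations, weights):
        for (x, y), weight in zip(level_locs, level_weights):
            output += weight * bilinear_sample(feature, x, y)
    return output


# Two feature levels: a 2x2 map and a 1x1 map.
levels = [[[1.0, 2.0], [3.0, 4.0]], [[10.0]]]
locations = [[(0.25, 0.25), (0.5, 0.5)], [(0.5, 0.5)]]
weights = [[0.25, 0.25], [0.5]]
result = deform_attn_single_query(levels, locations, weights)
```

The full function applies the same weighted sum for every batch element, query, head, and channel, and concatenates the heads to obtain the last output dimension.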

is_power_of_2(number)[source]

Check if a number is a power of 2.

Return type: None
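For readers unfamiliar with the check, a common way to test for a power of 2 is the bit trick sketched below. The function name `is_power_of_2_sketch` and its boolean return value are assumptions made for illustration; the documented function may validate or report its result differently.

```python
def is_power_of_2_sketch(number: int) -> bool:
    """Return True if `number` is a positive integer power of two."""
    # A power of two has exactly one bit set, so clearing the lowest set
    # bit with `number & (number - 1)` leaves zero.
    return number > 0 and (number & (number - 1)) == 0


# 256 / 8 = 32 channels per head is a power of two; 24 is not.
checks = [is_power_of_2_sketch(n) for n in (32, 24, 1, 0)]
```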

class MSDeformAttention(d_model=256, n_levels=4, n_heads=8, n_points=4, im2col_step=64)[source]

Multi-Scale Deformable Attention Module.

This is the original implementation from Deformable DETR.

__init__(d_model=256, n_levels=4, n_heads=8, n_points=4, im2col_step=64)[source]

Creates an instance of the class.

Parameters:

d_model (int) – Hidden dimensions.
n_levels (int) – Number of feature levels.
n_heads (int) – Number of attention heads.
n_points (int) – Number of sampling points per attention head per feature level.
im2col_step (int) – The step used in image_to_column. Default: 64.

forward(query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None)[source]

Forward function.

Parameters:

query (Tensor) – (n, length_{query}, C).
reference_points (Tensor) – (n, length_{query}, n_levels, 2), normalized to [0, 1] with (0, 0) at the top-left and (1, 1) at the bottom-right, padding area included; or (n, length_{query}, n_levels, 4), where the additional (w, h) turns each point into a reference box.
input_flatten (Tensor) – (n, \sum_{l=0}^{L-1} H_l \cdot W_l, C).
input_spatial_shapes (Tensor) – (n_levels, 2), [(H_0, W_0), (H_1, W_1), …, (H_{L-1}, W_{L-1})].
input_level_start_index (Tensor) – (n_levels,), [0, H_0*W_0, H_0*W_0+H_1*W_1, H_0*W_0+H_1*W_1+H_2*W_2, …, H_0*W_0+H_1*W_1+…+H_{L-2}*W_{L-2}].
input_padding_mask (Tensor | None) – (n, \sum_{l=0}^{L-1} H_l \cdot W_l), True for padding elements, False for non-padding elements. Defaults to None.

Return type:

Tensor

Returns: output (Tensor): (n, length_{query}, C).
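The relationship between the per-level shapes, the start indices, and the flattened length can be sketched in plain Python. The helper name `flatten_level_metadata` is hypothetical; in practice these values would be built as tensors.

```python
def flatten_level_metadata(level_shapes):
    """Return (spatial_shapes, level_start_index, total_length).

    `level_shapes` is a sequence of (H, W) pairs, one per feature level.
    """
    spatial_shapes = [(h, w) for h, w in level_shapes]
    level_start_index = []
    offset = 0
    for h, w in spatial_shapes:
        # Each level starts where the flattened positions of all
        # earlier levels end.
        level_start_index.append(offset)
        offset += h * w
    return spatial_shapes, level_start_index, offset


shapes, starts, total = flatten_level_metadata(
    [(64, 64), (32, 32), (16, 16), (8, 8)]
)
print(starts)  # [0, 4096, 5120, 5376]
print(total)   # 5440, the second dimension of input_flatten
```

There is one start index per level, so the list has n_levels entries and the final running total is the flattened length rather than a start index.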

__call__(query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None)[source]

Type definition for call implementation.

Return type: Tensor

class MultiScaleDeformableAttention(embed_dims=256, num_heads=8, num_levels=4, num_points=4, im2col_step=64, dropout=0.0)[source]

A wrapper for MSDeformAttention.

This module wraps MSDeformAttention with an identity (residual) connection and accepts the positional encoding of the query as an additional input.

__init__(embed_dims=256, num_heads=8, num_levels=4, num_points=4, im2col_step=64, dropout=0.0)[source]

Creates an instance of the class.
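The identity connection described for this wrapper can be pictured with the following plain-Python sketch, which uses flat lists of floats in place of tensors. It assumes that the identity defaults to the query before the positional encoding is added and that dropout is 0.0 (the default), so the attended output is added unchanged; both points, and the names `wrapped_forward` and `attention`, are illustrative assumptions rather than the module's confirmed behavior.

```python
def wrapped_forward(attention, query, query_pos=None, identity=None):
    """Attend over query (+ query_pos), then add the identity branch."""
    if identity is None:
        # Assumption: the residual branch uses the query as passed in.
        identity = query
    if query_pos is not None:
        query = [q + p for q, p in zip(query, query_pos)]
    attended = attention(query)
    # With dropout=0.0 the attended values pass through unchanged.
    return [a + i for a, i in zip(attended, identity)]


def double(values):
    """Stand-in for the attention module: doubles every element."""
    return [2.0 * v for v in values]


with_pos = wrapped_forward(double, [1.0, 2.0], query_pos=[0.5, 0.5])
with_identity = wrapped_forward(
    double, [1.0, 2.0], query_pos=[0.5, 0.5], identity=[0.0, 0.0]
)
```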

forward(query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, query_pos=None, identity=None, input_padding_mask=None)[source]

Forward function.

Parameters:

query (Tensor) – The input query with shape (bs, num_queries, embed_dims).
reference_points (Tensor) – (bs, num_queries, num_levels, 2), normalized to [0, 1] with (0, 0) at the top-left and (1, 1) at the bottom-right, padding area included; or (bs, num_queries, num_levels, 4), where the additional (w, h) turns each point into a reference box.
input_flatten (Tensor) – (bs, \sum_{l=0}^{L-1} H_l \cdot W_l, C).
input_spatial_shapes (Tensor) – (num_levels, 2), [(H_0, W_0), (H_1, W_1), …, (H_{L-1}, W_{L-1})].
input_level_start_index (Tensor) – (num_levels,), [0, H_0*W_0, H_0*W_0+H_1*W_1, H_0*W_0+H_1*W_1+H_2*W_2, …, H_0*W_0+H_1*W_1+…+H_{L-2}*W_{L-2}].
query_pos (Tensor | None) – The positional encoding for query, with the same shape as query. If not None, it is added to query before the attention is computed. Defaults to None.
identity (Tensor | None) – A tensor with the same shape as query, used for the identity connection. If None, query is used. Defaults to None.
input_padding_mask (Tensor | None) – (bs, \sum_{l=0}^{L-1} H_l \cdot W_l), True for padding elements, False for non-padding elements. Defaults to None.

Return type:

Tensor

Returns: output (Tensor): (bs, num_queries, C).
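The two accepted layouts of reference_points lead to different ways of turning a predicted offset into a sampling location. The sketch below follows the formulas of the Deformable DETR reference code for a single sampling point; that this module uses the same formulas is an assumption, and `sampling_location` is a hypothetical helper name.

```python
def sampling_location(reference, offset, level_shape, n_points):
    """Map one reference point or box plus an offset to a normalized (x, y).

    `reference` is (x, y) or (x, y, w, h) in normalized coordinates,
    `offset` is the predicted (dx, dy), and `level_shape` is (H, W).
    """
    height, width = level_shape
    if len(reference) == 2:
        ref_x, ref_y = reference
        # Offsets are in pixels of this level; dividing by (W, H)
        # brings them into the normalized [0, 1] range.
        return ref_x + offset[0] / width, ref_y + offset[1] / height
    if len(reference) == 4:
        ref_x, ref_y, box_w, box_h = reference
        # Offsets are scaled relative to half the reference box size.
        return (
            ref_x + offset[0] / n_points * box_w * 0.5,
            ref_y + offset[1] / n_points * box_h * 0.5,
        )
    raise ValueError("reference must have 2 or 4 entries")


point_loc = sampling_location((0.5, 0.5), (2.0, -1.0), (8, 16), n_points=4)
box_loc = sampling_location((0.5, 0.5, 0.5, 0.25), (2.0, 4.0), (8, 16), n_points=4)
```

With point references the offset scale depends on the resolution of each level, while with box references it depends on the box size, so the same offset reaches further for larger boxes.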