ActionFormer: temporal action localization
Identify actions in time and recognize their categories.
ActionFormer localizes moments of actions in a single shot, without action proposals or pre-defined anchor windows.
Extract a feature pyramid from the input video with a multi-scale Transformer.
Deploy a convolutional decoder on the feature pyramid to both classify the action category of each moment and regress the action onset and offset.
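A minimal sketch (not the official ActionFormer code) of how this per-moment formulation decodes into action segments: at each time step t the model outputs class probabilities and the distances to the action's onset and offset, so a candidate segment is simply (t - d_start, t + d_end). Names and the threshold are illustrative.

```python
import torch

def decode_segments(probs, dists, score_thresh=0.5):
    """probs: (T, C) per-moment class probabilities;
    dists: (T, 2) predicted distances to onset and offset (in time steps)."""
    T, C = probs.shape
    t = torch.arange(T, dtype=torch.float32)
    starts = t - dists[:, 0]           # segment onset implied by moment t
    ends = t + dists[:, 1]             # segment offset implied by moment t
    scores, labels = probs.max(dim=1)  # best class at each moment
    keep = scores > score_thresh       # drop low-confidence moments
    return torch.stack([starts[keep], ends[keep]], dim=1), labels[keep], scores[keep]

# Example: 100 time steps, 20 action classes
segs, labels, scores = decode_segments(torch.rand(100, 20), torch.rand(100, 2) * 5)
```

In the full pipeline, overlapping candidates from different moments are further merged at inference time (e.g. with Soft-NMS).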
A projection function, implemented as a convolutional network, embeds each feature into a D-dimensional space.
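A minimal sketch of this projection step, assuming PyTorch; the layer count, kernel width, and dimensions are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Embed per-frame video features (B, C_in, T) into a D-dimensional space."""
    def __init__(self, in_dim=2048, embed_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(in_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):        # x: (B, C_in, T)
        return self.proj(x)      # (B, D, T)

feats = torch.randn(2, 2048, 256)     # e.g. two clips of pre-extracted features
print(ConvProjection()(feats).shape)  # torch.Size([2, 512, 256])
```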
A multi-scale Transformer network maps the embedded features to a feature pyramid.
It uses local self-attention to reduce the computational complexity.
Across the pyramid, the attention window size stays the same while the feature map is down-sampled between layers, so deeper levels cover longer temporal extents (a sketch follows below).
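A minimal sketch of one pyramid stage, assuming PyTorch: self-attention restricted to non-overlapping local windows of fixed size, followed by a strided convolution that halves the sequence length. The real model uses a more refined local attention and per-level weights; this only illustrates the fixed-window, shrinking-feature-map idea.

```python
import torch
import torch.nn as nn

class LocalAttentionStage(nn.Module):
    def __init__(self, dim=512, heads=8, window=16):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x):  # x: (B, T, D), T divisible by the window size
        B, T, D = x.shape
        w = x.reshape(B * T // self.window, self.window, D)   # split into windows
        a, _ = self.attn(w, w, w)                             # attend within each window
        x = self.norm(x + a.reshape(B, T, D))                 # residual + norm
        return self.down(x.transpose(1, 2)).transpose(1, 2)   # halve T: (B, T/2, D)

x = torch.randn(2, 256, 512)
stage = LocalAttentionStage()
pyramid = []
for _ in range(4):             # window stays fixed; T shrinks at each level
    pyramid.append(x)
    x = stage(x)
print([p.shape[1] for p in pyramid])  # [256, 128, 64, 32]
```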
Classification head: examines each moment across all levels of the pyramid and predicts the probability of an action at every moment.
Regression head: examines each moment across all levels of the pyramid and predicts the distances to the onset and offset of an action (see the sketch after this list).
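A minimal sketch of the two heads, assuming PyTorch: lightweight 1D convolutional stacks applied with shared weights at every pyramid level. The sigmoid classification output and non-negative regression output follow the description above; exact depths and activations are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Heads(nn.Module):
    def __init__(self, dim=512, num_classes=20):
        super().__init__()
        self.cls = nn.Sequential(                  # per-class action probability
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, num_classes, 3, padding=1),
        )
        self.reg = nn.Sequential(                  # distances to onset and offset
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, 2, 3, padding=1),
        )

    def forward(self, pyramid):                    # list of (B, D, T_l) tensors
        out = []
        for feat in pyramid:                       # same heads at every level
            probs = torch.sigmoid(self.cls(feat))  # (B, num_classes, T_l)
            dists = F.relu(self.reg(feat))         # (B, 2, T_l), non-negative
            out.append((probs, dists))
        return out

pyramid = [torch.randn(2, 512, t) for t in (256, 128, 64, 32)]
for probs, dists in Heads()(pyramid):
    print(probs.shape, dists.shape)
```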