
ActionFormer: temporal action localization

Identify actions in time and recognize their categories.



Contribution

ActionFormer localizes moments of actions in a single shot, without relying on action proposals or pre-defined anchor windows.

Methods = Encoder + Decoder

  1. Extract a feature pyramid with a multi-scale Transformer from the input video.

  2. Deploy a convolutional decoder on the feature pyramid that both classifies the action category at each candidate moment and regresses the action's onset and offset.
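The two-step pipeline above can be sketched end to end with toy stand-ins: random linear maps take the place of the learned convolutions and the multiscale transformer, and every function name here is hypothetical rather than taken from the ActionFormer code.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_embed(features, D=8):
    # stand-in for the 1D-conv projection: a per-moment linear map into R^D
    W = rng.standard_normal((features.shape[1], D))
    return features @ W

def pyramid_encoder(x, levels=3):
    # stand-in for the multiscale transformer: each level halves T
    out = []
    for _ in range(levels):
        out.append(x)
        x = x[::2]
    return out

def heads(pyramid, num_classes=2):
    # lightweight decoder shared across all pyramid levels:
    # a classification head and a regression head
    D = pyramid[0].shape[1]
    Wc = rng.standard_normal((D, num_classes))
    Wr = rng.standard_normal((D, 2))
    cls = [level @ Wc for level in pyramid]          # (T_l, C) class scores
    reg = [np.abs(level @ Wr) for level in pyramid]  # (T_l, 2) distances >= 0
    return cls, reg

video_features = rng.standard_normal((16, 32))  # T=16 clip-level features
pyr = pyramid_encoder(conv_embed(video_features))
cls, reg = heads(pyr)
print([c.shape for c in cls])  # [(16, 2), (8, 2), (4, 2)]
```

The heads run at every moment of every level, which is what lets the model localize actions of very different durations without anchor windows.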

Encoder = CNN + Transformer (No positional encoding)

  1. A projection function, implemented with a shallow convolutional network, embeds each clip feature into a D-dimensional space.

  2. A multiscale transformer network that maps the embedded features to a feature pyramid.

  • Uses local self-attention to reduce the complexity.

  • In the multiscale transformer, the window size stays the same while the feature map is down-sampled between layers, so deeper levels cover longer temporal extents of the video.
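A minimal sketch of these two points, assuming a pyramid where each level halves the temporal length; unlike the real model, it uses identity query/key/value projections and a single head, and `downsample` stands in for the strided convolution between levels:

```python
import numpy as np

def local_self_attention(x, window=4):
    """Windowed self-attention: each moment attends only to a local window,
    so the cost is O(T * window) instead of the O(T^2) of full attention.
    x: (T, D) feature sequence."""
    T, D = x.shape
    out = np.empty_like(x)
    for t in range(T):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        scores = x[lo:hi] @ x[t] / np.sqrt(D)   # (w,) similarity scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # softmax over the window
        out[t] = weights @ x[lo:hi]
    return out

def downsample(x, stride=2):
    # stand-in for a strided conv / pooling between pyramid levels
    return x[::stride]

# Build a 3-level feature pyramid: the window size is fixed at every level,
# but because the sequence is down-sampled, the same window spans a longer
# stretch of the original video at deeper levels.
x = np.random.randn(16, 8)
pyramid = []
for level in range(3):
    x = local_self_attention(x, window=4)
    pyramid.append(x)
    x = downsample(x)

print([p.shape[0] for p in pyramid])  # temporal lengths [16, 8, 4]
```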

Decoder = classification head + regression head

  1. Classification head: examines each moment t across all L levels of the pyramid, and predicts the probability of action p(a_t) at every moment.

  2. Regression head: examines each moment t across all L levels of the pyramid, and predicts the distances to the onset and offset of an action.
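To make the head outputs concrete, here is a hedged sketch of how per-moment predictions can be decoded into action segments. The function name and the score threshold are illustrative, and the real pipeline also applies Soft-NMS to merge overlapping candidates:

```python
import numpy as np

def decode_segments(cls_scores, regressions, strides, threshold=0.5):
    """Turn per-moment head outputs into action segments.
    cls_scores:  list over pyramid levels, each (T_l, C) action probabilities
    regressions: list over levels, each (T_l, 2) distances to onset/offset,
                 measured in units of that level's stride
    strides:     temporal stride of each level in input frames
    Returns (start, end, class, score) tuples for confident moments."""
    segments = []
    for p, reg, s in zip(cls_scores, regressions, strides):
        for t in range(p.shape[0]):
            c = int(p[t].argmax())
            score = float(p[t, c])
            if score < threshold:
                continue
            center = t * s            # moment's position in input frames
            d_on, d_off = reg[t] * s  # distances converted back to frames
            segments.append((center - d_on, center + d_off, c, score))
    return segments

# Toy example: one pyramid level with stride 1 and three moments.
cls = [np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])]
reg = [np.array([[2.0, 3.0], [1.0, 1.0], [0.5, 0.5]])]
print(decode_segments(cls, reg, strides=[1]))
```

Because the distances are regressed relative to each level's stride, coarse pyramid levels naturally produce long segments and fine levels produce short ones.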