Learning Spatiotemporal Features with 3D Convolutional Networks: C3D

n.han 2021. 5. 12. 21:33

Introduction

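Video analysis
- The amount of video data on the internet is growing explosively. Video analysis covers many actively studied problems, such as action recognition and abnormal event detection, but there is no generic model that can handle all of them.
- A video descriptor (a model that extracts visual features from video) should have four properties.
  1. generic: videos come in many different types.
  2. compact: the data is large, so the descriptor must be compact to be scalable.
  3. efficient: thousands of videos are expected to be processed within minutes.
  4. simple: it should be simple rather than rely on complicated feature methods.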
C3D (3D ConvNet)
- ConvNets have achieved remarkable results in the image domain. Image features obtained from a ConvNet are used for transfer learning and similar purposes, but such a ConvNet cannot be applied directly to video data.
- This work proposes a 3D ConvNet that can learn spatio-temporal features from video data. The 3D ConvNet is an existing concept; this work applies it to video analysis tasks.
- The C3D model has the four desirable properties of a video descriptor.
Contributions
1. Shows experimentally that a 3D ConvNet works well on video data.
2. Finds empirically that applying 3 X 3 X 3 conv kernels in all layers works best.
3. Outperforms existing benchmark results on several tasks.

Related Work

Models have been developed for individual video analysis tasks, but they were not generic and were computationally intensive.
- STIPs, SIFT-3D, HOG3D, iDT ...
ConvNets are widely used on image data.
There is also work that performs action recognition with a two-stream network.

The most closely related study is the work on human action recognition with a 3D ConvNet.
- S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013.
- To recognize human actions, it relied on head tracking and used the resulting segmented video as the dataset.
- This work uses full video frames without such preprocessing, so it scales to more diverse datasets.

Learning Features with 3D ConvNets: First experiment to search for the best kernel temporal depth d

3D convolution and pooling are used to obtain spatio-temporal features. 2D convolution operates only spatially.

(a): 2D conv on a single channel. (b): 2D conv on L channels; temporal information is lost. (c): 3D conv; temporal information is preserved.
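
The difference between (b) and (c) can be checked with a naive pure-Python convolution. This is only a minimal sketch, not the paper's code: the helper names `conv3d_valid`, `conv2d_over_frames`, and `shape` are hypothetical, and it assumes a single input and output channel, square spatial kernels, no padding, and stride 1.

```python
def conv3d_valid(volume, kernel):
    """Naive 3D cross-correlation: volume is L x H x W, kernel is d x k x k."""
    L, H, W = len(volume), len(volume[0]), len(volume[0][0])
    d, k = len(kernel), len(kernel[0])
    out = []
    for t in range(L - d + 1):
        plane = []
        for y in range(H - k + 1):
            row = []
            for x in range(W - k + 1):
                s = 0.0
                for dt in range(d):
                    for dy in range(k):
                        for dx in range(k):
                            s += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


def conv2d_over_frames(volume, kernel):
    """2D conv that treats the L frames as L input channels.

    The kernel spans all L frames at once, so the time axis collapses
    and a single 2D map is returned.
    """
    assert len(kernel) == len(volume), "kernel must cover every frame"
    return conv3d_valid(volume, kernel)[0]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# A toy clip: 4 frames of 5 x 5 pixels.
clip = [[[1.0] * 5 for _ in range(5)] for _ in range(4)]
kernel_3d = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]  # d = 3
kernel_2d = [[[1.0] * 3 for _ in range(3)] for _ in range(4)]  # spans all 4 frames

print(shape(conv3d_valid(clip, kernel_3d)))        # output still has a time axis
print(shape(conv2d_over_frames(clip, kernel_2d)))  # output is a single 2D map
```

With a temporal depth smaller than the clip length, the output remains a volume, so later layers can still model time. In case (b) the time axis is gone after the first layer.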

VGGNet achieved good performance by stacking small 3 X 3 convolutional kernels deeply, and C3D borrows this design.
Dataset: UCF101, a medium-scale dataset with 101 different actions, used in the first experiment to search for the best architecture.

Notations
- Video clips: c x l x h x w (channels, length in number of frames, height, width)
- Kernel size: d x k x k (d: temporal depth, k: spatial size)
Common network settings
- Input
  1. resized to 128 X 171
  2. split into 16-frame clips (3 x 16 x 128 x 171)
  3. jittered with random crops of size 3 X 16 X 112 X 112
- 5 conv-pooling layer pairs + 2 fc layers + softmax (only 5 conv layers are used for the kernel depth search; 3 more conv layers are added later)
- Conv
  - Number of filters: 64, 128, 256, 256, 256 for layers 1 to 5, each with temporal depth d (d is varied experimentally, as described later)
  - Spatial and temporal padding + stride 1 -> output has the same size as the input
- Pooling
  - Max pooling
  - Kernel size of the first pooling: 1 X 2 X 2 (so that the temporal signal is not merged too early); 2 X 2 X 2 for the rest
  - The stride matches the kernel size, so the pooling windows do not overlap.
- batch size: 30, learning rate: 0.003, training stopped after 16 epochs (the number of epochs seems to have been kept small so that a good kernel size could be settled first by varying it)
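
The input steps above can be sketched with the standard library alone. This is a hedged illustration, not the authors' pipeline: `split_into_clips` and `random_crop_box` are hypothetical helpers, and the sketch assumes that clips do not overlap, that leftover frames at the end of a video are dropped, and that one crop box is shared by all 16 frames of a clip.

```python
import random

CLIP_LEN = 16                # frames per clip
FRAME_H, FRAME_W = 128, 171  # frame size after resizing
CROP = 112                   # spatial crop size


def split_into_clips(num_frames, clip_len=CLIP_LEN):
    """Return (start, end) frame index ranges of consecutive clips.

    Assumption: clips do not overlap and a trailing partial clip is dropped.
    """
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]


def random_crop_box(rng, h=FRAME_H, w=FRAME_W, crop=CROP):
    """Pick a random crop x crop window as (top, left, bottom, right).

    Assumption: the same box is applied to every frame of one clip.
    """
    top = rng.randint(0, h - crop)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop)
    return top, left, top + crop, left + crop


rng = random.Random(0)
for start, end in split_into_clips(50):
    print((start, end), random_crop_box(rng))
```

A 50-frame video yields three clips here, and each crop has the 112 X 112 spatial size of the 3 X 16 X 112 X 112 network input.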

C3D architecture (one conv layer each is added to the 3rd, 4th, and 5th stages of the common setting)

Varying network architectures
- How should temporal information be aggregated? Tested by changing only the kernel temporal depth.
- Conclusion: depth-3 is the best!
- depth-1, 3, 5, 7: the temporal depth of every kernel is set to d
- increasing depth: 3-3-5-5-7 / decreasing depth: 7-5-5-3-3
- The difference in the number of parameters between depth-1 and depth-7 is very large, yet the difference in accuracy is not that large.
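
To get a feel for how the temporal depth affects the conv parameter count, the arithmetic can be written out. This is a rough sketch under stated assumptions, not figures taken from the paper: it assumes full d X 3 X 3 kernels, 3 input channels, one bias per filter, and the filter counts 64, 128, 256, 256, 256 from the common setting. `conv_param_count` is a hypothetical helper.

```python
FILTERS = [64, 128, 256, 256, 256]  # filters in conv layers 1 to 5
IN_CHANNELS = 3                     # assumption: RGB input
K = 3                               # assumption: 3 x 3 spatial kernel


def conv_param_count(depths, filters=FILTERS, in_channels=IN_CHANNELS, k=K):
    """Sum weights and biases over the conv layers for given temporal depths."""
    total = 0
    c_in = in_channels
    for d, c_out in zip(depths, filters):
        total += d * k * k * c_in * c_out  # kernel weights
        total += c_out                     # one bias per filter
        c_in = c_out
    return total


configs = {
    "depth-1": [1] * 5,
    "depth-3": [3] * 5,
    "depth-5": [5] * 5,
    "depth-7": [7] * 5,
    "increasing": [3, 3, 5, 5, 7],
    "decreasing": [7, 5, 5, 3, 3],
}
for name, depths in configs.items():
    print(f"{name:>10}: {conv_param_count(depths):,}")
```

Under these assumptions the weight count grows linearly with d. The fc layers are left out, so the totals cover the conv layers only.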

Learning Features with 3D ConvNets: Spatiotemporal feature learning

Add three more conv layers to the common network design, fix the temporal depth at 3, and learn spatio-temporal features!
Conv: 3 X 3 X 3 filters / 'same' padding / stride 1
Pooling: 1 X 2 X 2 (pool1) / 2 X 2 X 2 (pool2, 3, 4, 5) (the stride equals the pooling size so the windows do not overlap, which halves the size)
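
Because the conv layers use 'same' padding with stride 1, only the pooling layers change the (length, height, width) of a clip. The sketch below traces those sizes for a 16 X 112 X 112 crop. It is an illustration under assumptions: `trace_shapes` is a hypothetical helper, and how the odd size 7 is rounded at pool5 depends on the implementation, so both floor and ceiling rounding are shown.

```python
# (temporal, height, width) pooling kernels; the stride equals the kernel size
POOLS = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]


def pool_out(size, kernel, ceil_mode=False):
    """Output length of non-overlapping pooling along one axis."""
    if ceil_mode:
        return -(-size // kernel)  # ceiling division
    return size // kernel          # floor division


def trace_shapes(clip=(16, 112, 112), pools=POOLS, ceil_mode=False):
    """Return the clip shape before pooling and after each pooling layer."""
    shapes = [clip]
    l, h, w = clip
    for kt, kh, kw in pools:
        l = pool_out(l, kt, ceil_mode)
        h = pool_out(h, kh, ceil_mode)
        w = pool_out(w, kw, ceil_mode)
        shapes.append((l, h, w))
    return shapes


for mode in (False, True):
    print("ceil" if mode else "floor", trace_shapes(ceil_mode=mode))
```

pool1 keeps all 16 frames and halves only the spatial size, while the remaining pools halve the temporal length as well, so the temporal signal is merged gradually.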

Dataset: Sports-1M (1.1 million sports videos, 487 sports categories, the largest at that time)

Training: extract five 2-second long clips from every training video / horizontally flip with 50% probability / SGD with batch size 30 / Initial learning rate is 0.003 and is divided by 2 every 150K iterations / Stopped at 13 epochs
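
The step schedule described above can be written as a small function. This is a minimal sketch assuming the rate drops exactly at multiples of 150K iterations; `learning_rate` is a hypothetical helper, not a function from any training framework.

```python
def learning_rate(iteration, base_lr=0.003, step=150_000, factor=2.0):
    """Step decay: divide base_lr by `factor` once per `step` iterations."""
    return base_lr / (factor ** (iteration // step))


for it in (0, 149_999, 150_000, 300_000, 450_000):
    print(it, learning_rate(it))
```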
Results

Convolution pooling used a very large number of clips from each video, and DeepVideo used about the same number of clips as C3D but took more crops from each clip. Even so, in video top-5 accuracy C3D reached 84.4% and the fine-tuned C3D reached 85.2%, beating DeepVideo.

Visualization of spatio-temporal features
- What features does C3D learn? The deconv method (M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.) helps to understand what C3D has learned internally.
- C3D was observed to focus on appearance in the first few frames and on motion afterwards.
- The figure shows Conv5b feature maps projected back to image space with deconv. The feature maps first focus on the overall appearance and then track the motion as the frames go by (selectively attend).

Feature embeddings visualized with t-SNE. The C3D features are semantically separable.
