POSTECH AI for Visual Computing 2022.2

2022.2.15 15:00~16:30 Alexey Dosovitskiy, Transformers for Computer Vision

A very nice blog post, https://iaml-it.github.io/posts/2021-04-28-transformers-in-vision/

Kolesnikov et al, Big Transfer (BiT): General Visual Representation Learning, ECCV 2020
Barbu et al, ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models, NIPS 2019
Dosovitskiy et al, An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, ICLR 2021
Zhai et al, Scaling Vision Transformers, 2021
Liu et al, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021
Tolstikhin et al, MLP-Mixer: An all-MLP Architecture for Vision, 2021
Ranftl et al, Vision Transformers for Dense Prediction, ICCV 2021
Arnab et al, ViViT: A Video Vision Transformer, ICCV 2021
Carion et al, End-to-End Object Detection with Transformers, ECCV 2020
Zhao et al, Point Transformer, ICCV 2021
Sajjadi et al, Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, arXiv 2021
Caron et al, Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021
Bao et al, BEiT: BERT Pre-Training of Image Transformers, arXiv 2021
Kaiming He et al, Masked Autoencoders Are Scalable Vision Learners, arXiv 2021
Radford et al, Learning Transferable Visual Models From Natural Language Supervision, ICML 2021
Akbari et al, VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, NIPS 2021

2022.2.15 16:30~18:00 Ivan Laptev

Visual Representations from Videos
– Miech et al, End-to-End Learning of Visual Representations from Uncurated Instructional Videos, CVPR 2020 Oral
– Miech et al, HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, ICCV 2019

Video Question Answering
– Yang et al, Just Ask: Learning to Answer Questions from Millions of Narrated Videos, ICCV 2021
– HowToVQA69M Datasets

Vision-and-Language Navigation (VLN)
– Guhur et al, Airbert: In-domain Pretraining for Vision-and-Language Navigation, ICCV 2021
– Chen et al, History Aware Multimodal Transformer for Vision-and-Language Navigation, NIPS 2021

Efficient transformers
– Miech et al, Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, CVPR 2021
– El-Nouby et al, XCiT: Cross-Covariance Image Transformers, NIPS 2021

2022.2.16 13:30~15:00 Jun-Yan Zhu, Human-in-the-loop Model Creation

Their Works
– Isola et al, Image-to-image translation with conditional adversarial networks, CVPR 2017
– Zhu et al, Unpaired image-to-image translation using cycle-consistent adversarial networks, ICCV 2017
– Park et al, GauGAN, SIGGRAPH RealTime Live 2019
– Park et al, Swapping Autoencoder For Deep Image Manipulation, NIPS 2020

Data Augmentation for GANs
– Zhao et al, Differentiable Augmentation for Data Efficient GAN Training, NIPS 2020
– Karras et al, Training Generative Adversarial Networks with Limited Data, NIPS 2020
– Tran et al, On Data Augmentation for GAN Training, IEEE TIP 2020
– Zhao et al, Image Augmentations for GAN Training, arXiv 2020

Customizing a GAN with sketches
– Wang et al, Sketch Your Own GAN, ICCV 2021
– Bau et al, Rewriting a Deep Generative Model, ECCV 2020

2022.2.16 16:30~18:00 Armand Joulin, Advances in Self-supervised Learning

DINO
Caron et al, Emerging Properties in Self-Supervised Vision Transformers, 2021

Vision Transformer
+ Add Exponential Moving Average
+ Simplifying Normalization
+ Centering
+ Sharpening
+ Multi-crop

DINO + ViT : excellent K-NN performance

Application to copy detection

Applications: style transfer
Tumanyan et al, Splicing ViT Features for Semantic Appearance Transfer, 2022

Applications: part discovery
Amir et al, Deep ViT Features as Dense Visual Descriptors, 2022

You Might Also Like

2020봄학기 AI 연구원 콜로퀴움 시리즈 – Human Pose Estimation

2020 한국인공지능학회 동계강좌 정리 – 2. 서울대 김건희 교수님, Pretrained Language Model

2020 한국인공지능학회 동계강좌 정리 – 6. KAIST 양은호 교수님, Deep Generative Models