# PRML 2022 Winter School

2022.2.15.

### Lecture#1 Gaussian Processes and Neural Processes, Prof. 이주호, KAIST

Bayesian regression: place a prior $\theta \sim p(\theta)$ on the parameters to represent uncertainty.

$p(y_*|x_*, X, y) = \int p(y_*|x_*,\theta) p(\theta | X, y) d \theta$

– Data uncertainty (aleatoric), due to observation noise: $p(y_*|x_*,\theta)$
– Model uncertainty (epistemic), due to parameter uncertainty: $p(\theta | X, y)$
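
The split into aleatoric and epistemic terms can be made concrete with a minimal sketch. It assumes a one-parameter model $y = \theta x + \epsilon$ with a Gaussian prior and Gaussian noise; the model, the function names, and the numbers are illustrative assumptions, not taken from the lecture. In this conjugate case the posterior is available in closed form, and a Monte Carlo estimate of the predictive integral should agree with it.

```python
import random

def posterior(xs, ys, prior_var=1.0, noise_var=0.25):
    """Posterior N(mean, var) over theta for y = theta * x + eps.

    Assumes theta ~ N(0, prior_var) and eps ~ N(0, noise_var) (illustrative choices).
    """
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    var = 1.0 / precision
    mean = var * sum(x * y for x, y in zip(xs, ys)) / noise_var
    return mean, var

def predictive(x_star, post_mean, post_var, noise_var=0.25):
    """Closed-form predictive mean plus the two variance contributions."""
    epistemic = x_star ** 2 * post_var   # from p(theta | X, y)
    aleatoric = noise_var                # from p(y* | x*, theta)
    return post_mean * x_star, epistemic, aleatoric

def predictive_mc(x_star, post_mean, post_var, noise_var=0.25, n=20000, seed=0):
    """Monte Carlo version of the integral: sample theta, then sample y*."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        theta = rng.gauss(post_mean, post_var ** 0.5)
        samples.append(rng.gauss(theta * x_star, noise_var ** 0.5))
    m = sum(samples) / n
    v = sum((s - m) ** 2 for s in samples) / n
    return m, v

xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
post_mean, post_var = posterior(xs, ys)
mean, epistemic, aleatoric = predictive(2.0, post_mean, post_var)
mc_mean, mc_var = predictive_mc(2.0, post_mean, post_var)
```

The epistemic term shrinks as more data are observed, while the aleatoric term stays at the noise level.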

##### Why don’t we assume priors on functions directly?

A prior over functions = a prior over infinite dimensional vectors = a collection of infinitely many random variables

A stochastic process is a collection of random variables indexed by some input set.

$\{ f(x) \}_{x \in R}$

Finite-dimensional distributions of stochastic processes

Gaussian processes (GP) : a stochastic process whose finite-dimensional distribution is a multivariate Gaussian.

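For any $n \in \mathbb{N}$ and any inputs $x_1, \dots, x_n$, the vector $(f(x_1), \dots, f(x_n))$ is multivariate Gaussian with mean entries $\mu(x_i)$ and covariance entries $K(x_i, x_j)$, where

$\mu(x) = E[f(x)], \quad K(x, x') = E[(f(x) - \mu(x))(f(x') - \mu(x'))]$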
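
A minimal sketch of what the finite-dimensional view means in practice: to draw a GP prior sample at $n$ inputs, build the $n \times n$ covariance matrix and sample from the corresponding multivariate Gaussian. The squared-exponential kernel, the zero mean, and the jitter value are assumptions chosen for illustration; the lecture does not fix a particular kernel.

```python
import math
import random

def rbf_kernel(x, x_prime, lengthscale=1.0):
    """Squared-exponential covariance K(x, x'); an assumed, common choice."""
    return math.exp(-0.5 * (x - x_prime) ** 2 / lengthscale ** 2)

def gram_matrix(xs, kernel=rbf_kernel, jitter=1e-6):
    """Covariance matrix [K(x_i, x_j)] with a small diagonal jitter for stability."""
    return [[kernel(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_gp_prior(xs, rng, mean_fn=lambda x: 0.0):
    """Draw (f(x_1), ..., f(x_n)) ~ N(mu, K) as mu + L * eps with eps ~ N(0, I)."""
    low = cholesky(gram_matrix(xs))
    eps = [rng.gauss(0.0, 1.0) for _ in xs]
    return [mean_fn(x) + sum(low[i][k] * eps[k] for k in range(i + 1))
            for i, x in enumerate(xs)]

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
draw = sample_gp_prior(xs, random.Random(0))
```

Each call with a fresh random state returns the values of one sampled function at the chosen inputs.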

Neural Process (NP) : Using neural networks to construct stochastic processes, a type of “implicit” stochastic processes

$f(x) = g(x, z; \theta), z \sim p(z)$
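
The idea of an implicit stochastic process can be sketched as follows: fixing one latent sample $z$ gives one deterministic function of $x$, and resampling $z$ gives another. The tiny network below uses fixed random weights as a stand-in for trained parameters $\theta$; its architecture, `make_decoder`, and `sample_function` are hypothetical names for illustration and are not the NP architecture from the lecture.

```python
import math
import random

def make_decoder(hidden=8, seed=0):
    """Return g(x, z) with fixed random weights standing in for trained theta."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0) / math.sqrt(hidden) for _ in range(hidden)]

    def g(x, z):
        # One hidden tanh layer acting on the concatenated input (x, z).
        h = [math.tanh(w[0] * x + w[1] * z + b) for w, b in zip(w1, b1)]
        return sum(wi * hi for wi, hi in zip(w2, h))

    return g

def sample_function(g, rng):
    """Sample z ~ p(z) = N(0, 1) once and return the function f(x) = g(x, z)."""
    z = rng.gauss(0.0, 1.0)
    return lambda x: g(x, z)

g = make_decoder()
f_a = sample_function(g, random.Random(1))
f_b = sample_function(g, random.Random(2))
grid = [i / 4 for i in range(-4, 5)]
curve_a = [f_a(x) for x in grid]
curve_b = [f_b(x) for x in grid]
```

Here `curve_a` and `curve_b` are two different function draws from the same process, evaluated on the same grid.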

Conditional Neural Process -> Neural Processes

Other NPs
– Attentive neural processes, Kim et al, 2019
– Sequential NP, Singh et al, 2019
– Convolutional CNP, Gordon et al, 2020
– EquivCNP, Kawano et al, 2021
– Bootstrapping (A)NP, Lee et al, 2020

2022.2.16

### Lecture#3 Kernel based Embedding, Prof. 신현정, Ajou University

0. Kernel Method

Linear method + embedding in feature space

1. Hilbert Space

Topological Space ⊃ Metric Space ⊃ Normed Vector Space ⊃ Inner Product Space ⊃ Hilbert Space

A Hilbert space is a generalization of finite-dimensional vector spaces with an inner product to possibly infinite dimension. Most interesting infinite-dimensional vector spaces are function spaces, and Hilbert spaces are the simplest among them.

It allows the tools of linear algebra and calculus (e.g., differential and integral calculus) to be applied to such spaces.

Any continuous linear functional on a Hilbert space is given by an inner product with a vector (Riesz Representation Theorem)

Riesz Representation Theorem

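Let H be a Hilbert space over $\mathbb{R}$ or $\mathbb{C}$, and let H* be its dual space (the continuous linear functionals on H). If f ∈ H*, then there exists a unique vector u in H such that

$f(v) = \langle v, u \rangle_H \text{ for all } v \in H$

That is, $\langle v, u \rangle_H$ gives the evaluation of f at $v$, i.e. $f(v)$.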

Reproducing Kernel Hilbert Space

A Hilbert space of functions in which evaluation at each point, $f \mapsto f(x)$, is a continuous linear functional.

$H = \{ f : f(z) = \sum^k_{j=1} \alpha_j \phi_{x_j} (z), \ k \in N_+, \ \alpha_j \in \mathbb{R}, \ x_j \in X \}$

Definition of “Reproducing”

$f(x) = < f, \phi_x >_H$

for all functions $f \in H$, where $\phi_x \in H$ is the mapping function associated with the point $x$.
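
As a quick check of the reproducing property, assume (notation not in the notes) that the mapping function is the kernel section $\phi_x = K(\cdot, x)$ with $K$ symmetric, so that $\langle \phi_{x_j}, \phi_x \rangle_H = K(x_j, x) = \phi_{x_j}(x)$. For a finite expansion $f = \sum_j \alpha_j \phi_{x_j}$, linearity of the inner product then gives

$\langle f, \phi_x \rangle_H = \sum^k_{j=1} \alpha_j \langle \phi_{x_j}, \phi_x \rangle_H = \sum^k_{j=1} \alpha_j K(x_j, x) = \sum^k_{j=1} \alpha_j \phi_{x_j}(x) = f(x)$

So taking the inner product with $\phi_x$ "reproduces" the value of $f$ at $x$.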

Mercer’s Theorem

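If $K : X \times X \to \mathbb{R}$ is a continuous, symmetric, positive semidefinite function, then

$K(u, v) = \sum^{\infty}_{i=1} \lambda_i \psi_i(u) \psi_i(v) = \langle \phi(u), \phi(v) \rangle_H = \phi(u)^T \phi(v)$

where $\lambda_i \geq 0$ and $\psi_i$ are the eigenvalues and eigenfunctions of the integral operator induced by $K$, and $\phi(u) = (\sqrt{\lambda_i} \psi_i(u))_{i \geq 1}$.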
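
The identity between a kernel value and an inner product of feature vectors can be checked numerically for one concrete case. The sketch below assumes the homogeneous degree-2 polynomial kernel on 2-D inputs, whose explicit feature map is known in closed form; this kernel is an illustrative choice and is not taken from the lecture.

```python
import math

def poly_kernel(u, v):
    """K(u, v) = (u . v)^2 on 2-D inputs, computed without any feature map."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def feature_map(u):
    """Explicit phi(u) in R^3 such that <phi(u), phi(v)> = (u . v)^2."""
    return [u[0] ** 2, math.sqrt(2.0) * u[0] * u[1], u[1] ** 2]

def inner(a, b):
    """Standard Euclidean inner product."""
    return sum(x * y for x, y in zip(a, b))

u, v = [1.0, 2.0], [3.0, -1.0]
k_direct = poly_kernel(u, v)                          # kernel evaluation only
k_explicit = inner(feature_map(u), feature_map(v))    # via the feature space
```

Both routes give the same number, but `poly_kernel` never constructs the feature vectors, which is the point of evaluating only through inner products.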

Symbolically, data points are mapped to the RKHS feature space.
Operationally, they are only evaluated through inner products.

References

Learning with Kernels, Bernhard Schölkopf and Alexander J. Smola, The MIT Press
Kernel Methods for Pattern Analysis, John Shawe-Taylor and Nello Cristianini, Cambridge University Press

2022.2.17

### Lecture#5 Forecasting Future of Video Frames, Prof. 홍승훈, KAIST

Observed (context) frames : $x_1, x_2, ..., x_C \rightarrow x_{1:C}$

Prediction : $x_{C+1}, ..., x_T \rightarrow x_{C+1:T}$

$p(x_{C+1}, ..., x_T | x_1, x_2, ..., x_C) \rightarrow p(x_{C+1:T} | x_{1:C})$

A video is a sequence => Sequence Modeling

Challenges
– Continuous
– High-Dimensional Sequence

Prediction can be transformed to Synthesis

$p(x_{1:T}) = p(x_{C+1:T} | x_{1:C}) p(x_{1:C}) = p(x_{ \leq T} )$

Image Autoregressive Model

$p(x_{ \leq T }) = p(x_T | x_{< T}) p(x_{< T}) = \prod^T_{t=1} p(x_t | x_{< t})$
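
The chain-rule factorization can be sketched on a toy discrete sequence. The conditional below (a Laplace rule-of-succession model over binary symbols) is a hypothetical stand-in for a learned $p(x_t | x_{<t})$, chosen only because it depends on the whole history and is easy to verify by hand. Prediction given a context then follows as the ratio $p(x_{1:T}) / p(x_{1:C})$.

```python
import itertools
import math

def conditional(history, x_t):
    """Hypothetical p(x_t | x_<t) for binary symbols (Laplace rule of succession)."""
    p_one = (1 + sum(history)) / (2 + len(history))
    return p_one if x_t == 1 else 1.0 - p_one

def joint_log_prob(seq):
    """log p(x_1, ..., x_T) = sum_t log p(x_t | x_<t)."""
    return sum(math.log(conditional(seq[:t], seq[t])) for t in range(len(seq)))

def prediction_log_prob(future, context):
    """log p(x_{C+1:T} | x_{1:C}) = log p(x_{1:T}) - log p(x_{1:C})."""
    full = tuple(context) + tuple(future)
    return joint_log_prob(full) - joint_log_prob(tuple(context))

def sample(length, rng):
    """Ancestral sampling: draw x_t from p(x_t | x_<t) one step at a time."""
    seq = []
    for _ in range(length):
        seq.append(1 if rng.random() < conditional(seq, 1) else 0)
    return seq

# The factorized model defines a proper distribution over all length-4 sequences.
total = sum(math.exp(joint_log_prob(s)) for s in itertools.product([0, 1], repeat=4))
```

A video model replaces the binary symbols with frames and the hand-written conditional with a neural network, but the factorization and the sampling loop keep the same shape.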

LSTM Model

$p(x_t | x_{< t}) = g_{\theta}(h_t)$
– No stochasticity.
– Errors accumulate.
– High computational cost.
– Hard to generalize.

### Research Agendas in Video Prediction (VP)

– Stochastic Models
– State-space Models
– Decomposing factors of variations
– Discrete Representation (Vector Quantization)
– Continuous-time Models

① Stochastic Models
– Denton et al., Stochastic Video Generation with a Learned Prior

② State-space Models
– Villegas et al., High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks
– Krishnan et al., Structured Inference Networks for Nonlinear State Space Models, AAAI 2017
– Clark et al., Adversarial Video Generation On Complex Datasets
– Tian et al., A Good Image Generator Is What You Need For High-Resolution Video Synthesis
– Franceschi et al., Stochastic Latent Residual Video Prediction

③ Decomposing factors of variations
– Tulyakov et al., MoCoGAN: Decomposing Motion and Content for Video Generation
– Tian et al., A Good Image Generator Is What You Need For High-Resolution Video Synthesis
– Saxena et al., Clockwork Variational Autoencoders
– Franceschi et al., Stochastic Latent Residual Video Prediction
– Lee et al., Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

④ Discrete representation
– Yan et al., VideoGPT: Video Generation using VQ-VAE and Transformers
– Wu et al., NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

⑤ Continuous-time models
– Skorokhodov et al., StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

### Lecture#6 Neural Network Models for Graph Data and Their Applications, Prof. 김동우, POSTECH

GNN : Graph Neural Network

Goal : learn from graph-structured data with a neural network.

Naive Approach
=> concatenate the adjacency matrix and the feature vectors

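Problems
– The number of parameters grows with the graph size.
– The model cannot be reused when the number of nodes differs.
– It is not invariant to reordering of the nodes in the adjacency matrix.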

GNN : node embedding ( vector representation )

Design a Graph Encoder
– Basic Idea : see neighbor nodes
– Recursive -> Tree Structure
– Two steps :
① Aggregation ( Mean, …)
② Combine ( FC layers, … )

In practice,

$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X W$

$\tilde{A} = A + I_N, \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$
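
The propagation rule above can be sketched in a few lines of plain Python. The 3-node path graph, the feature matrix, and the weights are made-up illustration values, and $\tilde{D}$ is taken to be the degree matrix of $\tilde{A}$; no nonlinearity is applied, matching the formula as written.

```python
import math

def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj, features, weight):
    """Z = A_hat X W: aggregate over neighbors (and self), then combine linearly."""
    return matmul(normalized_adjacency(adj), matmul(features, weight))

# Path graph 0 - 1 - 2 with 2-D node features and a 2x1 weight matrix.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0], [2.0]]
Z = gcn_layer(adj, features, weight)
```

Because `weight` has a shape that depends only on the feature dimension, the same layer can be applied to a graph with a different number of nodes, and relabeling the nodes simply permutes the rows of `Z`.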

The mechanism for computing node embeddings leads to inductive capability.
– Embeddings can be computed for graphs that were not observed during training.

Unsupervised Approach
– “similar” nodes have similar embeddings

– Attribute Masking / Edge prediction / Context prediction

Representation Power of GNN
– graph isomorphism test -> injectivity

Mean pooling / Max pooling are not injective

Injective Multi-set Function

$\phi(\sum_{x \in S} f(x))$

– sum pooling
– total pairwise squared distance
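
A minimal sketch of why mean and max pooling fail to be injective on multisets while sum pooling can succeed. The two multisets and the one-hot encoding $f$ are illustrative assumptions; with one-hot $f$, the sum recovers the element counts, which is what mean and max discard.

```python
def one_hot(x, vocab):
    """f(x): one-hot encoding of a node label (an assumed choice of f)."""
    return [1.0 if x == v else 0.0 for v in vocab]

def sum_pool(multiset, vocab):
    vecs = [one_hot(x, vocab) for x in multiset]
    return [sum(col) for col in zip(*vecs)]

def mean_pool(multiset, vocab):
    return [s / len(multiset) for s in sum_pool(multiset, vocab)]

def max_pool(multiset, vocab):
    vecs = [one_hot(x, vocab) for x in multiset]
    return [max(col) for col in zip(*vecs)]

vocab = ["a", "b"]
s1 = ["a", "b"]
s2 = ["a", "a", "b", "b"]   # same proportions and same support, different counts

mean_equal = mean_pool(s1, vocab) == mean_pool(s2, vocab)   # mean cannot tell them apart
max_equal = max_pool(s1, vocab) == max_pool(s2, vocab)      # max cannot either
sum_equal = sum_pool(s1, vocab) == sum_pool(s2, vocab)      # sum distinguishes them
```

In this example the two neighborhoods collapse to the same vector under mean and max pooling, so a GNN using those aggregators could not separate them.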

2022.2.18

### Lecture#7 Recent Advances in Text-to-Image Generation Models, Dr. 김세훈, Kakao Brain

Auto Regressive Image Generation

Oord et al, Pixel Recurrent Neural Networks, ICML 2016
Salimans et al, PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications, ICLR 2017

VQ(Vector Quantization)- VAE
Oord et al, Neural Discrete Representation Learning, NIPS 2017

DALL-E
Ramesh et al, Zero-Shot Text-to-Image Generation, ICML 2021

Image-Text Pair Datasets
MSCOCO / CC3M / CC12M / WIT / CLIP / ALIGN

VQ-GAN
Esser et al, Taming Transformers for High-Resolution Image Synthesis, CVPR 2021

Diffusion-based Model

DDPM
Ho et al, Denoising Diffusion Probabilistic Models, NeurIPS 2020

GLIDE
Nichol et al, GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, arXiv 2021

Their Approach (Kakao)
– minDALL-E ; VQGAN + Transformer 1D
– RQ-VAE ; Residual-Quantized VAE

### Lecture#8 Contrastive Learning: Backgrounds, theory and video applications, Prof. 김은솔, Hanyang University

Representation Learning
Pre-training, Fine-tuning, Zero-shot Learning

[NeurIPS 2021 Tutorial] Self-Supervised Learning: Self-prediction and Contrastive Learning
– Self-prediction & Contrastive
Self-prediction : “intra-sample” prediction
Contrastive : “inter-sample” prediction

Self-prediction
– Auto-regressive prediction : WaveNet, GPT, PixelRNN, PixelCNN
> Early Work : Word2Vec, CBOW & Skip-gram
> BERT
> Vision Transformer

Contrastive
– A very simple idea: make similar data close together in feature space and unrelated data far apart.
– Early Work
> Hadsell et al, Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006 ( first proposal of the contrastive loss )
> Triplet Loss, Schroff et al, FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015
> Lifted Structured Embedding

– MoCo : He et al, Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020

– SimCLR : Chen et al, A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020

Training with negative samples is very important!
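
The role of negatives can be seen in a minimal sketch of an InfoNCE-style contrastive loss, the kind of objective used by methods such as MoCo and SimCLR. The 2-D embeddings, the temperature value, and the function names are illustrative assumptions, not the exact formulation of either paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s_pos / t) / (exp(s_pos / t) + sum_k exp(s_neg_k / t)) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

anchor = [1.0, 0.0]
close = [0.9, 0.1]      # embedding of a "similar" sample
far = [0.0, 1.0]        # embedding of an "unrelated" sample
other = [-1.0, 0.0]

good = info_nce(anchor, close, [far, other])   # positive is near the anchor: low loss
bad = info_nce(anchor, far, [close, other])    # a negative is nearer: high loss
no_negatives = info_nce(anchor, far, [])       # without negatives the loss is trivially 0
```

With an empty negative set the loss is zero regardless of how far the positive is, so nothing prevents all embeddings from collapsing; the negatives are what force unrelated samples apart.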

Wang et al, Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML 2020

Video Representation Learning

Video datasets differ markedly from one another in their characteristics, so it is important to check whether a dataset suits the learning objective.

> Input : Image 1024 dim, Audio 128 dim
> Sequence : 230.2 sec
> Output : Multi-Label Class labels
> Data : 6.1M clips / 1.53 TB

Video Datasets
> One Action with Simple Text
– Kinetics 400/600/700
– UCF 101
> Several Actions with Few Descriptions
– Something-Something
– FineGym
> Long and Multimodal
– HowTo100M

Video Representation Learning with Transformers

• Sun et al, VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019
• Sun et al, Learning Video Representations using Contrastive Bidirectional Transformer, arXiv 2019
• Miech et al, End-to-End Learning of Visual Representations from Uncurated Instructional Videos, CVPR 2020
• Luo et al, UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020