PRML 2022 Winter School


Lecture#1 Gaussian Processes and Neural Processes, 이주호 교수, KAIST


Bayesian Regression : \theta \sim p(\theta) \text{: uncertainty} 가정.

p(y_*|x_*, X, y) = \int p(y_*|x_*,\theta) p(\theta | X, y) d \theta

data uncertainty (aleatoric) due to observation noise : p(y_*|x_*,\theta)
model uncertainty (epistemic) due to the parameter uncertainty : p(\theta | X, y)

Why don’t we assume priors on functions directly?

A prior over functions = a prior over infinite dimensional vectors = a collection of infinitely many random variables

A stochastic process is a collection of random variables indexed by some input set.

\{ f(x) \}_{x \in R}

Finite-dimensional distributions of stochastic processes

Gaussian processes (GP) : a stochastic process whose finite-dimensional distribution is a multivariate Gaussian.

\text{For any } n \in N, \mu(x) = E[f(x)]. K(x, x') = E[(f(x-\mu(x))(f(x' - \mu(x'))]


Neural Process (NP) : Using neural networks to construct stochastic processes, a type of “implicit” stochastic processes

f(x) = g(x, z; \theta), z \sim p(z)

Conditional Neural Process -> Neural Processes

Other NPs
– Attentive neural processes, Kim et al, 2019
– Sequential NP, Singh et al, 2019
– Convolutional CNP, Gordon et al, 2020
– EquivCNP, Kawano et al, 2021
– Bootstrapping (A)NP, Lee et al, 2020

Lecture#2 심층학습기반의 자연어처리 동향, 이근배 교수, POSTECH


Lecture#3 Kernel based Embedding, 신현정 교수, 아주대학교


0. Kernel Method

Linear method + embedding in feature space

1. Hilbert Space

Topological Space ⊃ Metric Space ⊃ Normed Vector Space ⊃ Inner Product Space ⊃ Hilbert Space

A Hilbert space is a generalization of finite dimensional vector spaces with inner product to possibly infinite dimension. Most of interesting infinite dimensional vector spaces are function
spaces. Hilbert spaces are the simplest among such spaces.

It allows linear algebra and calculus to spaces (Ex) differential and integral calculus

Any continuous linear functional on a Hilbert space is given by an inner product with a vector (Riesz Representation Theorem)

Riesz Representation Theorem

Let H be a Hilbert space over H. If f ∈ H*, then there exists a unique vector u in H such that

f(\nu) = < \nu, u >_H \text {for all } \nu \in H


< \nu, u >_H represents the evaluation of f at \nu, f(\nu)

Reproducing Kernel Hilbert Space

A Hilbert space, in which each point of the space is a continuous linear function

H = \{ f(z): \sum^k_{j=1} \alpha_j \phi_{x_j} (z), \forall k \in N_+ \text {and } x_j \in X \}


Definition of “Reproducing”

f(x) = < f, \phi_x >_H

where for all functions f \in H \text {and } \phi_x \in H is the mapping function


Mercer’s Theorem

If K ∈ H is a continuous symmetric and positive semidefinite function,

K(u, v) = \sum^{\infty}_{i=1} \lambda_i \psi_i(u) \psi_i(\nu) = < \phi(u), \phi(\nu)>_H = \phi(u)^T \phi(\nu)


Symbolically, data points are mapped to RKHS feature space
Operationally, they are only evaluated with inner products



Learning with Kernels, Bernhard Scholkopf and Alexander J. Smola, The MIT Press
Convex Optimization, John Shawe-Taylor and Nello Cristianini, Cambridge University Press

Lecture#4 Random Projections and Fourier Features, 최승진 교수, BARO AI


Lecture#5 Forecasting Future of Video Frames, 홍승훈 교수, KAIST


관측 프레임 : Context Frame x_1, x_2, ..., x_C \rightarrow x_{1:C}  

Prediction : x_{C+1}, ..., x_T \rightarrow x_{C+1:T}


p(x_{C+1}, ..., x_T | x_1, x_2, ..., x_C) \rightarrow p(x_{C+1:T} | x_{1:C})


Videos is a Sequences => Sequence Modeling

– Continuous
– High-Dimensional Sequence

Prediction can be transformed to Synthesis

p(x_{1:T} = p(x_{C+1:T} | x_{1:C}) p(x_{1:C}) = p(x_{ \leq T} )


Image Autoregressive Model

p(x_{ \leq T }) = p(x_T | x_{< T}) p(x_{< T}) = \prod^T_{t=1} p(x_t | x_{< t})


LSTM Model

p(x_t | x_{\leq t}) = g_{\theta}(h_t)
– Stochasticity 가 없다.
– Error 가 축적된다.
– 계산 비용 높다
– 일반화가 어렵다.

Research Agendas in VP

– Stochastic Models
– State-space Models
– Decomposing factors of variations
– Discrete Representation (Vector Quantization)
– Continuous-time Models

① Stochastic Models
– Denton et al., Stochastic Video Generation with a Learned Prior

② State-space Models
– Villegas et al., High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks
– Structured Inference Networks for Nonlinear State Space Models, In AAAI, 2017
– Clark et al., Adversarial Video Generation On Complex Datasets
– Tian et al., A Good Image Generator Is What You Need For High-Resolution Video Synthesis
– Franceschi et al., Stochastic Residual Video Prediction

③ Decomposing factors of variations
– Tulyakov et al., MoCoGAN: Decomposing Motion and Content for Video Generation
– Tian et al., A Good Image Generator Is What You Need For High-Resolution Video Synthesis
– Sexena et al., Clockwork Variational Autoencoders
– Villegas et al., Stochastic Residual Video Prediction
– Lee et al., Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

④ Discrete representation
– Yan et al., VideoGPT: Video Generation using VQ-VAE and Transformers
– Wu et al., NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

⑤ Continuous-time models
– Skorokhodov et al, StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2

Lecture#6 그래프 데이터를 위한 신경망 모형과 응용, 김동우 교수, POSTECH


GNN : Graph Neural Network

목적 : Neural Network 를 통해서 그래프를 학습 시킴.

Naive Approach
=> concatenate adjacent matrix and feature vector

– parameter 개수가 그래프 Size 에 비례
– graph node 개수가 다르면 재활용 불가
– 인접 행렬 Node 가 reordering 가능하지 않음

GNN : node embedding ( vector representation )

Design a Graph Encoder
– Basic Idea : see neighbor nodes
– Recursive -> Tree Structure
– Two steps :
① Aggregation ( Mean, …)
② Combine ( FC layers, … )

In Practice,

Z = \tilde{D}^{-\dfrac{1}{2}} \tilde{A} \tilde{D}^{-\dfrac{1}{2}} X W


\tilde{A} = A + I_N


Node 들의 Embedding 산출 Mechanism => Inductive Capability 로 연결 됨.
– 관찰되지 않은 그래프에 대해서 Embedding 산출 가능


Unsupervised Approach
– “similar” nodes have similar embeddings

Node Level Tasks
– Attribute Masking / Edge prediction / Context prediction

Representation Power of GNN
– graph isomorphism test -> injectivity

Mean pooling / Max pooling are not injective

Injective Multi-set Function

\phi(\sum_{x \in S} f(x))

– sum pooling
– total pairwise squared distance

Adversarial attacks on GNN


Lecture#7 Recent Advances in Text-to-Image Generation Models, 김세훈 박사, Kakao Brain


Auto Regressive Image Generation

Oord et al, Pixel Recurrent Neural Networks, ICML 2016

VQ(Vector Quantization)- VAE
Oord et al, Neural Discrete Representation Learning, NIPS 2017

Ramesh, Zero-Shot Text-to-Image Generation, ICML 2021

Image-Text Pair Datasets

Esser, Taming Transformers for High-Resolution Image Synthesis, CVPR 2021


Diffusion-based Model

Jonathan Ho et al, Denoising diffusion probabilistic models, NIPS 2020

Nichol et al, GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, arXiv 2021

Their Approach (Kakao)
– minDALL-E ; VQGAN + Transformer 1D
– RQ-VAE ; Residual-Quantized VAE

Lecture#8 Contrastive Learning: Backgrounds, theory and video applications, 김은솔 교수, 한양대


Representation Learning
Pre-training, Fine-tuning, Zero-shot Learning

[NeurIPS 2021 Tutorial] Self-Supervised Learning: Self-prediction and Contrastive Learning
– Self-prediction & Contrastive
Self-prediction : “intra-sample” prediction
Contrastive : “inter-sample” prediction

– Auto-regressive prediction : WaveNet, GPT, PixelRNN, PixelCNN
– Masked Generation
  > Early Work : Word2Vec, CBOW & Skip-gram
  > BERT
  > Vision Transformer


– 비슷한 데이터들은 feature space 상에서 가깝게 하고 상관없는 데이터들은 feature space 상에서 멀게 한다는 매우 간단한 아이디어!

– Early Work
  > Hadsell et al, Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006
     ( Contrast Loss 최초 제안 )

  > Triplet Loss, Schroff et al, FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015
  > Lifted Structured Embedding

– MOCO : KaimingHe et al, Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020

– SimCLR : Chen et al, A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020

Training with Negative Samples are very important !!!

Wang et al, Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML 2020



Video Representation Learning

비디오 데이터는 데이터 마다 특징이 확연히 다르다. 학습 목표에 따른 적합한 데이터 인지가 중요

– Youtube 8M Challenge
  > Input : Image 1024 dim, Audio 128 dim
  > Sequence : 230.2 sec
  > Output : Multi-Label Class labels
  > Data : 6.1M clips / 1.53 TB

Video Datasets
  > One Action with Simple Text
    – Kinetics 400/600/700
    – UCF 101
  > Several Actions with Few Descriptions
    – Something-Something
    – FineGym
  > Long and Multimodal
    – Youtube 8M
    – HowTo100M

Video Representation Learning with Transformers

  • Sun et al, VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019
  • Sun et al, Learning Video Representations using Contrastive Bidirectional Transformer, arXiv 2019
  • Miech et al, End-to-End Learning of Visual Representations from Uncurated Instructional Videos, CVPR 2020
  • Luo et al, UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020




Leave a Comment