I signed up for the 2020 Artificial Intelligence Society winter course and attended it for three days, from January 8 to January 10, 2020. There were nine speakers in total, and the program schedule was as follows.

I had planned to cover the whole course in a single post, but each topic turned out to have enough content that I decided to write a series, one post per lecture. This second post covers the lecture "Pretrained Language Model" by Professor Gunhee Kim of Seoul National University.
- In natural language processing, the meaning of a word changes depending on the context in which it is used.
 - Lecture topic : How to get contextualized word representations?
 - Language Model
- Word Sequence Likelihood (chain rule; see the sketch below)
- p(x_{1:T}) = p(x_1) \prod^{T}_{t=2} p(x_t \mid x_{1:t-1})
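
As a minimal sketch (my own illustration, not from the lecture), the chain-rule factorization above just multiplies per-step conditional probabilities; the toy numbers below are made up:

```python
import math

# Made-up conditional probabilities p(x_t | x_{1:t-1}) for a 4-token sequence.
step_probs = [0.20, 0.50, 0.10, 0.30]   # p(x_1), p(x_2|x_1), p(x_3|x_1:2), p(x_4|x_1:3)

# Sequence likelihood is the product of the per-step conditionals.
likelihood = math.prod(step_probs)

# In practice the log-likelihood is accumulated instead, to avoid numerical underflow.
log_likelihood = sum(math.log(p) for p in step_probs)

print(likelihood)                 # 0.003
print(math.exp(log_likelihood))   # same value, recovered from log space
```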
 
 - Importance of Language Models
- Play a key role in several NLP tasks, e.g. speech recognition, machine translation
 
 - Purpose of language model pretraining
- Improve downstream task performance
 
 
 - ELMo
- Peters et al, Deep contextualized word representations, NAACL 2018
 - Word representations from an RNN-based bidirectional language model
 - How to use ELMo embeddings?
- Concatenate them with the existing word embeddings and use them without modifying the downstream task architecture (see the sketch below)
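
A rough sketch of this usage (my own illustration; the array names and sizes are hypothetical): the contextualized ELMo vector for each token is simply concatenated with its static word embedding, so the downstream model only sees a wider input.

```python
import numpy as np

rng = np.random.default_rng(0)
static_emb = rng.standard_normal((5, 300))    # context-free embeddings (e.g. GloVe-style) for 5 tokens
elmo_emb   = rng.standard_normal((5, 1024))   # contextualized ELMo vectors for the same 5 tokens

# Concatenate per token; the downstream architecture is otherwise unchanged.
combined = np.concatenate([static_emb, elmo_emb], axis=-1)
print(combined.shape)   # (5, 1324)
```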
 
 
 - BERT
- Devlin et al, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
 - Problems with RNN-based language models
- Hard to parallelize efficiently
 - Backpropagation through the sequence -> vanishing / exploding gradients
 - Transmitting local/global information through a single vector -> difficult to model long-term dependencies
 
 - Drawback of autoregressive language models
- Unidirectional
 
 - Self-Attention
- General attention query procedure
- Compute similarity scores between the query and the keys
 - Softmax
 - Weighted value sum (soft attention)
 
 - self-attention
- key = query : the queries and the keys both come from the words of the input sequence itself
 - Self-attention directly connects words at all positions in the sequence (see the sketch below)
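
A minimal NumPy sketch of the scaled dot-product self-attention described above (my own illustration, not code from the lecture); queries, keys, and values are all projections of the same sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Q, K, V all come from the same sequence X, so every position attends to every other."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of each query with every key
    weights = softmax(scores, axis=-1)        # softmax over key positions
    return weights @ V                        # weighted sum of values (soft attention)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                                 # 5 tokens, 16-dim representations
Wq, Wk, Wv = (rng.standard_normal((16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                       # (5, 16)
```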
 
 
 - BERT : Bidirectional Encoder Representations from Transformers
- Use Transformer as a Backbone Language Model Network
 - Pretrain with two losses :
- Masked Language Model
 - Next Sentence Prediction (NSP)
 
 
 - Transformer
- Vaswani et al, Attention Is All You Need, NIPS 2017
 - Multi-head Attention (see the sketch below)
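
A compact sketch of the multi-head idea (my own illustration; the per-head Q/K/V projections are omitted to keep it short): the model dimension is split into several heads, each head attends independently, and the outputs are concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads=4):
    """Split features into n_heads slices, run attention per head, concatenate the results.
    Per-head Q/K/V projection matrices are omitted for brevity."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    outputs = []
    for h in range(n_heads):
        Xh = X[:, h * d_head:(h + 1) * d_head]       # this head's slice of the features
        scores = Xh @ Xh.T / np.sqrt(d_head)         # scaled dot-product within the head
        outputs.append(softmax(scores) @ Xh)
    return np.concatenate(outputs, axis=-1)          # back to shape (T, d_model)

X = np.random.default_rng(1).standard_normal((5, 16))
print(multi_head_attention(X).shape)   # (5, 16)
```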
 
 - Comparison with ELMo
- ELMo : Autoregressive LM
 - BERT : Masked LM
 
 - Pretraining
- Masked LM : bidirectional conditioning
- Mask some percentage (15%) of the input tokens at random (see the sketch below)
 - Predict the original value of the masked words
 - Mismatch between pre-training (fill-in-the-blank with [MASK] tokens) and fine-tuning (downstream tasks, where no [MASK] token appears)
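
A rough sketch of the masking step (my own illustration; BERT additionally keeps 10% of the selected tokens unchanged and replaces another 10% with random tokens, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Replace roughly 15% of tokens with [MASK]; the model must predict the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)      # original token to recover at this position
        else:
            masked.append(tok)
            targets.append(None)     # no masked-LM loss at unmasked positions
    return masked, targets

sentence = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```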
 
 - Next Sentence Prediction
- A loss that captures the relationship between two sentences, e.g. labeling the pair as IsNext or NotNext
 
 - Input representation
- Token embedding
 - Segment embedding
 - Position embedding (the three embeddings are summed for each token; see the sketch below)
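
A minimal sketch of how the three embeddings are combined (my own illustration with made-up table sizes): each position's input vector is the element-wise sum of its token, segment, and position embeddings; the segment ids also encode the sentence pair used for NSP.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, d_model = 100, 32, 2, 16

# Hypothetical, randomly initialized embedding tables.
token_table    = rng.standard_normal((vocab_size, d_model))
segment_table  = rng.standard_normal((n_segments, d_model))
position_table = rng.standard_normal((max_len, d_model))

# Example sentence pair "[CLS] a1 a2 [SEP] b1 b2 [SEP]" as made-up token ids;
# segment 0 marks the first sentence, segment 1 the second (as used for NSP).
token_ids   = np.array([1, 7, 8, 2, 9, 10, 2])
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
positions   = np.arange(len(token_ids))

# BERT's input representation: element-wise sum of the three embeddings.
input_repr = token_table[token_ids] + segment_table[segment_ids] + position_table[positions]
print(input_repr.shape)   # (7, 16)
```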

 
 - Use BERT embeddings
- Pretrain on a large corpus
 - Fine-tune for downstream tasks with a few task-specific layers on top of BERT (see the sketch below)
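
As a concrete sketch of this recipe (my own example using the Hugging Face `transformers` library, which the lecture did not prescribe): `BertForSequenceClassification` loads pretrained BERT weights plus a small classification head, and the whole model is then fine-tuned on labelled downstream data.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pretrained BERT encoder + a task-specific classification head on top.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One toy labelled example; a real setup would loop over a downstream dataset.
inputs = tokenizer("this lecture was really helpful", return_tensors="pt")
labels = torch.tensor([1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)   # forward pass returns the classification loss
outputs.loss.backward()                    # gradients flow into BERT itself, not just the head
optimizer.step()
```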
 
 
 - Performance boost
- GLUE (General Language Understanding Evaluation) : a collection of diverse language understanding tasks
 - SQuAD 1.1 / SQuAD 2.0
- SQuAD : The Stanford Question Answering Dataset
 
 - SWAG : Situations with Adversarial Generations
 
 - Effect of the model size
- BERT is a really BIG model
 - Fine-tuned on a single Cloud TPU with 64GB of RAM
 - Not reproducible on conventional GPUs with 12GB – 16GB of RAM because the maximum batch size that fits in memory is too small
 
 
 - RoBERTa and ALBERT
- RoBERTa
- Robustly optimized BERT approach, proposed by Facebook
 - Liu et al, RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv 2019
 - Characteristics
- Dynamic Masking (see the sketch after this list)
 - Remove NSP loss
 - Training with large batches
 - Text encoding (byte-level BPE)
 - 10x more training time and data
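
A rough sketch contrasting static and dynamic masking (my own illustration): with static masking the mask pattern is fixed during preprocessing and reused every epoch, while RoBERTa re-samples the masked positions each time an example is seen.

```python
import random

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace ~15% of tokens with [MASK] (simplified masking for illustration)."""
    return [("[MASK]" if random.random() < mask_prob else tok) for tok in tokens]

sentence = "the quick brown fox jumps over the lazy dog".split()

# Static masking (BERT): masked once during preprocessing, then reused every epoch.
static = mask_tokens(sentence)
for epoch in range(3):
    print("static :", static)

# Dynamic masking (RoBERTa): a fresh mask pattern each time the example is fed to the model.
for epoch in range(3):
    print("dynamic:", mask_tokens(sentence))
```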
 
 
 - ALBERT
- A Lite BERT
 - Lan et al, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, ICLR 2020
 - Characteristics
- Factorized embedding parameterization (see the sketch after this list)
 - Cross-Layer parameter sharing
 - Inter-sentence coherence loss
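
A quick back-of-the-envelope illustration (my own, with ballpark sizes) of why factorized embedding parameterization saves parameters: instead of one V x H embedding table, ALBERT uses a small V x E table followed by an E x H projection.

```python
# Rough parameter counts with ballpark BERT-base-like sizes (illustrative only).
V = 30000   # vocabulary size
H = 768     # hidden size of the Transformer layers
E = 128     # small embedding size used by ALBERT

bert_style_params   = V * H            # single V x H embedding matrix
albert_style_params = V * E + E * H    # V x E embedding matrix + E x H projection

print(f"V*H       = {bert_style_params:,}")     # 23,040,000
print(f"V*E + E*H = {albert_style_params:,}")   # 3,938,304
```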
 
 
 