I signed up for the 2020 AI Society (인공지능학회) winter lecture program and attended it for three days, from January 8 to 10, 2020. There were nine speakers in total, and the program schedule was as follows.

I originally planned to cover everything in a single post, but each topic turned out to have quite a lot of content, so I am writing the lectures up as a series, one post per lecture. This ninth and final post covers the lecture "Knowledge in Neural NLP" by Prof. Seungwon Hwang of Yonsei University.
- Basic of NLP
- Neural NLP in one hour (This section is based on the lecture below)
- http://videolectures.net/DLRLsummerschool2018_neubig_language_understanding/
 - Main categories of NLP tasks
- Language modeling : P(text)
- “Does this sound like good English/French/…?”
 
 - Text classification : P(label | text)
- “Is this review a positive review?”
 - “What topic does this article belong to?”
 
 - Sequence transduction : P(text | text)
- “How do you say this in Japanese?”
 - “How do you respond to this comment?”
 
 - Language Analysis : P(labels/tree/graph | text)
- “Is this word a person, place or thing?”
 - “What is the syntactic structure of this sentence?”
 - “What is the latent meaning of this sentence?”
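As a side note on the language-modeling task above: P(text) is usually factorized with the chain rule, so the model only ever needs to predict the next word given the words before it:

P(x_1, ..., x_T) = \prod_{t=1}^{T} P(x_t | x_1, ..., x_{t-1})

Both the count-based and neural language models discussed below are just different ways of estimating this conditional term.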
 
 
 - Language Model
- Phenomena to handle
- Morphology / Syntax / Semantics / Discourse / Pragmatics / Multilinguality, …
 - Neural network gives us a flexible tool to handle these
 
 
 - Sentence classification
- Sentiment Analysis : very good / good / neutral / bad / very bad
- I hate this movie => “very bad”
 - I love this movie => “very good”
 
 - A first Try : Bag of Words (BoW)
- each word has its own 5-element weight vector corresponding to [very good, good, neutral, bad, very bad], and a sentence is scored by summing over its words (see the sketch below)

 - But it has problems capturing combination features such as negation:
- I don’t love this movie.
 
 - Continuous Bag of Words (CBoW)
 
 - With n-grams
- Allows us to capture combination features in a simple way
 - ex) “don’t love”, “not the best”

 - Problem
- Leads to sparsity : many of the n-grams will never be seen in a training corpus
 - No sharing between similar words / n-grams
 
 - books.google.com/ngrams
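To make the BoW / bag-of-n-grams idea above concrete, here is a minimal sketch (mine, not from the lecture) using scikit-learn: each word or n-gram gets one learned weight per sentiment class, and a sentence is scored essentially by summing the weights of the features it contains. The toy data and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy 5-way sentiment data (made up for illustration).
texts = ["I love this movie", "I hate this movie",
         "I don't love this movie", "this movie is okay"]
labels = ["very good", "very bad", "bad", "neutral"]

# ngram_range=(1, 1) would be plain BoW; (1, 2) adds bigrams such as "don't love",
# which lets a linear model capture simple combination features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# One learned weight per (feature, class) pair; a sentence's score for a class
# is essentially the sum of the weights of the words / n-grams it contains.
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print(clf.predict(vectorizer.transform(["I don't love this movie"])))
```

The sparsity problem shows up immediately: an n-gram never seen in training (e.g. "not great") contributes nothing, and there is no sharing of strength between similar words.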
 
 - Other approaches
- Time Delay Neural Networks
 - CNN
- Generally 1D convolution \approx Time Delay Neural Network (TDNN)
 - CNNs are great short-distance feature extractors
 - But they don’t have a holistic view of the sentence (see the sketch below)
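A minimal PyTorch sketch of the 1D-convolution (TDNN-style) classifier described above; the sizes and names are illustrative, not from the lecture.

```python
import torch
import torch.nn as nn

class CNNSentenceClassifier(nn.Module):
    """1D convolution over word embeddings (≈ TDNN): each filter sees a short
    window of words; max-pooling then summarizes the whole sentence."""
    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=64,
                 window=3, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=window, padding=1)
        self.out = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        h = torch.relu(self.conv(x))       # short-distance combination features
        h = h.max(dim=2).values            # max over time: a crude sentence summary
        return self.out(h)                 # logits over the 5 sentiment classes

logits = CNNSentenceClassifier()(torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 5])
```

The max-pooling step is only a crude whole-sentence summary, which is the "no holistic view" weakness noted above.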
 
 
 - RNN
- Able to remember long-distance dependencies (see the sketch below)
 - Weakness
- Indirect passing of information => Made better by LSTM/GRU but not perfect
 - Can be slow
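For contrast, a similarly minimal RNN (LSTM) classifier sketch, again with made-up sizes: the recurrent state is passed from word to word, so long-distance information can survive, but only indirectly and one step at a time, which is also why it is slower than the convolution above.

```python
import torch
import torch.nn as nn

class LSTMSentenceClassifier(nn.Module):
    """The recurrent state is threaded through every word, so information can
    survive over long distances -- but only indirectly, one step at a time."""
    def __init__(self, vocab_size=10000, embed_dim=100, hidden=128, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        _, (h_last, _) = self.lstm(self.embed(token_ids))
        return self.out(h_last[-1])            # classify from the final hidden state

print(LSTMSentenceClassifier()(torch.randint(0, 10000, (2, 20))).shape)  # (2, 5)
```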
 
 
 - Count-based Model
- Count up the frequency and divide
P_{ML}(x_i | x_{i-n+1}, ..., x_{i-1}) := \dfrac{c(x_{i-n+1}, ..., x_i)}{c(x_{i-n+1}, ..., x_{i-1})}
 - Add smoothing, to deal with zero counts (interpolate with a lower-order model)
P(x_i | x_{i-n+1}, ..., x_{i-1}) := \lambda P_{ML}(x_i | x_{i-n+1}, ..., x_{i-1}) + (1-\lambda) P(x_i | x_{i-n+2}, ..., x_{i-1})
 - Problems
- Cannot share strength among similar words
 - Cannot condition on context with intervening words
 - Cannot handle long-distance dependencies
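A small counting sketch of the two formulas above, for the bigram case on a toy corpus (the interpolation weight λ is arbitrary):

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_ml(word, prev):
    """Maximum likelihood: count up the frequency and divide."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_interp(word, prev, lam=0.8):
    """Linear interpolation with the unigram model handles zero counts."""
    return lam * p_ml(word, prev) + (1 - lam) * unigrams[word] / N

print(p_ml("sat", "cat"))      # 1.0 -- "sat" always follows "cat" in this corpus
print(p_ml("mat", "dog"))      # 0.0 -- never seen, even though it is plausible
print(p_interp("mat", "dog"))  # > 0 thanks to the lower-order (unigram) term
```

The listed problems are visible even here: "cat" and "dog" share no strength, and nothing beyond the immediately preceding words is conditioned on.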
 
 
 - Featurized Models
- Calculate features of the context
 - Based on the features, calculate probabilities
 
 - Linear Models can’t learn feature combinations
 - Neural Language Models
- Strength
- Similar output words get similar rows in the softmax matrix
 - Similar contexts get similar hidden states
 - Word embeddings : Similar input words get similar vectors
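A minimal Bengio-style feed-forward neural LM sketch (my own illustrative sizes), showing where those three strengths live: the embedding table, the hidden layer, and the softmax matrix.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Bengio-style n-gram neural LM: embed the n-1 context words, mix them in a
    hidden layer, then score every vocabulary word with a softmax matrix."""
    def __init__(self, vocab_size=10000, embed_dim=100, hidden=256, context=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # similar input words -> similar vectors
        self.hidden = nn.Linear(context * embed_dim, hidden)   # similar contexts -> similar hidden states
        self.softmax_matrix = nn.Linear(hidden, vocab_size)    # similar output words -> similar rows

    def forward(self, context_ids):              # context_ids: (batch, context)
        e = self.embed(context_ids).flatten(1)   # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.softmax_matrix(h)            # logits; softmax gives P(x_i | context)

probs = FeedForwardLM()(torch.tensor([[11, 42]])).softmax(dim=-1)
print(probs.shape, float(probs.sum()))  # torch.Size([1, 10000]) ~1.0
```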

 
 
 - Sequence Transduction
- Conditioned Language Models
- Input X / Output Y (Text) / Task
- Structured Data / NL Description / NL Generation
 - English / Japanese / Translation
 - Document / Short Description / Summarization
 - Utterance / Response / Response Generation
 - Image / Text / Image Captioning
 - Speech / Transcript / Speech Recognition
 - Dialogue history / Response / Multi-turn Dialogue
 
 
 - Attention
- Encode each word in the sentence into a vector
 - When decoding, perform a linear combination of these vectors, weighted by “attention weights”
 - Use this combination in picking the next word
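A tiny sketch of the attention step described above, using simple dot-product scoring (the actual lecture slides may use a different score function):

```python
import torch

def attend(encoder_states, decoder_state):
    """Weight each encoded source word by its relevance to the current decoder
    state, then mix the vectors into a single context vector."""
    # encoder_states: (src_len, dim), decoder_state: (dim,)
    scores = encoder_states @ decoder_state    # one score per source word
    weights = torch.softmax(scores, dim=0)     # the "attention weights", sum to 1
    context = weights @ encoder_states         # linear combination of the vectors
    return context, weights

enc = torch.randn(6, 128)   # e.g. 6 encoded source words
dec = torch.randn(128)      # current decoder hidden state
context, weights = attend(enc, dec)
print(weights)              # which source words the next output word attends to
```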


 
 
 
 - Learn More
 
 - Knowledge in NLP
- NLP examples
- Q & A
 - Information Extraction
 - Dialogue : Smart Reply, but very limited
- Sentiment Analysis : both a research area in its own right and a component of prediction tasks such as stock-price prediction
 
 - Entity2Topic Module
- By using Entity Linking, one can insert additional information to improve the performance of abstractive summarization models
 - Tips before applying attention
- Remove ambiguity in the information
 - Eliminate unnecessary information
 - Emphasize more important information
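The actual Entity2Topic module (from the NAACL 2018 paper in the references) is more involved; the rough idea, sketched below with my own names and dimensions, is to attention-pool the embeddings of linked entities into a single topic vector that conditions the summarizer.

```python
import torch
import torch.nn as nn

class EntityToTopic(nn.Module):
    """Rough sketch only: attention-pool the embeddings of entities linked in the
    document into one "topic" vector for the summarization decoder to use."""
    def __init__(self, entity_dim=100, doc_dim=128):
        super().__init__()
        self.score = nn.Bilinear(doc_dim, entity_dim, 1)   # relevance of each entity

    def forward(self, entity_vecs, doc_vec):
        # entity_vecs: (num_entities, entity_dim), doc_vec: (doc_dim,)
        q = doc_vec.expand(entity_vecs.size(0), -1)
        weights = torch.softmax(self.score(q, entity_vecs).squeeze(-1), dim=0)
        return weights @ entity_vecs   # topic vector fed to the decoder

topic = EntityToTopic()(torch.randn(5, 100), torch.randn(128))
print(topic.shape)  # torch.Size([100])
```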
 
 
 - Sentence Classification
- Problem
- Context sparsity : features have to be extracted from the single sentence alone
 
 - Domain-dependent solutions (not easy in practice)
- Look at neighboring sentences
 - Refer to other reviews
 - Seek latent topics
 
 - Translations as additional contexts (to address context sparsity; see the sketch below)
- Always available
 - Ambiguity Resolution
 - Extensible Context
 - Better classification in word-vector space
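A rough sketch of this idea (my own simplification, not the IJCAI 2018 model): encode the sentence and its machine translation separately and let the classifier see both views, so the translation can disambiguate words that the source sentence leaves ambiguous.

```python
import torch
import torch.nn as nn

class TranslationContextClassifier(nn.Module):
    """Sketch: encode the sentence and its machine translation separately; the
    translation is an always-available extra context that can resolve ambiguity."""
    def __init__(self, src_vocab=20000, trg_vocab=20000, embed_dim=100,
                 hidden=128, num_classes=5):
        super().__init__()
        self.embed_src = nn.Embedding(src_vocab, embed_dim)
        self.embed_trg = nn.Embedding(trg_vocab, embed_dim)
        self.enc_src = nn.GRU(embed_dim, hidden, batch_first=True)
        self.enc_trg = nn.GRU(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, src_ids, trg_ids):
        _, h_src = self.enc_src(self.embed_src(src_ids))   # (1, batch, hidden)
        _, h_trg = self.enc_trg(self.embed_trg(trg_ids))
        return self.out(torch.cat([h_src[-1], h_trg[-1]], dim=-1))

logits = TranslationContextClassifier()(torch.randint(0, 20000, (2, 10)),
                                         torch.randint(0, 20000, (2, 12)))
print(logits.shape)  # torch.Size([2, 5])
```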
 
 
 - Beyond BERT
- BERT is powerful, but its self-training has limitations
- Human reporting bias (trivial or obvious things tend not to be written down)
 - Weak against adversarial examples
 
 - Considerations
- Use additional information on Entities
 - Knowledge Encoding
 - Knowledge Distillation
 - Augmentation
 - Paraphrase
 - Supervision of Attention
 
 
 - References
- Learning with Limited Data for Multilingual Reading Comprehension, EMNLP 2019
 - MICRON: Multigranular Interaction for Contextualizing Representation in Non-factoid Question Answering, EMNLP 2019
 - Soft Representation Learning for Sparse Transfer, ACL 2019
 - Explanatory and Actionable Debugging for Machine Learning: A TableQA Demonstration, SIGIR 2019 (demo)
 - Counterfactual Attention Supervision, ICDM 2019
 - Categorical Metadata Representation for Customized Text Classification, TACL 2019
 - Cold-Start Aware User and Product Attention for Sentiment Classification, ACL 2018
 - Translations as Additional Contexts for Sentence Classification, IJCAI 2018
 - Entity Commonsense Representation for Neural Abstractive Summarization, NAACL 2018
 
 