Facebook paper: Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer


Mikel Artetxe
Holger Schwenk (Facebook)
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Abstract

This paper introduces an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different language families and written in 28 different scripts.
The system uses a single BiLSTM encoder with a BPE vocabulary shared across all languages, coupled with an auxiliary decoder and trained on parallel corpora.
This enables us to learn a classifier on top of the resulting sentence embeddings using English annotated data only, and to transfer it to any of the 93 languages without any modification.

The system consists of two parts: an encoder and a decoder. The encoder is a language-agnostic BiLSTM that builds the sentence embedding, which is then used to initialize an LSTM decoder through a linear transformation. For a single encoder-decoder pair to handle all languages, there is one requirement: the encoder should ideally not know which language it is reading, so that it learns language-independent representations. To this end, a joint byte-pair encoding (BPE) vocabulary is learned on all of the input corpora.
The decoder, however, has the opposite requirement: it must know which language to generate. Facebook therefore gives the decoder an extra input, a language ID (the Lid in the figure above).
To train such a system, Facebook used 16 NVIDIA V100 GPUs with a batch size of 128,000 tokens, training for 17 epochs over about 5 days.
Evaluated on the cross-lingual natural language inference dataset (XNLI), which covers 14 non-English languages, this multilingual sentence embedding (the "Proposed method" in the figure above) sets a new state of the art for zero-shot transfer on 13 of the 14 languages, with Spanish being the only exception. Facebook also evaluated the system on other tasks, including classification on the MLDoc dataset and bitext mining on BUCC. They further built a test set of aligned sentences in 122 languages from the Tatoeba corpus, which collects example sentences translated by language learners, to demonstrate the algorithm's strength in multilingual similarity search.
http://www.sohu.com/a/2854308…

BPE vocabulary (Byte Pair Encoding): byte pair encoding is a simple data compression technique that iteratively replaces the most frequent pair of adjacent symbols in a sequence with a new symbol that does not occur in the data.
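A minimal sketch of the BPE merge loop in Python, loosely following the classic Sennrich et al. pseudocode (the toy vocabulary and the number of merges are purely illustrative; the paper learns 50k merge operations over the concatenation of all training corpora):

```python
import re
import collections

def get_stats(vocab):
    """Count how often each pair of adjacent symbols occurs in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the chosen symbol pair into a single new symbol everywhere it occurs."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is split into symbols, '</w>' marks the end of a word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                       # 10 merge operations, just for illustration
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent symbol pair
    vocab = merge_vocab(best, vocab)
    print(best)
```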

A new state-of-the-art on zero-shot cross-lingual natural language inference for all the 14 languages in the XNLI dataset but one.

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations. The corpus is made to evaluate how to perform inference in any language (including low-resource ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI is an evaluation benchmark.

Also achieves very competitive results in cross-lingual document classification (MLDoc dataset).
Our sentence embeddings are also strong at parallel corpus mining, establishing a new state-of-the-art in the BUCC shared task for 3 of its 4 language pairs.
We also introduce a new test set of aligned sentences in 122 languages based on the Tatoeba corpus and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.
Our PyTorch implementation, pre-trained encoder and the multilingual test set will be freely available.

Natural language inference
Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise". For example, the premise "A man is playing a guitar." entails the hypothesis "A man is playing an instrument."

MultiNLI
The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.
SciTail
The SciTail entailment dataset consists of 27k sentence pairs. In contrast to the SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist "in the wild". Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.

Public leaderboards for in-genre (matched) and cross-genre (mismatched) evaluation are available, but entries do not correspond to published models.

State-of-the-art results can be seen on the SNLI website.

SNLI: The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.

Introduction

Recent advances in NLP are known to be particularly data hungry, limiting their applicability in many practical scenarios.
An increasingly popular approach to alleviate this issue is to first learn general language representations on unlabeled data, which are then integrated into task-specific downstream systems.
This approach was first popularized by word embeddings, but has recently been superseded by sentence-level representations.
Nevertheless, all these works learn a separate model for each language and are thus unable to leverage information across different languages, greatly limiting their potential performance for low-resource languages.
The goal is universal language-agnostic sentence embeddings, that is, vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task.

Because annotated data is scarce, the field has turned to learning representations from unlabeled data, first as word embeddings and more recently as sentence embeddings. There is also growing interest in cross-lingual and multi-task learning.

  • The hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (e.g. English) to another,
  • And the possibility to handle code-switching.

We achieve this by using a single encoder that can handle multiple languages, so that semantically similar sentences in different languages are close in the resulting embedding space.

Code-switching
Code-switching is a common linguistic phenomenon in which a speaker alternates between two or more languages (or varieties of a language) within a single conversation. It is one of many language-contact phenomena and occurs frequently in the everyday speech of multilingual speakers. Besides spoken conversation, code-switching also appears in writing. Any discussion of code-switching necessarily involves bilingualism, and code-switched data shows mutual influence between the languages at many levels, including phonology and syntax.

Contributions

  • We learn one shared encoder that can handle 93 different languages. All languages are jointly embedded in a shared space, in contrast to most other works, which usually consider separate English/foreign alignments.
  • Cross-lingual 1) natural language inference (XNLI dataset), 2) classification (MLDoc dataset), 3) bitext mining (BUCC dataset) and 4) multilingual similarity search (Tatoeba dataset)

Inference, classification, bitext mining, multilingual similarity search

  • We define a new test set based on the freely available Tatoeba corpus and provide baseline results for 122 languages. We report accuracy for multilingual similarity search on this test set, but the corpus could also be used for MT evaluation.

Tatoeba
English-German Sentence Translation Database (Manythings/Tatoeba): The Tatoeba Project is also run by volunteers and aims to make as many bilingual sentence translations as possible available between many different languages. Manythings.org compiles the data and makes it accessible. http://www.manythings.org/cor…

The Bitext API is another deep language analysis tool, providing data that is easy to export to various data management tools. The platform's products can be used for chatbots and intelligent assistants, customer service and sentiment analysis, as well as several other core NLP tasks. The API focuses on semantics, syntax, lexicons and corpora and supports more than 80 languages. It is also one of the best APIs for automating customer feedback analysis; the company claims its insights reach up to 90% accuracy.
Documentation: https://docs.api.bitext.com/
Demo: http://parser.bitext.com/
Strongly recommended: 20 must-know APIs covering machine learning, NLP and face detection

Related Work

  • Word Embeddings (Distributed Representations of Words and Phrases and their Compositionality)
  • GloVe: Global Vectors for Word Representation

There has been an increasing interest in learning continuous vector representations of longer linguistic units like sentences.
These sentence embeddings are commonly obtained using a Recurrent Neural Network (RNN) encoder, which is typically trained in an unsupervised way over large collections of unlabelled corpora.

Background notes

1. Text representation and a comparison of word vectors
1) What methods are there for text representation?
Below is a summary of text representation, i.e. the ways a piece of text can be expressed mathematically:
Bag-of-words based on one-hot, tf-idf, TextRank, etc.;
Topic models: LSA (SVD), pLSA, LDA;
Static word-vector representations: word2vec, fastText, GloVe;
Contextual (dynamic) word-vector representations: ELMo, GPT, BERT.
2) How should word vectors be understood from the language-model perspective? What is the distributional hypothesis?
The four categories above are the most commonly used text representations in NLP. Text is made of words, and one-hot can be seen as the simplest word vector, but it suffers from the curse of dimensionality and the semantic gap. Building a co-occurrence matrix and factorizing it with SVD also yields word vectors, but the computation is expensive. Early word-vector research usually grew out of language models such as NNLM and RNNLM, whose main goal was language modeling, with word vectors only a by-product.

The distributional hypothesis can be stated in one sentence: words that occur in similar contexts have similar meanings. From it came word2vec and fastText. Although these are still language models at heart, their objective is not the language model itself but the word vectors, and all of their optimizations aim at obtaining word vectors faster and better. GloVe, in turn, builds word vectors from global corpus statistics combined with local context, combining the advantages of LSA and word2vec.
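As a concrete illustration of this family of models, below is a minimal sketch of training skip-gram word vectors with gensim (assuming gensim ≥ 4.0; the toy corpus is invented for illustration):

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be a large monolingual corpus.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects skip-gram (sg=0 would be CBOW); window sets the local context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)          # (50,) -- the learned word vector
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```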
3) What is wrong with traditional word vectors? How is it addressed? What are the characteristics of each type of word vector?
The word vectors above are static representations and cannot resolve polysemy, e.g. "川普" (which can mean either "Trump" or Sichuan-accented Mandarin). To address this, language-model-based contextual representations were introduced: ELMo, GPT and BERT.

Characteristics of each type of word vector:
(1) One-hot representation: curse of dimensionality, semantic gap;
(2) Distributed representations:

Matrix factorization (LSA): uses global corpus statistics, but SVD is computationally expensive;
NNLM/RNNLM-based word vectors: the word vectors are a by-product and training is inefficient;
word2vec, fastText: efficient to optimize, but based on local context windows;
GloVe: based on global corpus statistics, combining the advantages of LSA and word2vec;
ELMo, GPT, BERT: contextual (dynamic) features.

5) What are the differences between word2vec and fastText? (word2vec vs fastText)
1) Both can learn word vectors without supervision; fastText additionally considers subword information when training word vectors;
2) fastText can also do supervised text classification. Its main characteristics (see the sketch after this list):
The architecture is similar to CBOW, but the learning target is the manually annotated class label;
Hierarchical softmax builds a Huffman tree over the output labels, so frequent classes get shorter search paths;
N-grams are introduced to capture word-order features;
Subwords are introduced to handle long words and the out-of-vocabulary problem.
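A minimal sketch of fastText-style character n-gram (subword) extraction, assuming the default 3-to-6 n-gram range; the helper name is illustrative, not part of the fastText API:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract fastText-style character n-grams with boundary markers '<' and '>'."""
    token = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    grams.add(token)  # fastText also keeps the whole word (with boundaries) as a unit
    return grams

# A rare or unseen word still gets a representation by summing its subword vectors.
print(sorted(char_ngrams("where")))   # '<wh', 'whe', 'her', 'ere', 're>', ..., '<where>'
```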

6) How do GloVe, word2vec and LSA compare? (word2vec vs GloVe vs LSA)
1) GloVe vs LSA
LSA (Latent Semantic Analysis) builds word vectors from a co-occurrence matrix, essentially factorizing it with SVD over the whole corpus; however, SVD is computationally expensive;
GloVe can be seen as an efficient matrix-factorization alternative to LSA, using AdaGrad to optimize a least-squares loss;
2) word2vec vs GloVe
word2vec is trained on local context via a sliding window, whereas GloVe uses its sliding window only to build the co-occurrence matrix over the whole corpus, so GloVe must pre-compute co-occurrence statistics. As a result, word2vec supports online training, while GloVe needs fixed, pre-computed corpus statistics.
word2vec is unsupervised in that it needs no manual annotation; GloVe is usually also considered unsupervised, but it does have a label, namely the co-occurrence count $\log(X_{ij})$.
word2vec's loss is essentially a weighted cross-entropy with fixed weights; GloVe's loss is a weighted least-squares loss whose weighting function can be reshaped.
Overall, GloVe can be seen as a global version of word2vec with a different objective and weighting function.
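For reference, the GloVe objective referred to above is the weighted least-squares loss over the co-occurrence matrix $X$, with $V$ the vocabulary size, $w_i$ and $\tilde{w}_j$ the word and context vectors, and $f$ the weighting function from the GloVe paper:

$$
J = \sum_{i,j=1}^{V} f(X_{ij})\left( w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
$$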

What are the differences between ELMo, GPT and BERT? (ELMo vs GPT vs BERT)
The word vectors introduced so far are static and cannot resolve polysemy. Below we compare three contextual word vectors based on language models: ELMo, GPT and BERT, along several dimensions:
(1) Feature extractor: ELMo uses LSTMs, while GPT and BERT use Transformers. Many tasks have shown the Transformer to be a stronger feature extractor than the LSTM. ELMo uses one static embedding layer plus two LSTM layers, so its multi-layer extraction capacity is limited, whereas the Transformers in GPT and BERT can be stacked over many layers and parallelize well.
(2) Unidirectional vs bidirectional language models (see the mask sketch below):
GPT uses a unidirectional language model, while ELMo and BERT use bidirectional ones. However, ELMo is really the concatenation of two unidirectional language models running in opposite directions, which fuses features less effectively than BERT's single, jointly bidirectional model.
GPT and BERT both use the Transformer, which has an encoder-decoder structure: GPT's unidirectional language model uses the decoder side, which only ever sees the incomplete (left-hand) part of the sentence, while BERT's bidirectional language model uses the encoder side and sees the complete sentence.
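To make the unidirectional/bidirectional distinction concrete, here is a small PyTorch sketch (illustrative only, not taken from any of these papers) of the two attention masks: a GPT-style decoder applies a causal mask so each position attends only to earlier tokens, while a BERT-style encoder lets every position attend to the whole sentence:

```python
import torch

seq_len = 5

# GPT-style causal mask: position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# BERT-style full attention: every position may attend to every other position.
full_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())  # lower-triangular matrix of ones
print(full_mask.int())    # all-ones matrix
```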

A deep dive into word2vec

See the Zhihu article comparing word vectors in NLP: word2vec/glove/fastText/elmo/GPT/bert

Motivation

  • The skip-thought model (2015) coupled the encoder with an auxiliary decoder and trained the entire system end-to-end to predict the surrounding sentences over a large collection of books.
  • It was later shown that more competitive results could be obtained by training the encoder over labeled Natural Language Inference (NLI) data (2017).
  • This was recently extended to multitask learning, combining different training objectives such as those of skip-thought, NLI and machine translation (2018).

We introduce auxiliary decoders: separate decoder models which are only used to provide a learning signal to the encoders.
Hierarchical Autoregressive Image Models with Auxiliary Decoders

While the previous methods consider a single language at a time, multilingual representations have attracted considerable attention in recent times.

  • Most research focuses on cross-lingual word embeddings (2017), which are commonly learned jointly from parallel corpora (2015).
  • An alternative approach that is becoming increasingly popular is to train word embeddings independently for each language over monolingual corpora, and then map them to a shared space based on a bilingual dictionary (2013, 2018).
  • Cross-lingual word embeddings are often used to build bag-of-word representations of longer linguistic units by taking their centroid (2012).

While this approach has the advantage of requiring a weak (or even no) cross-lingual signal, it has been shown that the resulting sentence embeddings work rather poorly in practical cross-lingual transfer settings (2018).

  • A more competitive approach that we follow here is to use a sequence-to-sequence encoder-decoder architecture.

The full system is trained end-to-end on parallel corpora akin to neural machine translation: the encoder maps the source sequence into a fixed-length vector representation, which is used by the decoder to create the target sequence.
This decoder is then discarded, and the encoder is kept to embed sentences in any of the training languages.
While some proposals use a separate encoder for each language (2018), sharing a single encoder for all languages also gives strong results.

  • Nevertheless, most existing work is either limited to a few, rather close languages or, more commonly, considers pairwise joint embeddings with English and one foreign language only.
  • To the best of our knowledge, all existing work on learning multilingual representations for a large number of languages is limited to word embeddings, ours being the first paper exploring massively multilingual sentence representations.
  • While all the previous approaches learn a fixed-length representation for each sentence, a recent research line has obtained very strong results using variable-length representations instead, consisting of contextualized embeddings of the words in the sentence.

For that purpose, these methods train either an RNN or self-attentional encoder over unannotated corpora using some form of language modeling. A classifier can then be learned on top of the resulting encoder,
which is commonly further fine-tuned during this supervised training.
Despite the strong performance of these approaches in monolingual settings, we argue that fixed-length approaches provide a more generic, flexible and compatible representation form for our multilingual scenario,
and our model indeed outperforms the multilingual BERT model in zero-shot transfer.

Proposed method

The authors use a single, language-agnostic BiLSTM encoder to build sentence embeddings, coupled with an auxiliary decoder and trained on parallel corpora.

How LASER works
LASER encodes sentences of all languages with a multi-layer BiLSTM, takes its output states, and applies max-pooling to obtain a fixed-dimensional vector, which is then fed to the decoder. During training each sentence is translated into two target languages; the paper reports that a single target language does not work well, two target languages are enough, and not every sentence needs to be covered by both, most of the data being translated is sufficient. In downstream applications the encoder is reused and the decoder is discarded.
They also found that low-resource languages benefit from being trained jointly with high-resource ones.
Zhihu: a comparison of Google BERT and Facebook LASER


As can be seen, sentence embeddings are obtained by applying a max-pooling operation over the output of a BiLSTM encoder.
These sentence embeddings are used to initialize the decoder LSTM through a linear transformation, and are also concatenated to its input embeddings at every time step.
Note that there is no other connection between the encoder and the decoder, as we want all relevant information of the input sequence to be captured by the sentence embedding.
For that purpose, we build a joint byte-pair encoding (BPE) vocabulary with 50k operations, which is learned on the concatenation of all training corpora.
This way, the encoder has no explicit signal on what the input language is, encouraging it to learn language-independent representations.
In contrast, the decoder takes a language ID embedding that specifies the language to generate, which is concatenated to the input and sentence embeddings at every time step.

  • In this paper, we limit our study to a stacked BiLSTM with 1 to 5 layers, each 512-dimensional.
  • The resulting sentence representations (after concatenating both directions) are 1024 dimensional.
  • The decoder always has one layer of dimension 2048. The input embedding size is set to 320, while the language ID embedding has 32 dimensions (see the encoder sketch below).
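A minimal PyTorch sketch of the encoder described above (BPE embeddings, a stacked BiLSTM, then max-pooling over time, using the dimensions listed in this section); this is an illustrative re-implementation, not Facebook's released LASER code:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Language-agnostic encoder: BPE embeddings -> stacked BiLSTM -> max-pooling."""
    def __init__(self, vocab_size=50_000, emb_dim=320, hidden_dim=512, num_layers=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=True, batch_first=True)

    def forward(self, bpe_ids):                 # bpe_ids: (batch, seq_len)
        x = self.embed(bpe_ids)                 # (batch, seq_len, 320)
        h, _ = self.bilstm(x)                   # (batch, seq_len, 2 * 512)
        sent_emb, _ = h.max(dim=1)              # max-pool over time -> (batch, 1024)
        return sent_emb

encoder = SentenceEncoder()
dummy_batch = torch.randint(0, 50_000, (2, 7))  # two toy "sentences" of 7 BPE ids each
print(encoder(dummy_batch).shape)               # torch.Size([2, 1024])
```

The decoder (not sketched here) would receive this 1024-dimensional vector through a linear transformation for its initial state, with the sentence embedding and the language ID embedding concatenated to its inputs at every time step.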

In preceding work, each sentence at the input was jointly translated into all other languages. While this approach was shown to learn high-quality representations,
it poses two obvious drawbacks when trying to scale to a large number of languages.

  • First, it requires an N-way parallel corpus, which is difficult to obtain for all languages.
  • Second, it has a quadratic cost with respect to the number of languages, making training prohibitively slow as the number of languages is increased.

In our preliminary experiments, we observed that similar results can be obtained by using fewer target languages – two seem to be enough. (Note that, if we had a single target language, the only way to train the encoder for that language would be auto-encoding, which we observe to work poorly. Having two target languages avoids this problem.)

At the same time, we relax the requirement for N-way parallel corpora by considering independent alignments with the two target languages, i.e. we do not require each source sentence to be translated into both target languages.
Training minimizes the cross-entropy loss on the training corpus, alternating over all combinations of the languages involved.
For that purpose, we use Adam with a constant learning rate of 0.001 and dropout set to 0.1, and train for a fixed number of epochs. (Implementation based on fairseq.)
We use a total batch size of 128,000 tokens. Unless otherwise specified, we train our model for 17 epochs, which takes about 5 days. Stopping training early decreases the overall performance only slightly.
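As a shorthand for the hyperparameters above, here is a hedged PyTorch sketch of the optimization setup (the real training loop lives in fairseq; the placeholder module below only stands in for the full encoder-decoder so the snippet runs):

```python
import torch
import torch.nn as nn

# Placeholder standing in for the BiLSTM encoder + LSTM decoder defined elsewhere.
model = nn.LSTM(input_size=320, hidden_size=2048, num_layers=1)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # constant learning rate
dropout = nn.Dropout(p=0.1)                                 # dropout of 0.1
criterion = nn.CrossEntropyLoss()                           # token-level cross-entropy

# Training alternates over all combinations of the languages involved, with batches
# of roughly 128,000 tokens, for 17 epochs (about 5 days on 16 V100 GPUs).
```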
