Meituan: Summary of the Winning Method for the SemEval-2022 Structured Sentiment Analysis Cross-Lingual Track

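Meituan's Voice Interaction Department addressed two problems in cross-lingual structured sentiment analysis: the lack of annotated data for low-resource languages and the high cost of optimizing traditional methods. By combining a cross-lingual pretrained language model, multi-task learning, and data augmentation, the team achieved low-cost transfer between languages, and the resulting method won first place in the cross-lingual track of SemEval-2022 structured sentiment analysis.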

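SemEval (International Workshop on Semantic Evaluation) is a series of international natural language processing (NLP) workshops and an authoritative international competition in the NLP field. Its mission is to advance research in semantic analysis and to help create high-quality datasets for a range of increasingly challenging natural language semantics problems. SemEval-2022 (the 16th International Workshop on Semantic Evaluation) comprised 12 tasks covering topics such as idiom detection and embedding, sarcasm detection, and multilingual news similarity. It attracted participants from companies and research institutions including Tesla, Alibaba, Alipay, DiDi, Huawei, ByteDance, and Stanford University.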

Task 10, Structured Sentiment Analysis, belongs to the field of information extraction. It has two subtasks (Monolingual Subtask-1 and Zero-shot Crosslingual Subtask-2) and covers seven datasets in five languages (English, Spanish, Catalan, Basque, and Norwegian). Subtask-1 uses all seven datasets, while Subtask-2 uses three of them (Spanish, Catalan, and Basque). Among the more than thirty participating teams, we placed second in Subtask-1 and first in Subtask-2. The work is described in the paper "MT-Speech at SemEval-2022 Task 10: Incorporating Data Augmentation and Auxiliary Task with Cross-Lingual Pretrained Language Model for Structured Sentiment Analysis", which was accepted at the NAACL 2022 SemEval workshop.

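The goal of Structured Sentiment Analysis (SSA) is to extract the opinions that people express in text about ideas, products, policies, and the like, and to represent them in structured form as opinion tuples Oi (h, t, e, p). Each tuple has four elements: Holder, Target, Expression (the sentiment expression), and Polarity. It captures the sentiment expression that the Holder directs at the Target, together with the corresponding polarity. Opinion tuples can be stored and represented concretely as sentiment graphs (see Figure 1). The figure shows two example sentences, one in English and one in Basque, both conveying that it is not credible for some people to give the new UMUC university five stars. The English sentence contains two opinion tuples: O1 (h, t, e, p) = (Some others, the new UMUC, 5 stars, positive) and O2 (h, t, e, p) = (, them, don’t believe, negative), where the Holder of O2 is empty.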
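
As a minimal sketch of the data structure described above, the two example tuples can be written down in Python. The class name `OpinionTuple` and the use of plain strings for spans are assumptions made for illustration; the shared-task data is not claimed to use this layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OpinionTuple:
    """One opinion tuple (h, t, e, p). Hypothetical container for illustration."""
    holder: str       # empty string when no holder is expressed
    target: str       # empty string when no target is expressed
    expression: str   # the sentiment expression
    polarity: str     # e.g. "positive" or "negative"


# The two tuples of the English example sentence quoted in the text.
o1 = OpinionTuple("Some others", "the new UMUC", "5 stars", "positive")
o2 = OpinionTuple("", "them", "don't believe", "negative")

# A sentence's sentiment graph is represented here simply as its list of tuples.
sentiment_graph: List[OpinionTuple] = [o1, o2]


def has_holder(opinion: OpinionTuple) -> bool:
    """True when the tuple names an explicit holder."""
    return opinion.holder != ""
```

Keeping an empty Holder as an empty string mirrors how O2 is written in the text, where the first slot is simply left blank.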

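The competition has two subtasks:

Monolingual subtask: the language of the test set is known, and labeled data in the same language may be used for training. The overall score is the macro-averaged Sentiment F1 over the seven datasets.
Crosslingual subtask: labeled data in the same language as the test set may not be used for training. The evaluation uses three of the datasets, all in smaller languages: Spanish, Catalan, and Basque.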

Dataset	Language	Description	Link / Reference
MultiBCA	Catalan	Catalan hotel reviews	Barnes, Jeremy, Patrik Lambert, and Toni Badia. 2018. “MultiBooked: A Corpus of Basque and Catalan Hotel Reviews Annotated for Aspect-Level Sentiment Classification.” ArXiv:1803.08614 [Cs], March. http://arxiv.org/abs/1803.08614.
MultiBEU	Basque	Basque hotel reviews	Barnes, Jeremy, Patrik Lambert, and Toni Badia. 2018. “MultiBooked: A Corpus of Basque and Catalan Hotel Reviews Annotated for Aspect-Level Sentiment Classification.” ArXiv:1803.08614 [Cs], March. http://arxiv.org/abs/1803.08614.
OpeNerES	Spanish	Spanish hotel reviews	https://www.researchgate.net/…
OpeNerEN	English	English hotel reviews	https://www.researchgate.net/…
MPQA	English	MPQA2.0 (news wire text in English. http://mpqa.cs.pitt.edu/corpo…)	Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210. https://doi.org/10.1007/s1057…
DSUnis	English	English reviews of online universities	Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and expression level annotation of opinions in user-generated discourse. https://aclanthology.org/P10-…
NoReCFine	Norwegian	Norwegian professional reviews in multiple domains	Øvrelid, Lilja, Petter Mæhlum, Jeremy Barnes, and Erik Velldal. 2020. “A Fine-Grained Sentiment Dataset for Norwegian.” ArXiv:1911.12722 [Cs], April. http://arxiv.org/abs/1911.12722.
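
The zero-shot constraint of the Crosslingual subtask can be sketched with the dataset languages listed in the table above. The mapping and the helper `allowed_training_sets` are illustrative names, not part of the shared-task tooling.

```python
from typing import Dict, List

# Dataset -> language, as listed in the dataset table.
DATASET_LANGUAGE: Dict[str, str] = {
    "MultiBCA": "Catalan",
    "MultiBEU": "Basque",
    "OpeNerES": "Spanish",
    "OpeNerEN": "English",
    "MPQA": "English",
    "DSUnis": "English",
    "NoReCFine": "Norwegian",
}


def allowed_training_sets(test_set: str) -> List[str]:
    """Datasets whose language differs from the language of the test set."""
    test_language = DATASET_LANGUAGE[test_set]
    return sorted(
        name
        for name, language in DATASET_LANGUAGE.items()
        if language != test_language
    )


# Labeled data usable when Basque (MultiBEU) is the zero-shot target.
basque_sources = allowed_training_sets("MultiBEU")
```

Filtering by language rather than by dataset name matters for English, where three datasets share one language and would all be excluded together.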

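The evaluation metric is Sentiment Graph F1 (SF1, an abbreviation that follows ^[5]), which measures how well the predicted opinion tuples overlap with the gold tuples. In addition to the usual true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), the metric defines a weighted true positive (WTP) ^[5] for matching at the opinion-tuple level. When the polarity of a predicted tuple is correct, WTP is the average overlap between the predicted spans and the gold spans of the three elements (Holder, Target, Expression); if several tuples match, the one with the largest average overlap is used (see ^[5] for details). If WTP is greater than 0, TP is 1; otherwise TP is 0. If the polarity is wrong, both WTP and TP are 0. The Holder or Target span of a gold tuple may be empty, in which case the predicted Holder or Target span must also be empty; otherwise the tuple does not count as a match. Matching at the tuple level is therefore a very demanding criterion.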

Tuple precision is computed as $\text{Tuple Precision} = \text{WTP}_P / (\text{TP} + \text{FP})$
Tuple recall is computed as $\text{Tuple Recall} = \text{WTP}_R / (\text{TP} + \text{FN})$
The final Sentiment Graph F1 (SF1) is
$$\text{SF1} = \frac{2 \times \text{Tuple Precision} \times \text{Tuple Recall}}{\text{Tuple Precision} + \text{Tuple Recall}}$$
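
The following is a simplified sketch of the metric as described above, not the official scorer. It assumes spans are given as sets of token indices, that overlap is normalised by the predicted span for precision and by the gold span for recall, and that all tuples passed in belong to one pool; the names `Opinion`, `tuple_overlap`, and `sentiment_graph_f1` are made up for this illustration.

```python
from typing import FrozenSet, List, NamedTuple

Span = FrozenSet[int]  # token indices covered by a span; empty set = no span


class Opinion(NamedTuple):
    holder: Span
    target: Span
    expression: Span
    polarity: str


def tuple_overlap(a: Opinion, b: Opinion) -> float:
    """Average overlap of a's spans with b's, normalised by a's span sizes.

    Returns 0.0 when the polarities differ or when a span is empty on one
    side only (the empty-Holder / empty-Target rule from the text).
    """
    if a.polarity != b.polarity:
        return 0.0
    scores = []
    for span_a, span_b in zip(a[:3], b[:3]):
        if not span_a and not span_b:
            scores.append(1.0)      # both empty: counted as agreement
        elif not span_a or not span_b:
            return 0.0              # emptiness mismatch: no match
        else:
            scores.append(len(span_a & span_b) / len(span_a))
    return sum(scores) / len(scores)


def weighted_score(items: List[Opinion], references: List[Opinion]) -> float:
    """Sum of best overlaps (WTP) divided by the number of items."""
    if not items:
        return 0.0
    best = [
        max((tuple_overlap(x, r) for r in references), default=0.0)
        for x in items
    ]
    return sum(best) / len(items)


def sentiment_graph_f1(pred: List[Opinion], gold: List[Opinion]) -> float:
    precision = weighted_score(pred, gold)  # WTP_P / (TP + FP)
    recall = weighted_score(gold, pred)     # WTP_R / (TP + FN)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [Opinion(frozenset({0, 1}), frozenset({4, 5, 6}), frozenset({8, 9}), "positive")]
pred = [Opinion(frozenset({0, 1}), frozenset({4, 5}), frozenset({8, 9}), "positive")]
score = sentiment_graph_f1(pred, gold)  # precision 1, recall 8/9, F1 = 16/17
```

In this toy case the predicted Target misses one of three gold tokens. Precision stays at 1 because every predicted token is correct, while recall drops to 8/9, which illustrates why partially correct spans still cost score under this metric.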

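The mainstream approach to structured sentiment analysis is a pipeline that performs the Holder, Target, and Expression extraction subtasks separately and then classifies the sentiment. Such an approach cannot capture the dependencies among the subtasks, and errors propagate from one stage to the next.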

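To address this, Barnes et al. (2021) ^[5] used graph-based dependency parsing to capture the dependencies among the elements of an opinion tuple: the sentiment holder, target, and expression are nodes, and the relations between them are arcs. At the time, this model achieved the best results on the SSA task. However, the method of Barnes et al. (2021) ^[5] still has some problems. First, the knowledge in the pretrained language model (PLM) is not fully exploited. Because Barnes et al. (2021) ^[5] did not handle the mapping between graph relations and tokens well, the PLM could only be used to generate token embeddings and could not be trained jointly with the rest of the model.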

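In fact, cross-lingual PLMs contain rich information about the interactions between languages. Second, such data-driven models depend on large amounts of annotated data, but in real scenarios annotated data is often scarce or missing altogether. In this task, for example, the training set of MultiBEU (Barnes et al., 2018)^[4] has only 1063 samples, and the training set of MultiBCA (Barnes et al., 2018)^[4] has only 1174 samples. The cross-lingual subtask also forbids the use of training data in the target language, which severely limits the performance of that approach.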

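To solve these problems, we propose a unified end-to-end SSA model (Figure 2). The PLM serves as the model backbone and takes part in the whole end-to-end training, and data augmentation methods and auxiliary tasks are used to substantially improve results in the cross-lingual zero-shot setting.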

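Specifically, we use XLM-RoBERTa (Conneau and Lample, 2019; Conneau et al., 2019)^[10,11] as the backbone encoder to make full use of its existing multilingual and cross-lingual knowledge. A BiLSTM^[12] strengthens sequence decoding, and a final bilinear attention matrix models the dependency graph, from which the opinion tuples are decoded. To alleviate the shortage of annotated data, we use two data augmentation methods. The first adds in-domain annotated data for the same task during training. The second uses XLM-RoBERTa as a masked language model (MLM) (Devlin et al., 2018)^[13] to generate augmented samples.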
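
The bilinear scoring step can be illustrated with a small pure-Python sketch: the score of an arc from token i to token j is computed as the product of state i, a weight matrix, and state j. The toy vectors and weights, the zero threshold, and the function names are assumptions for illustration; in the model described above the states would come from XLM-RoBERTa followed by the BiLSTM, the weights would be learned, and arcs would also carry labels.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def bilinear_scores(states: List[Vector], weight: Matrix) -> Matrix:
    """scores[i][j] = states[i]^T * weight * states[j]."""
    # Pre-compute weight * states[j] for every token j.
    projected = [
        [sum(w * x for w, x in zip(row, state)) for row in weight]
        for state in states
    ]
    n = len(states)
    return [
        [sum(a * b for a, b in zip(states[i], projected[j])) for j in range(n)]
        for i in range(n)
    ]


def predict_arcs(scores: Matrix, threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Keep every arc (i, j), i != j, whose score exceeds the threshold."""
    return [
        (i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
        if i != j and score > threshold
    ]


# Toy example: three tokens with 2-dimensional states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[0.0, 1.0], [-1.0, 0.0]]
scores = bilinear_scores(states, weight)
arcs = predict_arcs(scores)
```

Because the weight matrix is not symmetric, the score of arc (i, j) generally differs from that of (j, i), which is what lets a bilinear scorer represent directed relations such as Expression to Target.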

In addition, we add two auxiliary tasks: 1) sequence labeling, which predicts the Holder, Target, and Expression spans in the text, and 2) polarity classification. Neither auxiliary task requires any additional annotation.
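
One way to see why no extra annotation is needed: the labels for both auxiliary tasks can be derived from the opinion tuples themselves. The sketch below assumes a simplified annotation format with token offsets (end exclusive) and a BIO tagging scheme; the dictionary keys, the function names, and the toy token list are hypothetical, and the source does not specify how the polarity classification target is defined.

```python
from typing import Dict, List

ROLES = ("Holder", "Target", "Expression")


def bio_labels(num_tokens: int, opinions: List[Dict[str, object]]) -> List[str]:
    """Derive one BIO tag per token from the spans of the opinion tuples."""
    labels = ["O"] * num_tokens
    for opinion in opinions:
        for role in ROLES:
            span = opinion.get(role)
            if span is None:        # empty Holder or Target
                continue
            start, end = span
            for i in range(start, end):
                labels[i] = ("B-" if i == start else "I-") + role
    return labels


def polarity_labels(opinions: List[Dict[str, object]]) -> List[str]:
    """Polarity of each tuple, usable as a classification target."""
    return [str(opinion["Polarity"]) for opinion in opinions]


# Toy token list built around the first example tuple from the text.
tokens = ["Some", "others", "give", "the", "new", "UMUC", "5", "stars"]
opinions = [
    {"Holder": (0, 2), "Target": (3, 6), "Expression": (6, 8), "Polarity": "positive"}
]
tags = bio_labels(len(tokens), opinions)
```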

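Many pretrained models are currently available as a backbone, for example Multilingual BERT (mBERT) (Devlin et al., 2018)^[13], XLM-RoBERTa (Conneau et al., 2019)^[10], and infoXLM (Chi et al., 2021)^[9]. We chose XLM-RoBERTa. The Monolingual subtask involves corpora in five languages, and the Crosslingual subtask is a cross-lingual zero-shot problem, so both benefit from XLM-RoBERTa's multilingual training text and from the translation language modeling (TLM) training objective.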

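With the TLM and masked language modeling (MLM) objectives, models in the XLM family outperform mBERT, which is trained on a multilingual corpus with the MLM objective only. In addition, XLM-RoBERTa offers a Large version with more parameters and more training data, which gives it better performance on downstream tasks. We did not use infoXLM because it focuses on sentence-level classification objectives and is not well suited to the structured prediction task here.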

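To verify the effectiveness of the cross-lingual pretrained language model XLM-RoBERTa, we compared the following configurations, the first three of which are baselines: 1) w2v + BiLSTM, which combines word2vec (Mikolov et al., 2013)^[20] word embeddings with BiLSTMs; 2) mBERT, multilingual BERT (Devlin et al., 2018)^[13]; 3) mBERT + BiLSTM; 4) XLM-RoBERTa + BiLSTM. Table 1 shows that XLM-RoBERTa + BiLSTM achieves the best performance on all benchmarks, with an average score 6.7% higher than the strongest baseline (mBERT + BiLSTM). Adding the BiLSTM improves performance by 3.7%, which suggests that the BiLSTM layer captures sequential information that benefits the encoding of sequences (Cross and Huang, 2016)^[12].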

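We use the officially released development set as the test set and randomly split the original training set into a new training set and a new development set, keeping the new development set the same size as the official one.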
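
The split described above can be sketched as follows. The function name, the fixed seed, and the toy sizes are assumptions; the source does not state how the random split was seeded.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(
    samples: Sequence[T], dev_size: int, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly hold out `dev_size` samples of the original training set."""
    if not 0 <= dev_size <= len(samples):
        raise ValueError("dev_size must be between 0 and len(samples)")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # local RNG, reproducible
    return shuffled[dev_size:], shuffled[:dev_size]


# Toy usage: 100 training samples, official development set of 15 samples.
official_dev_size = 15
train, dev = split_train_dev(list(range(100)), official_dev_size)
```

Matching the held-out size to the official development set keeps model selection on a set comparable in size to the one later used for testing.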

Data augmentation (DA1): merging in-domain data

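If M datasets in different languages belong to the same domain, they can be merged into one large training set to improve results on each individual dataset. Four datasets in this evaluation consist of hotel reviews: MultiBEU, MultiBCA, OpeNerES, and OpeNerEN (Agerri et al., 2013)^[1]. We combined these same-domain datasets during training, which improves results on each of them. We also added a Portuguese hotel review dataset (BOTE-rehol) (Barros and Bona, 2021)^[7]. We observed that these datasets share some similar characteristics even though their languages differ.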

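Specifically, the languages of these datasets share similar words for some of the same objects and concepts, as seen from the similarity of their Latin-script spellings. For example, Catalan and Spanish both write "hotel" exactly as English does, and Basque uses the similar word "hotela". In addition, people show the same sentiment tendencies in hotel reviews, for instance praising good service and clean, tidy rooms. MultiBEU is the smallest of these datasets, so it stands to gain the most from additional augmented data.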
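
A minimal sketch of the DA1 merge is shown below. The corpus layout (a dictionary from dataset name to a list of sample dictionaries), the `source` field, and the `exclude` parameter are assumptions for illustration; `exclude` stands in for leaving out the target dataset in the zero-shot setting.

```python
import random
from typing import Dict, FrozenSet, List

# Hotel-review datasets named in the text, including the added Portuguese one.
HOTEL_DATASETS = ["MultiBEU", "MultiBCA", "OpeNerES", "OpeNerEN", "BOTE-rehol"]


def merge_in_domain(
    corpora: Dict[str, List[dict]],
    names: List[str],
    exclude: FrozenSet[str] = frozenset(),
    seed: int = 0,
) -> List[dict]:
    """Concatenate same-domain training sets and shuffle the result."""
    merged = []
    for name in names:
        if name in exclude:
            continue
        for sample in corpora[name]:
            # Keep track of where each sample came from.
            merged.append({**sample, "source": name})
    random.Random(seed).shuffle(merged)
    return merged


# Toy corpora with two placeholder samples per dataset.
toy_corpora = {name: [{"id": f"{name}-{i}"} for i in range(2)] for name in HOTEL_DATASETS}
monolingual_pool = merge_in_domain(toy_corpora, HOTEL_DATASETS)
zero_shot_pool = merge_in_domain(toy_corpora, HOTEL_DATASETS, exclude=frozenset({"MultiBEU"}))
```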

Data augmentation (DA2): generating new samples with a masked language model

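During pretraining, a masked language model (MLM) randomly replaces tokens of the original text with the [MASK] token, and the training objective is to predict the original tokens at the [MASK] positions. For each sample with a valid opinion tuple, we randomly mask a small fraction of the tokens in the training text and use an XLM-RoBERTa model that has been further pretrained on the task datasets to generate new tokens at the masked positions, which yields new labeled samples. Note that masking must not be applied to Expression spans, because the model might generate words whose polarity differs from the original label.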
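
The masking step of DA2 can be sketched as below; only the choice of positions is shown, and filling the masked positions with a masked language model is left out. The 15% ratio, the word-level tokens, the `"<mask>"` string, and the function name are assumptions, since the source only says that a small fraction of tokens is masked.

```python
import random
from typing import List, Tuple


def mask_tokens(
    tokens: List[str],
    expression_spans: List[Tuple[int, int]],
    mask_ratio: float = 0.15,       # assumed value; the text says "a small fraction"
    mask_token: str = "<mask>",     # assumed mask string for a RoBERTa-style tokenizer
    seed: int = 0,
) -> List[str]:
    """Mask random tokens while leaving Expression spans untouched."""
    protected = set()
    for start, end in expression_spans:     # end is exclusive
        protected.update(range(start, end))
    candidates = [i for i in range(len(tokens)) if i not in protected]
    if not candidates:
        return list(tokens)
    k = min(len(candidates), max(1, round(mask_ratio * len(tokens))))
    chosen = set(random.Random(seed).sample(candidates, k))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]


# Toy sentence; the Expression "5 stars" occupies positions 6 and 7.
tokens = ["Some", "others", "give", "the", "new", "UMUC", "5", "stars"]
masked = mask_tokens(tokens, expression_spans=[(6, 8)])
```

Since tokens are replaced one for one, the Holder, Target, and Expression offsets of the original sample stay valid for the generated sample, which is what allows the labels to be reused. Subword tokenization would need extra care that this sketch ignores.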

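Tables 3 and 4 show that both data augmentation methods help: performance improves on almost every benchmark. The gain is especially large on the Crosslingual subtask, presumably because in the zero-shot setting the model never sees the text or labels of training samples from the target dataset during training. DA2 improves results on the Crosslingual subtask but has little effect on the Monolingual subtask, presumably because the Monolingual model has already seen training samples from the same dataset during training.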

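Compared with the other teams, we have an advantage in the average score and on several individual datasets. On the zero-shot datasets of Subtask-2 (Table 7), our average score is 5.2pp higher than that of the second-place team. In Subtask-1 (Table 6), we rank first on several datasets (MultiBEU, MultiBCA, OpeNerES, and OpeNerEN), and our average score is only 0.3pp behind the first-place team.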

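In this evaluation we mainly explored the task of structured sentiment analysis. To address the lack of interaction between data in different languages and the shortage of annotated resources, we used a cross-lingual pretrained language model together with two data augmentation methods and two auxiliary tasks. Experiments confirmed the effectiveness of our method and model, which placed second in Subtask-1 (Table 6) and first in Subtask-2 (Table 7) of SemEval-2022 Task 10, Structured Sentiment Analysis. In future work we will continue to explore more effective multilingual and cross-lingual resources and better ways of applying cross-lingual pretrained models. We are also trying to apply the techniques from the competition to concrete Meituan businesses, such as the intelligent customer service and intelligent outbound-call robots of the Voice Interaction Department, as a reference for improving automated handling and user satisfaction.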

[1] Rodrigo Agerri, Montse Cuadros, Sean Gaines, and German Rigau. 2013. OpeNER: Open polarity enhanced named entity recognition. In Sociedad Española para el Procesamiento del Lenguaje Natural, volume 51, pages 215–218.
[2] Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, and Eneko Agirre. 2020. Give your text representation models some love: the case for basque. In Proceedings of the 12th International Conference on Language Resources and Evaluation.
[3] Jordi Armengol-Estapé, Casimiro Pio Carrino, Carlos Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor Gonzalez-Agirre, Maite Melero, and Marta Villegas. 2021. Are multilingual models the best choice for moderately underresourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4933–4946, Online. Association for Computational Linguistics.
[4] Jeremy Barnes, Toni Badia, and Patrik Lambert. 2018. MultiBooked: A corpus of Basque and Catalan hotel reviews annotated for aspect-level sentiment classification. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation(LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
[5] Jeremy Barnes, Robin Kurtz, Stephan Oepen, Lilja Øvrelid, and Erik Velldal. 2021. Structured sentiment analysis as dependency graph parsing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3387–3402, Online. Association for Computational Linguistics.
[6] Jeremy Barnes, Laura Ana Maria Oberländer, Andrey Kutuzov, Enrica Troiano, Jan Buchmann, Rodrigo Agerri, Lilja Øvrelid, Erik Velldal, and Stephan Oepen. 2022. SemEval-2022 task 10: Structured sentiment analysis. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Seattle. Association for Computational Linguistics.
[7] José Meléndez Barros and Glauber De Bona. 2021. A deep learning approach for aspect sentiment triplet extraction in portuguese. In Brazilian Conference on Intelligent Systems, pages 343–358. Springer.
[8] José Cañete, Gabriel Chaperon, Rodrigo Fuentes, JouHui Ho, Hojin Kang, and Jorge Pérez. 2020. Spanish pre-trained bert model and evaluation data. In PML4DC at ICLR 2020.
[9] Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and M. Zhou. 2021. Infoxlm: An information-theoretic framework for cross-lingual language model pre-training. In NAACL.
[10] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
[11] Alexis Conneau and Guillaume Lample. 2019. Crosslingual language model pretraining. Advances in neural information processing systems, 32.
[12] James Cross and Liang Huang. 2016. Incremental parsing with minimal features using bi-directional lstm. ArXiv, abs/1606.06406.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[14] Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
[15] E. Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional lstm feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.
[16] Robin Kurtz, Stephan Oepen, and Marco Kuhlmann. 2020. End-to-end negation resolution as graph parsing. In IWPT.
[17] Xin Li, Lidong Bing, Piji Li, and Wai Lam. 2019. A unified model for opinion target extraction and target sentiment prediction. ArXiv, abs/1811.05082.
[18] Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167.
[19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
[20] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR.
[21] Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open domain targeted sentiment. In EMNLP.
[22] Stephan Oepen, Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajic, Daniel Hershcovich, Bin Li, Timothy J. O’Gorman, Nianwen Xue, and Daniel Zeman. 2020. Mrp 2020: The second shared task on crossframework and cross-lingual meaning representation parsing. In CONLL.
[23] Lilja Ovrelid, Petter Maehlum, Jeremy Barnes, and Erik Velldal. 2020. A fine-grained sentiment dataset for norwegian. In LREC.
[24] Lilja Øvrelid, Petter Mæhlum, Jeremy Barnes, and Erik Velldal. 2020. A fine-grained sentiment dataset for Norwegian. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5025– 5033, Marseille, France. European Language Resources Association.
[25] Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Foundations and Trends® in information retrieval, 2(1–2):1–135.
[26] Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Haris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect based sentiment analysis. In COLING 2014.
[27] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
[28] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.
[29] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681.
[30] Cigdem Toprak, Niklas Jakob, and Iryna Gurevych. 2010. Sentence and expression level annotation of opinions in user-generated discourse. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 575–584, Uppsala, Sweden. Association for Computational Linguistics.
[31] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. ArXiv, abs/1706.03762.
[32] Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.
[33] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
[34] Lu Xu, Hao Li, Wei Lu, and Lidong Bing. 2020. Position-aware tagging for aspect sentiment triplet extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2339–2349, Online. Association for Computational Linguistics.
[35] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS.
[36] Elena Zotova, Rodrigo Agerri, Manuel Nunez, and German Rigau. 2020. Multilingual stance detection: The catalonia independence corpus. arXiv preprint arXiv:2004.00050.

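This article was produced by the Meituan technical team, and its copyright belongs to Meituan. You are welcome to repost or use its content for non-commercial purposes such as sharing and discussion; please credit it as “Content reposted from the Meituan technical team”. Commercial reposting or use is not permitted without authorization. For any commercial use, please email tech@meituan.com to request authorization.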

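1. Background

2. Task overview

Data

Evaluation metric

3. Existing methods and their problems

4. Our method

5. Implementation and experimental analysis

5.1 Model selection

5.2 Data augmentation

5.3 Auxiliary tasks

6. Comparison with other teams

7. Conclusion

9. References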