Paper Notes: [ACL 2016] End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

Date: 2022-09-13 10:50:01

Paper: Ma X., Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. 2016. Published at ACL 2016. I find this a very clearly written paper; it tackles the sequence labeling problem. Below are my reading notes.

1. Introduction

The paper addresses sequence labeling, which applies to tasks such as POS tagging and NER.

Traditional approaches

Most of the effective traditional models are linear statistical models, including HMMs and CRFs.

Problems:
  • heavy reliance on hand-crafted features
  • dependence on task-specific resources

This makes such models difficult to adapt to new tasks or new domains.

Recent approaches

In recent years, non-linear neural models that take word embeddings as input have been quite successful: feed-forward networks, recurrent neural networks (RNNs), LSTMs, and GRUs have all achieved competitive results.

Problems:

These models use word embeddings as additional parameters rather than as a replacement for hand-crafted features; the authors claim that relying on embeddings alone hurts performance: "Their performance drops rapidly when the models solely depend on neural embeddings." I am not sure exactly what "solely depend on" refers to here, since no reference is given.

Contributions

  1. a novel neural network architecture for linguistic sequence labeling;
  2. empirical evaluations on benchmark data sets for two classic NLP tasks;
  3. state-of-the-art performance with a truly end-to-end system.

The authors emphasize the value of being end-to-end:

  • no task-specific resources,
  • no feature engineering,
  • no data pre-processing beyond pre-trained word embeddings on unlabeled corpora.

2. Method

Step 1: Use a character-level CNN to obtain a character-level representation of each word.
[Figure: character-level CNN for computing word representations from characters]
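
To make this step concrete, here is a minimal PyTorch sketch of a character-level CNN (the sizes follow the paper's hyper-parameters, 30-dimensional character embeddings and 30 filters of width 3, but the module itself is my own illustration, not the authors' code):

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level CNN: maps one word (a sequence of characters) to a fixed-size vector."""
    def __init__(self, num_chars, char_dim=30, num_filters=30, window=3):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=window, padding=window // 2)

    def forward(self, char_ids):
        # char_ids: (batch_of_words, max_word_len) integer character indices
        x = self.char_emb(char_ids)      # (B, L, char_dim)
        x = x.transpose(1, 2)            # (B, char_dim, L), as expected by Conv1d
        x = torch.relu(self.conv(x))     # (B, num_filters, L)
        x, _ = x.max(dim=2)              # max-over-time pooling -> (B, num_filters)
        return x
```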

Step 2: Concatenate the character-level representation from Step 1 with the pre-trained word embedding, and feed the result into a bi-directional LSTM to obtain a representation of each position. Note that dropout layers are applied to both the input and the output of the BLSTM (not shown in the figure below).
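
Continuing the sketch above (it reuses the CharCNN module from Step 1 and follows the paper's settings of 100-dimensional word vectors, LSTM state size 200, and dropout 0.5; the wiring details are illustrative only):

```python
import torch
import torch.nn as nn

class WordBiLSTM(nn.Module):
    """Concatenate pre-trained word embeddings with char-CNN features,
    apply dropout, run a bi-directional LSTM, and apply dropout again."""
    def __init__(self, pretrained_weights, char_cnn, hidden=200, dropout=0.5):
        super().__init__()
        # pre-trained vectors are fine-tuned during training (freeze=False)
        self.word_emb = nn.Embedding.from_pretrained(pretrained_weights, freeze=False)
        self.char_cnn = char_cnn                      # an instance of the CharCNN sketch above
        input_dim = pretrained_weights.size(1) + 30   # word dim + 30 char-CNN filters
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T) word indices; char_ids: (B, T, max_word_len) character indices
        B, T, L = char_ids.shape
        char_feats = self.char_cnn(char_ids.view(B * T, L)).view(B, T, -1)
        x = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        x = self.dropout(x)          # dropout on the BLSTM input
        h, _ = self.lstm(x)          # h: (B, T, 2 * hidden)
        return self.dropout(h)       # dropout on the BLSTM output
```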

Step 3: Feed the BLSTM outputs from Step 2 into a CRF layer to make the final prediction.
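
The CRF layer scores a whole label sequence by combining per-position (emission) scores derived from the BLSTM with a learned label-transition matrix, and decoding picks the highest-scoring sequence with the Viterbi algorithm. A minimal NumPy sketch of the decoding step (training additionally needs the forward algorithm for the partition function, which is omitted here):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (T, K) per-position label scores from the BLSTM (via a linear layer).
    transitions: (K, K), transitions[i, j] = score of moving from label i to label j.
    Returns the highest-scoring label sequence as a list of label indices."""
    T, K = emissions.shape
    score = emissions[0].copy()                  # best score of each label at position 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in label i at t-1, then moving to label j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]                 # best final label
    for t in range(T - 1, 0, -1):                # follow the back-pointers
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```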

The overall architecture is shown below:

[Figure: overall BLSTM-CNNs-CRF architecture]

3. Training

Since the model includes a CNN component, a GPU is naturally required; the authors report using a GeForce GTX TITAN X GPU, with training taking about 12 hours for POS tagging and 8 hours for NER. In this part the authors describe the parameter settings and the optimization algorithm, which involve quite a few tricks. The notes below combine the paper's Section 3 (training) with its Section 4 (experiments).

Parameter settings
Word embeddings
  • GloVe: 100-dimensional embeddings trained on 6 billion words from Wikipedia and web text (Pennington et al., 2014). http://nlp.stanford.edu/projects/glove/
  • Senna: 50-dimensional embeddings trained on Wikipedia and the Reuters RCV-1 corpus (Collobert et al., 2011). http://ronan.collobert.com/senna/
  • Word2Vec: 300-dimensional embeddings trained on 100 billion words from Google News (Mikolov et al., 2013). https://code.google.com/archive/p/word2vec/
  • Random: 100-dimensional embeddings uniformly sampled from the range $[-\sqrt{\frac{3}{\mathrm{dim}}}, +\sqrt{\frac{3}{\mathrm{dim}}}]$, where dim is the embedding dimension.
Character embeddings
  • Random: 30-dimensional embeddings uniformly sampled from the range $[-\sqrt{\frac{3}{\mathrm{dim}}}, +\sqrt{\frac{3}{\mathrm{dim}}}]$, where dim is the embedding dimension.
Weight matrices W and bias vectors b
  • Random: matrix parameters are randomly initialized with uniform samples from $[-\sqrt{\frac{6}{r+c}}, +\sqrt{\frac{6}{r+c}}]$, where r and c are the number of rows and columns of the matrix (see the initialization sketch below).
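
A quick NumPy sketch just to make the two initialization ranges concrete (the function names are mine):

```python
import numpy as np

def uniform_embedding(vocab_size, dim):
    # embedding rows ~ U[-sqrt(3/dim), +sqrt(3/dim)]
    bound = np.sqrt(3.0 / dim)
    return np.random.uniform(-bound, bound, size=(vocab_size, dim))

def uniform_matrix(rows, cols):
    # weight matrices ~ U[-sqrt(6/(r+c)), +sqrt(6/(r+c))] (Glorot/Xavier uniform)
    bound = np.sqrt(6.0 / (rows + cols))
    return np.random.uniform(-bound, bound, size=(rows, cols))

word_emb = uniform_embedding(10000, 100)   # randomly initialized 100-d word embeddings
char_emb = uniform_embedding(100, 30)      # randomly initialized 30-d character embeddings
W = uniform_matrix(400, 200)               # e.g. one weight matrix in the network
```
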
Findings

GloVe performs best overall, with a clear advantage on NER; Senna is only slightly behind and also performs well.
[Table: performance with different pre-trained word embeddings]

Optimization
Gradient descent variants
  • SGD: batch size 10, momentum 0.9, initial learning rate η0 (η0 = 0.01 for POS tagging and 0.015 for NER). To reduce the effect of "gradient exploding", gradients are clipped at 5.0 (Pascanu et al., 2012). This worked best in their experiments; see the optimizer sketch after this list.
  • AdaDelta:(Zeiler, 2012)
  • Adam: (Kingma and Ba, 2014)
  • RMSProp: (Dauphin et al., 2015)
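
A hedged PyTorch sketch of the SGD setup described above (the learning rate, momentum, and clipping threshold follow the paper; the stand-in model and step function are my own illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(400, 45)   # stand-in for the full BLSTM-CNN-CRF model (45 = number of PTB POS tags)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # lr=0.015 for NER, batch size 10

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # clip the gradient norm at 5.0 to reduce the effect of exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```
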
Early Stopping

We use early stopping (Giles, 2001; Graves et al., 2013) based on performance on validation sets. The “best” parameters appear at around 50 epochs, according to our experiments.
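
A minimal early-stopping loop could look like the following (the `train_one_epoch` and `evaluate_dev` helpers are hypothetical placeholders, and the patience value is my own choice, not from the paper):

```python
def train_with_early_stopping(model, max_epochs=100, patience=10):
    best_score, best_state, stale_epochs = -1.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical helper: one pass over the training set
        score = evaluate_dev(model)            # hypothetical helper: accuracy / F1 on the dev set
        if score > best_score:
            best_score, stale_epochs = score, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break                          # dev performance stopped improving
    model.load_state_dict(best_state)          # restore the best dev-set checkpoint
    return model
```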

Fine Tuning

For each of the embeddings, we fine-tune initial embeddings, modifying them during gradient updates of the neural network model by back-propagating gradients. The effectiveness of this method has been previously explored in sequential and structured prediction problems (Collobert et al., 2011; Peng and Dredze, 2015).

Dropout Training

To mitigate overfitting, we apply the dropout method (Srivastava et al., 2014) to regularize our model. We fix dropout rate at 0.5 for all dropout layers through all the experiments. We obtain significant improvements on model performance after using dropout.

Tuning Hyper-Parameters

We tune the hyper-parameters on the development sets by random search. Due to time constraints it is infeasible to do a random search across the full hyper-parameter space. Thus, for the tasks of POS tagging and NER we try to share as many hyper-parameters as possible. Note that the final hyper-parameters for these two tasks are almost the same, except for the initial learning rate. We set the state size of the LSTM to 200; tuning this parameter did not significantly impact performance. For the CNN, we use 30 filters with window length 3.
[Table: final hyper-parameters for both tasks]
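
For illustration, a random search over a few hyper-parameters might be sketched like this (the search space and the `train_and_eval` helper are my own assumptions, not the paper's exact procedure):

```python
import random

def random_search(num_trials=20):
    """Sample a few configurations at random and keep the best one on the dev set."""
    best_score, best_cfg = -1.0, None
    for _ in range(num_trials):
        cfg = {
            "lr": random.choice([0.005, 0.01, 0.015, 0.02]),
            "dropout": random.choice([0.3, 0.5, 0.7]),
            "lstm_state": random.choice([100, 200, 300]),
            "cnn_filters": random.choice([20, 30, 50]),
        }
        score = train_and_eval(cfg)    # hypothetical helper: trains a model, returns dev F1 / accuracy
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg
```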

4. Experiments

Data Sets

  • POS Tagging: Wall Street Journal (WSJ) portion of Penn Treebank (PTB) (Marcus et al., 1993)
  • NER: English data from CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003)

Dataset statistics are shown below:
[Table: dataset statistics]

Results

As the table below shows, the model achieves state-of-the-art results on both tasks.
[Table: comparison with previous state-of-the-art systems]

Which components are necessary?

The model has three main components: CNN, BLSTM, and CRF. Whether each one can be dropped or replaced can be judged from the authors' comparison experiments below:

  • Comparing rows 1-2: BRNN is worse than BLSTM.
  • Comparing rows 2-4: BLSTM alone is not the best.
  • Comparing rows 3-4: BLSTM-CNN improves over BLSTM alone, but is still not the best.
  • Row 4 (I believe this is a typo in the paper and it should read BLSTM-CNN-CRF): the combination proposed in the paper achieves the best results on both tasks.
  • However, the authors do not report results for a CRF alone. Logically speaking, perhaps a plain CRF could already do well and the BLSTM-CNN part could be dropped; common sense suggests a CRF alone would not perform well, but the argument is not fully rigorous.

[Table: ablation results for BRNN, BLSTM, BLSTM-CNN, and BLSTM-CNN-CRF]

Effect of various settings

Word Embeddings

Findings:

  • NER relies more heavily on pretrained embeddings than POS tagging.
  • Word2Vec was likely trained in a case-sensitive way, which may explain its weaker results on NER.

But on closer inspection:

  • Word2Vec should be retrained in a case-insensitive way before drawing this conclusion.
  • The embeddings also differ in dimensionality and training corpora, so the comparison is not entirely rigorous.

[Table: results with different pre-trained word embeddings]

Dropout

Adding dropout layers improves results on both tasks.
[Table: results with and without dropout]

OOV Error Analysis

Since the model uses a character-level CNN, it is naturally well suited to handling out-of-vocabulary (OOV) words. The paper breaks the OOV issue down further into the following categories (a small classification sketch follows the list below):

  • in-vocabulary words(IV)
  • out-of-training-vocabulary words (OOTV)
  • out-of-embedding-vocabulary words (OOEV)
  • out-of-both-vocabulary words (OOBV)
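
These categories depend only on the training vocabulary and the embedding vocabulary, so they are easy to compute; a small sketch (variable and function names are mine):

```python
def word_category(word, train_vocab, embedding_vocab):
    """Classify a test word by its presence in the training and embedding vocabularies."""
    in_train = word in train_vocab
    in_emb = word in embedding_vocab
    if in_train and in_emb:
        return "IV"      # in both vocabularies
    if in_emb:
        return "OOTV"    # only in the embedding vocabulary
    if in_train:
        return "OOEV"    # only in the training vocabulary
    return "OOBV"        # in neither
```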

The authors first count how often each category occurs:

[Table: proportions of IV, OOTV, OOEV, and OOBV words]

They then compare how the two models perform on each category:

[Table: performance on each word category]

Results vary across categories, but all are above 0.8, which is quite good.

5. Conclusion

Conclusions

  • It is a truly end-to-end model, relying on no task-specific resources, feature engineering, or data pre-processing.
  • It achieves state-of-the-art performance on two linguistic sequence labeling tasks, compared with previous state-of-the-art systems.

Future Work

  • The model could be further improved by exploring multi-task learning approaches to combine more useful and correlated information.
  • The model could be applied to data from other domains such as social media.

References

  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP-2014, pages 1532–1543, Doha, Qatar, October.
  • Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3. Austin, TX.
  • Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Yann N Dauphin, Harm de Vries, Junyoung Chung, and Yoshua Bengio. 2015. Rmsprop and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390.
  • Rich Caruana, Steve Lawrence, and Lee Giles. 2001. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, volume 13, page 402. MIT Press.
  • Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of ICASSP-2013, pages 6645–6649. IEEE.
  • Nanyun Peng and Mark Dredze. 2015. Named entity recognition for chinese social media with jointly trained embeddings. In Proceedings of EMNLP-2015, pages 548–554, Lisbon, Portugal, September.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
  • Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.
  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003 - Volume 4, pages 142–147, Stroudsburg, PA, USA.