Paper notes: [ACL 2016] End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

Date: 2022-09-13 10:50:01

Paper: Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. 2016. Published at ACL 2016. I found this a very clearly written paper; the problem it addresses is sequence labeling. My reading notes follow.

1. Introduction




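Traditional sequence labeling models have two drawbacks:

  • heavy reliance on hand-crafted features
  • dependence on task-specific resources

This makes these models difficult to adapt to new tasks or new domains.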


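In recent years, several non-linear neural network models that take word embeddings as input have been quite successful. These include feed-forward networks, recurrent neural networks (RNN), long short-term memory (LSTM), and GRU, all of which have achieved very competitive results.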


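These models use word embeddings as parameters alongside hand-crafted features rather than as a replacement for them, and the authors say that relying on embeddings alone makes the results much worse: “Their performance drops rapidly when the models solely depend on neural embeddings.” I do not understand what exactly "solely depend on" refers to here, and no reference is given.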


The authors list three contributions:

  1. a novel neural network architecture for linguistic sequence labeling.
  2. empirical evaluations on benchmark data sets for two classic NLP tasks.
  3. state-of-the-art performance with a truly end-to-end system.

The authors emphasize the value of being end-to-end:

  • no task-specific resources,
  • no feature engineering,
  • no data pre-processing beyond pre-trained word embeddings on unlabeled corpora.

2. Proposed Method

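Step 1: Use a character-level CNN to obtain a character-level representation of each word.
Step 2: Concatenate the representation from Step 1 with the pre-trained word embedding and feed the result into a bi-directional LSTM (BLSTM) to obtain a representation for each position. Note that both the input and the output of the BLSTM pass through a dropout layer (not drawn in the architecture figure).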
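To make Steps 1 and 2 concrete, here is a minimal pure-Python sketch of how the BLSTM input for one word could be assembled. It is not the authors' implementation: the function names (`char_cnn`, `word_input`) and the toy random tables are my own assumptions, the convolution has no bias or non-linearity, and the sizes (30-dimensional character embeddings, 30 filters, window length 3, 100-dimensional word embeddings) are taken from the settings listed later in these notes.

```python
import random
import string

def char_cnn(char_vectors, filters, window=3):
    """Convolve over a word's character embeddings, then max-pool over positions."""
    dim = len(char_vectors[0])
    pad = [[0.0] * dim for _ in range(window - 1)]
    padded = pad + char_vectors + pad  # zero padding so even 1-letter words yield windows
    features = []
    for weights in filters:
        best = float("-inf")
        for start in range(len(padded) - window + 1):
            # Flatten the current window of character vectors into one long vector.
            flat = [x for vec in padded[start:start + window] for x in vec]
            best = max(best, sum(w * x for w, x in zip(weights, flat)))
        features.append(best)  # one max-pooled value per filter
    return features

def word_input(word, char_emb, filters, word_emb):
    """Concatenate character-level CNN features with the pre-trained word vector."""
    chars = [char_emb[c] for c in word]
    return char_cnn(chars, filters) + word_emb[word]

# Toy, randomly filled tables (hypothetical values; real ones are learned or pre-trained).
rng = random.Random(0)
CHAR_DIM, NUM_FILTERS, WORD_DIM, WINDOW = 30, 30, 100, 3
char_emb = {c: [rng.uniform(-0.3, 0.3) for _ in range(CHAR_DIM)]
            for c in string.ascii_letters}
filters = [[rng.uniform(-0.1, 0.1) for _ in range(WINDOW * CHAR_DIM)]
           for _ in range(NUM_FILTERS)]
word_emb = {"London": [rng.uniform(-0.1, 0.1) for _ in range(WORD_DIM)]}

vec = word_input("London", char_emb, filters, word_emb)
print(len(vec))  # 30 CNN features + 100 word-embedding dimensions = 130
```

Each word therefore enters the BLSTM as a 130-dimensional vector under these sizes. Max-pooling over positions is what lets words of different lengths map to a fixed-size representation.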



3. 模型训练

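Since the model includes a CNN component, a GPU is naturally needed. The authors report using a GeForce GTX TITAN X GPU; training took 12 hours for POS tagging and 8 hours for NER. In this part the authors describe the parameter settings and the optimization algorithms, which involve quite a few tricks. The notes here combine Section 3 (model training) and Section 4 (experiments) of the paper.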

  • Glove: 100-dimensional embeddings trained on 6 billion words from Wikipedia and web text (Pennington et al., 2014) http://nlp.stanford.edu/projects/glove/
  • Senna: 50-dimensional embeddings trained on Wikipedia and Reuters RCV-1 corpus (Collobert et al., 2011) http://ronan.collobert.com/senna/
  • Word2Vec: 300-dimensional embeddings trained on 100 billion words from Google News (Mikolov et al., 2013) https://code.google.com/archive/p/word2vec/
  • Random (word embeddings): 100-dimensional embeddings uniformly sampled from the range [-\sqrt{\frac{3}{dim}}, +\sqrt{\frac{3}{dim}}], where dim is the dimension of the embeddings.
  • Random (character embeddings): 30-dimensional embeddings uniformly sampled from the range [-\sqrt{\frac{3}{dim}}, +\sqrt{\frac{3}{dim}}], where dim is the dimension of the embeddings.
  • Random (weight matrices): matrix parameters are randomly initialized with uniform samples from [-\sqrt{\frac{6}{r+c}}, +\sqrt{\frac{6}{r+c}}], where r and c are the numbers of rows and columns in the structure.
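The two sampling ranges above are easy to check numerically. The sketch below only illustrates the formulas; the helper names and the 200 x 130 matrix shape are hypothetical choices of mine, not values stated in the paper.

```python
import math
import random

def embedding_bound(dim):
    """Half-width of the uniform range for embeddings: sqrt(3 / dim)."""
    return math.sqrt(3.0 / dim)

def matrix_bound(rows, cols):
    """Half-width of the uniform range for weight matrices: sqrt(6 / (r + c))."""
    return math.sqrt(6.0 / (rows + cols))

def uniform_vector(dim, rng):
    bound = embedding_bound(dim)
    return [rng.uniform(-bound, bound) for _ in range(dim)]

def uniform_matrix(rows, cols, rng):
    bound = matrix_bound(rows, cols)
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
word_vec = uniform_vector(100, rng)      # random word embedding
char_vec = uniform_vector(30, rng)       # random character embedding
weights = uniform_matrix(200, 130, rng)  # hypothetical 200 x 130 weight matrix

print(round(embedding_bound(100), 4))    # 0.1732
print(round(embedding_bound(30), 4))     # 0.3162
print(round(matrix_bound(200, 130), 4))  # 0.1348
```

Larger embeddings or matrices get a narrower range, which keeps the scale of the initial activations roughly independent of layer size.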

  • SGD: batch size 10, momentum 0.9, and initial learning rate η0 (η0 = 0.01 for POS tagging and 0.015 for NER). To reduce the effects of “gradient exploding”, we use a gradient clipping of 5.0 (Pascanu et al., 2012). This optimizer performed best in the experiments.
  • AdaDelta: (Zeiler, 2012)
  • Adam: (Kingma and Ba, 2014)
  • RMSProp: (Dauphin et al., 2015)

Early Stopping

We use early stopping (Giles, 2001; Graves et al., 2013) based on performance on validation sets. The “best” parameters appear at around 50 epochs, according to our experiments.
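The SGD settings listed above and the early-stopping rule can be sketched together in a few lines. This is only an illustration under my own assumptions: the clipping threshold of 5.0 is applied to the L2 norm of the whole gradient, the momentum update uses one common formulation (velocity = momentum * velocity - lr * gradient), and early stopping is treated simply as picking the epoch with the best development score. All function names and the score list are hypothetical.

```python
import math

def clip_by_norm(grads, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm (assumed norm clipping)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update on clipped gradients; returns (params, velocity)."""
    grads = clip_by_norm(grads)
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    params = [p + v for p, v in zip(params, velocity)]
    return params, velocity

def best_epoch(dev_scores):
    """Early stopping as model selection: index of the best development score."""
    return max(range(len(dev_scores)), key=lambda i: dev_scores[i])

print(clip_by_norm([6.0, 8.0]))  # norm 10 is scaled down to norm 5: [3.0, 4.0]

params, velocity = sgd_momentum_step([1.0, 1.0], [6.0, 8.0], [0.0, 0.0])
print(params)  # approximately [0.97, 0.96]

# Hypothetical development-set scores, one per epoch.
print(best_epoch([0.90, 0.93, 0.95, 0.94]))  # 2 (zero-based index)
```

In practice the parameters saved at the selected epoch are the ones evaluated on the test set.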

Fine Tuning

For each of the embeddings, we fine-tune initial embeddings, modifying them during gradient updates of the neural network model by back-propagating gradients. The effectiveness of this method has been previously explored in sequential and structured prediction problems (Collobert et al., 2011; Peng and Dredze, 2015).

Dropout Training

To mitigate overfitting, we apply the dropout method (Srivastava et al., 2014) to regularize our model. We fix dropout rate at 0.5 for all dropout layers through all the experiments. We obtain significant improvements on model performance after using dropout.
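As a minimal sketch of what a dropout layer with the fixed rate of 0.5 does, the function below implements the "inverted" variant, which rescales the surviving units during training so that nothing changes at test time. Whether the authors' code uses this variant is an assumption on my part, and the function name and signature are hypothetical.

```python
import random

def dropout(values, rate=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)  # identity at test time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
hidden = [1.0] * 8
dropped = dropout(hidden, rng=rng)
print(dropped)                          # every entry is either 0.0 or 2.0
print(dropout(hidden, training=False))  # unchanged: eight 1.0 values
```

With rate 0.5 about half of the units are zeroed in each training step and the rest are doubled, so the expected activation stays the same.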
Adding dropout layers improves the results on both tasks.

Tuning Hyper-Parameters

We tune the hyper-parameters on the development sets by random search. Due to time constraints it is infeasible to do a random search across the full hyper-parameter space. Thus, for the tasks of POS tagging and NER we try to share as many hyper-parameters as possible. Note that the final hyper-parameters for these two tasks are almost the same, except the initial learning rate. We set the state size of LSTM to 200. Tuning this parameter did not significantly impact the performance of our model. For CNN, we use 30 filters with window length 3.

4. Experiments

Data Sets

  • POS Tagging: Wall Street Journal (WSJ) portion of Penn Treebank (PTB) (Marcus et al., 1993)
  • NER: English data from CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003)

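  • Rows 1-2: BRNN is worse than BLSTM.
  • Rows 2-4: BLSTM alone is not the best.
  • Rows 3-4: BLSTM-CNN improves over BLSTM alone but is still not the best.
  • Row 4: the model combination proposed in this paper achieves the best results on both tasks. (I think the authors made a typo in this row; it should read BLSTM-CNN-CRF.)
  • However, the authors do not report results for a CRF alone. Logically, a CRF by itself might already perform well, making the BLSTM-CNN part unnecessary. Common sense suggests that a CRF alone would do worse, but the argument is not fully rigorous.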

Word Embeddings


  • NER relies more heavily on pretrained embeddings than POS tagging.
  • Possibly because Word2Vec was trained in a case-sensitive manner, it performs worse on the NER task.


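  • Word2Vec should be trained in a case-insensitive manner before drawing this conclusion.
  • The embeddings differ in both dimensionality and training corpus, so the comparison is not entirely rigorous.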

OOV Error Analysis

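Since the authors use a character-level CNN, the model is naturally well suited to the out-of-vocabulary (OOV) problem. The paper breaks the OOV problem down further and defines the following categories: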

  • in-vocabulary words(IV)
  • out-of-training-vocabulary words (OOTV)
  • out-of-embedding-vocabulary words (OOEV)
  • out-of-both-vocabulary words (OOBV)
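The four categories amount to a two-way membership test against the training vocabulary and the embedding vocabulary. The sketch below follows my reading of the paper's definitions (IV: in both; OOTV: missing from the training set but present in the embeddings; OOEV: present in the training set but missing from the embeddings; OOBV: in neither). The function name and the toy vocabularies are hypothetical.

```python
def oov_category(word, train_vocab, emb_vocab):
    """Assign a word to IV, OOTV, OOEV, or OOBV by vocabulary membership."""
    in_train = word in train_vocab
    in_emb = word in emb_vocab
    if in_train and in_emb:
        return "IV"
    if in_emb:
        return "OOTV"  # unseen in training, but has a pre-trained embedding
    if in_train:
        return "OOEV"  # seen in training, but has no pre-trained embedding
    return "OOBV"      # in neither vocabulary

# Toy vocabularies, made up for illustration only.
train_vocab = {"the", "bank", "Pierre"}
emb_vocab = {"the", "bank", "river"}

for word in ["bank", "river", "Pierre", "Vinken"]:
    print(word, oov_category(word, train_vocab, emb_vocab))
# bank IV
# river OOTV
# Pierre OOEV
# Vinken OOBV
```

Splitting test words this way shows where a character-level component could help most: on words the word-level model has never observed in one or both vocabularies.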


5. Conclusion


  • It is a truly end-to-end model relying on no task-specific resources, feature engineering, or data pre-processing.
  • It achieves state-of-the-art performance on two linguistic sequence labeling tasks, compared with previous state-of-the-art systems.

Future Work

  • The model can be further improved by exploring multi-task learning approaches to combine more useful and correlated information.
  • The model can be applied to data from other domains such as social media.


