Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks (paper)

Date: 2023-02-01 10:44:23

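Key points:

The text processing follows the usual pipeline without any major differences; the paper's main contribution is the similarity matrix it proposes.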

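Computation overview:

The words of the query and of the document are first passed through a word embedding layer to obtain the corresponding sentence matrices.

A CNN processes each matrix to produce its feature maps; pooling then yields the query vector Xq and the document vector Xd.

A traditional Siamese network would at this point compute the similarity of Xq and Xd directly with Euclidean or cosine distance and predict from that. This network instead uses a similarity matrix to compute the similarity of Xq and Xd, then concatenates Xd, Xq and sim(Xq,Xd), adds the word overlap and IDF word overlap features, and feeds the resulting feature vector into a neural network layer. (Word overlap is a way of measuring sentence similarity based on the words two sentences share.)

The output of that neural network layer passes through a fully connected layer, and a softmax function produces the prediction.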
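The steps above can be sketched as a single forward pass. This is a minimal illustration rather than the paper's implementation: the join order [xq; sim; xd; x_feat], the ReLU hidden layer, the hidden size, and the random weights are all assumptions, and the function names are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(xq, xd, x_feat, M, W_h, b_h, W_out, b_out):
    # Similarity through the learned matrix M (assumed form: xq^T M xd).
    sim = sum(q * t for q, t in zip(xq, matvec(M, xd)))
    # Join layer: sentence vectors, similarity score, and overlap features.
    joined = xq + [sim] + xd + x_feat
    # Hidden layer (ReLU is an assumption), then a two-class softmax.
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W_h, joined), b_h)]
    logits = [o + b for o, b in zip(matvec(W_out, hidden), b_out)]
    return softmax(logits)

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Small random weights, used only to make the sketch runnable."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

dim, n_feat = 4, 4                 # toy sizes, not the paper's
n_join = 2 * dim + 1 + n_feat
xq = [rng.uniform(-1, 1) for _ in range(dim)]
xd = [rng.uniform(-1, 1) for _ in range(dim)]
x_feat = [0.5, 0.4, 0.3, 0.2]      # made-up overlap features
probs = forward(xq, xd, x_feat, rand_matrix(dim, dim),
                rand_matrix(n_join, n_join), [0.0] * n_join,
                rand_matrix(2, n_join), [0.0, 0.0])
print(probs)  # two class probabilities that sum to 1
```

In a pointwise reranker, the probability of the "relevant" class would serve as the ranking score for each candidate document.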

questions and documents are limited to a single sentence.

The main building blocks of our architecture are two distributional sentence models based on convolutional neural networks. These underlying sentence models work in parallel, mapping queries and documents to their distributional vectors, which are then used to learn the semantic similarity between them.

our model encodes query-document pairs in a rich representation using not only their similarity score but also their intermediate representations; (iii) the architecture of our network makes it straightforward to include any additional similarity features to the model.

However, their model operates only on unigram or bigrams, while our architecture learns to extract and compose n-grams of higher degrees, thus allowing for capturing longer range dependencies. Additionally, our architecture uses not only the intermediate representations of questions and answers to compute their similarity but also includes them in the final representation, which constitutes a much richer representation of the question-answer pairs.

pointwise:
it is enough to train a binary classifier
pairwise:
the model is explicitly trained to score correct pairs higher than incorrect pairs with a certain margin
Comparison:
The pairwise approach requires considering a larger number of training instances (potentially quadratic in the size of the candidate document set) than the pointwise method, which may lead to slower training times. Still, both pointwise and pairwise approaches ignore the fact that ranking is a prediction task on a list of objects.
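To make the contrast concrete, here is a small sketch of the two objectives and of how the number of training instances grows. The specific loss forms (binary cross-entropy for pointwise, hinge loss with margin 1.0 for pairwise) are common choices assumed here, not details taken from these notes.

```python
import math

def pointwise_loss(prob_relevant, label):
    """Binary cross-entropy for a single (query, document) pair."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob_relevant + eps)
             + (1 - label) * math.log(1.0 - prob_relevant + eps))

def pairwise_hinge_loss(score_correct, score_incorrect, margin=1.0):
    """Zero once the correct document outscores the incorrect one by `margin`."""
    return max(0.0, margin - (score_correct - score_incorrect))

def count_pairs(labels):
    """Number of (correct, incorrect) pairs one query contributes."""
    n_correct = sum(labels)
    return n_correct * (len(labels) - n_correct)

labels = [1, 0, 0, 1, 0, 0]                # relevance of six candidates
pointwise_instances = len(labels)          # one instance per candidate
pairwise_instances = count_pairs(labels)   # one instance per (correct, incorrect) pair
print(pointwise_instances, pairwise_instances)
```

With two correct and four incorrect candidates the pairwise setup already yields eight instances against six, and the gap widens as the candidate list grows.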

Most often, producing a better representation ψ() that encodes various aspects of similarity between the input query-document pairs plays a far more important role in training an accurate reranker than choosing between different ranking approaches. Hence, in this paper we adopt a simple pointwise method to reranking and focus on modelling a rich representation of query-document pairs using deep learning approaches which is described next.

Our network is composed of a single wide convolutional layer followed by a non-linearity and simple max pooling. (That is, a wide convolutional network.)

The range of allowed values for i defines two types of convolution: narrow and wide. The narrow type restricts i to be in the range [1, |s| − m + 1], which in turn restricts the filter width to be ≤ |s|. To compute the wide type of convolution i ranges from 1 to |s| and sets no restrictions on the size of m and s. The benefits of one type of convolution over the other when dealing with text are discussed in detail in [18]. In short, the wide convolution is able to better handle words at boundaries giving equal attention to all words in the sentence, unlike in narrow convolution, where words close to boundaries are seen fewer times. More importantly, wide convolution also guarantees to always yield valid values even when s is shorter than the filter size m.
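A short sketch of the two convolution types on a single row of values. It assumes the standard zero-padded ("full") definition of wide convolution, which produces |s| + m − 1 outputs; the function name and the toy inputs are illustrative only.

```python
def convolve(s, f, wide):
    """1-D convolution of sequence s with filter f.

    narrow: the filter must fully overlap s, giving len(s) - len(f) + 1 outputs
    wide:   s is zero-padded by len(f) - 1 on each side (assumed "full" mode),
            giving len(s) + len(f) - 1 outputs
    """
    m = len(f)
    s = list(s)
    if wide:
        pad = [0.0] * (m - 1)
        s = pad + s + pad
    n_out = max(0, len(s) - m + 1)
    flipped = list(f)[::-1]  # flip the filter: convolution, not cross-correlation
    return [sum(s[i + j] * flipped[j] for j in range(m)) for i in range(n_out)]

sentence = [1.0, 2.0, 3.0]          # |s| = 3
filt = [1.0, 1.0, 1.0, 1.0, 1.0]    # m = 5, wider than the sentence

narrow = convolve(sentence, filt, wide=False)
wide = convolve(sentence, filt, wide=True)
print(narrow)  # [] : no valid position when the sentence is shorter than the filter
print(wide)    # [1.0, 3.0, 6.0, 6.0, 6.0, 5.0, 3.0]
```

The narrow variant returns nothing for this three-word input, while the wide variant still produces values and lets every word, including those at the boundaries, take part in several windows.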

It should be noted that an alternative way of computing a convolution was explored in [18], where a series of convolutions are computed between each row of a sentence matrix and a corresponding row of the filter matrix. Essentially, it is a vectorized form of 1d convolution applied between corresponding rows of S and F. As a result, the output feature map is a matrix C ∈ R.
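The row-wise variant can be sketched as follows. This assumes each row uses a wide (zero-padded) 1-D convolution and that S and F have the same number of rows; the helper names and toy matrices are hypothetical.

```python
def wide_conv1d(row, filt):
    """Wide 1-D convolution of one sentence-matrix row with one filter row."""
    m = len(filt)
    padded = [0.0] * (m - 1) + list(row) + [0.0] * (m - 1)
    flipped = list(filt)[::-1]
    return [sum(padded[i + j] * flipped[j] for j in range(m))
            for i in range(len(padded) - m + 1)]

def rowwise_conv(S, F):
    """Convolve row k of S with row k of F; the result is a matrix, not a vector."""
    return [wide_conv1d(s_row, f_row) for s_row, f_row in zip(S, F)]

S = [[1.0, 2.0, 3.0],   # 2 embedding dimensions x 3 words
     [0.0, 1.0, 0.0]]
F = [[1.0, 1.0],        # 2 embedding dimensions x filter width 2
     [2.0, -1.0]]
C = rowwise_conv(S, F)
print(C)  # [[1.0, 3.0, 5.0, 3.0], [0.0, 2.0, -1.0, 0.0]]
```

Under these assumptions the feature map keeps one row per embedding dimension, here 2 rows of 3 + 2 − 1 = 4 values each.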

Among the most common choices of activation functions are the following: sigmoid (or logistic), hyperbolic tangent tanh, and a rectified linear (ReLU) function defined as simply max(0, x) to ensure that feature maps are always non-negative.
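For reference, the three activations applied to the same toy feature map (the input values are made up for illustration):

```python
import math

def sigmoid(x):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit, max(0, x)."""
    return max(0.0, x)

feature_map = [-2.0, -0.5, 0.0, 0.5, 2.0]
sigmoid_out = [sigmoid(v) for v in feature_map]
tanh_out = [math.tanh(v) for v in feature_map]
relu_out = [relu(v) for v in feature_map]
print(relu_out)  # [0.0, 0.0, 0.0, 0.5, 2.0]
```

Only tanh produces negative activations here, which matters for the pooling discussion that follows.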

Both average and max pooling methods exhibit certain disadvantages: in average pooling, all elements of the input are considered, which may weaken strong activation values. This is especially critical with tanh non-linearity, where strong positive and negative activations can cancel each other out. The max pooling is used more widely and does not suffer from the drawbacks of average pooling. However, as shown in [40], it can lead to strong overfitting on the training set and, hence, poor generalization on the test data.

Recently, max pooling has been generalized to k-max pooling [18], where instead of a single max value, k values are extracted in their original order. This allows for extracting several largest activation values from the input sentence.
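The three pooling variants can be compared on one toy feature map. The values below are invented and stand in for tanh activations of a single filter.

```python
def average_pool(values):
    """Mean of all activations."""
    return sum(values) / len(values)

def k_max_pool(values, k):
    """Keep the k largest activations, preserving their original order."""
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]

feature_map = [0.1, 0.9, -0.9, 0.4, 0.8, -0.7]

avg = average_pool(feature_map)   # strong +/- activations largely cancel
mx = max(feature_map)             # plain max pooling
kmax = k_max_pool(feature_map, 3)
print(avg, mx, kmax)
```

Average pooling collapses the map to roughly 0.1 even though it contains activations near ±0.9, max pooling keeps only 0.9, and 3-max pooling keeps 0.9, 0.4, 0.8 in sentence order.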

Our architecture for matching text pairs:
Our sentence models based on ConvNets learn to map input sentences to vectors, which can then be used to compute their similarity. These are then used to compute a query-document similarity score, which together with the query and document vectors are joined in a single representation.

Measuring the similarity between query and document:
In this model, we seek a transformation of the candidate document x'd = M xd that is the closest to the input query xq. The similarity matrix M is a parameter of the network and is optimized during training.
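A minimal sketch of this similarity, reading the passage as scoring a pair by the dot product between xq and the transformed document M xd, i.e. sim(xq, xd) = xq^T M xd. That reading is an assumption, and the toy vectors and matrices are made up.

```python
def similarity(xq, M, xd):
    """Bilinear similarity xq^T M xd, with M given as a list of rows."""
    xd_transformed = [sum(m * v for m, v in zip(row, xd)) for row in M]  # M xd
    return sum(q * t for q, t in zip(xq, xd_transformed))

xq = [1.0, 2.0]
xd = [3.0, 4.0]

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

print(similarity(xq, identity, xd))  # 11.0, the plain dot product
print(similarity(xq, swap, xd))      # 10.0, dimensions matched crosswise
```

With M fixed to the identity this reduces to an ordinary dot product; learning M lets the network decide which dimensions of the document vector should line up with which dimensions of the query vector.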

Adagrad scales the learning rate of SGD on each dimension based on the l2 norm of the history of the error gradient. Adadelta uses both the error gradient history like Adagrad and the weight update history. It has the advantage of not having to set a learning rate at all.
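The two update rules can be written out on a toy quadratic. This is a sketch of the commonly published forms of Adagrad and Adadelta; the objective, the starting point, and the constants (lr, rho, eps) are assumptions chosen for illustration.

```python
import math

def loss(x):
    """Toy objective with very different curvature per dimension."""
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return [2.0 * x[0], 20.0 * x[1]]

def adagrad_step(x, g, sq_sum, lr=0.5, eps=1e-8):
    """Scale each dimension by the l2 norm of its gradient history."""
    for i in range(len(x)):
        sq_sum[i] += g[i] ** 2
        x[i] -= lr * g[i] / (math.sqrt(sq_sum[i]) + eps)

def adadelta_step(x, g, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """Use decaying averages of squared gradients and squared updates; no learning rate."""
    for i in range(len(x)):
        avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g[i] ** 2
        delta = -math.sqrt(avg_sq_delta[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g[i]
        avg_sq_delta[i] = rho * avg_sq_delta[i] + (1 - rho) * delta ** 2
        x[i] += delta

x_adagrad, sq_sum = [1.0, 1.0], [0.0, 0.0]
x_adadelta, avg_g, avg_d = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for _ in range(100):
    adagrad_step(x_adagrad, grad(x_adagrad), sq_sum)
    adadelta_step(x_adadelta, grad(x_adadelta), avg_g, avg_d)

print(loss([1.0, 1.0]), loss(x_adagrad), loss(x_adadelta))
```

Note that `adadelta_step` takes no learning rate argument: the step size comes from the ratio of the two running averages.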

Hyperparameters:
the width m of the convolution filters is set to 5 and the number of convolutional feature maps is 100. We use ReLU activation function and a simple max-pooling.
--
To train the network we use stochastic gradient descent with shuffled mini-batches. We eliminate the need to tune the learning rate by using the Adadelta update rule [39]. The batch size is set to 50 examples. The network is trained for 25 epochs with early stopping, i.e., we stop the training if no update to the best accuracy on the dev set has been made for the last 5 epochs. The accuracy computed on the dev set is the MAP score. At test time we use the parameters of the network that were obtained with the best MAP score on the development (dev) set, i.e., we compute the MAP score after each 10 mini-batch updates and save the network parameters if a new best dev MAP score was obtained. In practice, the training converges after a few epochs. We set a value for L2 regularization term to 1e−5 for the parameters of convolutional layers and 1e−4 for all the others. The dropout rate is set to p = 0.5.
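The early-stopping rule can be sketched as below. To keep it self-contained, the dev MAP scores are a made-up list standing in for real training, and the check happens once per epoch rather than every 10 mini-batch updates; both are simplifications.

```python
def train_with_early_stopping(dev_map_per_epoch, max_epochs=25, patience=5):
    """Return (best_epoch, best_map, last_epoch_run) for a sequence of dev MAP scores."""
    best_map, best_epoch, last_epoch = float("-inf"), -1, -1
    for epoch, dev_map in enumerate(dev_map_per_epoch[:max_epochs]):
        last_epoch = epoch
        if dev_map > best_map:
            # A new best dev score: this is where parameters would be saved.
            best_map, best_epoch = dev_map, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch, best_map, last_epoch

# Hypothetical dev MAP after each epoch (epochs are 0-indexed here).
dev_scores = [0.60, 0.65, 0.70, 0.69, 0.68, 0.70, 0.69, 0.66, 0.65, 0.64, 0.80]
best_epoch, best_map, stopped_at = train_with_early_stopping(dev_scores)
print(best_epoch, best_map, stopped_at)  # 2 0.7 7
```

Training halts at epoch 7 after five epochs without a new best, so the later 0.80 is never reached; the parameters saved at epoch 2 are the ones that would be used at test time.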
--
we keep the word embeddings fixed and initialize the word matrix W from an unsupervised neural language model.
--
We choose the dimensionality of our word embeddings to be 50 to be in line with the deep learning model of [38].
--
Word embeddings. We initialize the word embeddings by running word2vec tool [20] on the English Wikipedia dump and the AQUAINT corpus containing roughly 375 million words.
To train the embeddings we use the skipgram model with window size 5 and filtering words with frequency less than 5. The resulting model contains 50-dimensional vectors for about 3.5 million words. Embeddings for words not present in the word2vec model are randomly initialized with each component sampled from the uniform distribution U[−0.25, 0.25]. We minimally preprocess the data only performing tokenization and lowercasing all words. To reduce the size of the resulting vocabulary V, we also replace all digits with 0. The size of the word vocabulary V for experiments using TRAIN set is 17,023 with approximately 95% of words initialized using word2vec embeddings and the remaining 5% words are initialized at random as described in Sec.
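A small sketch of the preprocessing and of the embedding initialization described above. Whitespace tokenization and the `pretrained` dictionary are stand-ins (the notes do not name a tokenizer, and real vectors would come from the word2vec model); the function names are hypothetical.

```python
import random
import re

def preprocess(text):
    """Lowercase, replace every digit with 0, and split on whitespace (assumed tokenizer)."""
    return re.sub(r"\d", "0", text.lower()).split()

def init_embeddings(vocab, pretrained, dim=50, seed=0):
    """Copy pretrained vectors where available, otherwise sample U[-0.25, 0.25]."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return table

tokens = preprocess("In 1998 Apple released the iMac")
print(tokens)  # ['in', '0000', 'apple', 'released', 'the', 'imac']

pretrained = {"apple": [0.1] * 50}   # placeholder for word2vec vectors
table = init_embeddings(tokens, pretrained)
covered = sum(1 for w in tokens if w in pretrained) / len(tokens)
print(f"{covered:.0%} of tokens initialized from pretrained vectors")
```

Mapping all digits to 0 folds every four-digit year into the single token 0000, which is how the vocabulary shrinks.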
--
Additional features. Given that a certain percentage of the words in our word embedding matrix are initialized at random (about 15% for the TRAIN-ALL) and a relatively small number of QA pairs prevents the network from directly learning them from the training data, similarity matching performed by the network will be suboptimal between many question-answer pairs.

In particular, we compute word overlap measures between each question-answer pair and include it as an additional feature vector xfeat in our model. This feature vector contains only four features: word overlap and IDF-weighted word overlap computed between all words and only non-stop words. Computing these features is straightforward and does not require additional pre-processing or external resources.
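One possible way to compute the four features is sketched below. The notes do not give the exact formulas, so the normalization (shared words over the union of both word sets), the IDF definition, the stop-word list, and the three-sentence corpus are all assumptions.

```python
import math

STOPWORDS = {"the", "is", "a", "of", "in", "what"}  # tiny illustrative list

def compute_idf(documents):
    """IDF as log(N / document frequency), from tokenized documents."""
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return {w: math.log(len(documents) / c) for w, c in df.items()}

def overlap(q, a, idf=None):
    """Weight of shared words divided by weight of all words in either text."""
    q, a = set(q), set(a)
    weight = (lambda w: idf.get(w, 0.0)) if idf is not None else (lambda w: 1.0)
    denom = sum(weight(w) for w in q | a)
    return sum(weight(w) for w in q & a) / denom if denom else 0.0

def overlap_features(q, a, idf):
    """[overlap, IDF overlap] over all words, then over non-stop words."""
    q_ns = [w for w in q if w not in STOPWORDS]
    a_ns = [w for w in a if w not in STOPWORDS]
    return [overlap(q, a), overlap(q, a, idf),
            overlap(q_ns, a_ns), overlap(q_ns, a_ns, idf)]

question = "what is the capital of france".split()
answer = "paris is the capital of france".split()
corpus = [question, answer, "the weather in paris is mild".split()]

x_feat = overlap_features(question, answer, compute_idf(corpus))
print(x_feat)
```

These features depend only on surface word matches, so they remain informative even for words whose embeddings were initialized at random.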

Evaluation:
MRR: MRR is only looking at the rank of the first correct answer, hence it is more suitable in cases where for each question there is only a single correct answer.
MAP: examines the ranks of all the correct answers. It is computed as the mean over the average precision scores of all queries.
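Both metrics are easy to compute from ranked relevance labels. The sketch below uses the standard definitions; the two example rankings are made up.

```python
def reciprocal_rank(ranked_labels):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_labels):
    """Mean of precision@rank taken at the rank of every correct answer."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean(values):
    return sum(values) / len(values)

# One list per query: 1 marks a correct answer, in the order the model ranked them.
rankings = [[0, 1, 0, 1],
            [1, 0, 0, 0]]

mrr = mean([reciprocal_rank(r) for r in rankings])
map_score = mean([average_precision(r) for r in rankings])
print(mrr, map_score)  # 0.75 0.75
```

For the first query MRR only sees the answer at rank 2, while average precision also accounts for the second correct answer at rank 4.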