STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification

Date: 2024-04-12 09:13:02

Abstract

In this work, we propose a novel Spatial-Temporal Attention (STA) approach to tackle the large-scale person re-identification task in videos. Different from most existing methods, which simply compute representations of video clips by frame-level aggregation (e.g. average pooling), the proposed STA adopts a more effective way to produce robust clip-level feature representations. Concretely, STA fully exploits the discriminative parts of a target person in both the spatial and temporal dimensions, yielding a 2-D attention score matrix, regularized across frames, that measures the importance of spatial parts in different frames. A more robust clip-level feature representation can then be generated by a weighted-sum operation guided by the mined 2-D attention score matrix. In this way, challenging cases for video-based person re-identification such as pose variation and partial occlusion can be well handled by STA. We conduct extensive experiments on two large-scale benchmarks, i.e. MARS and DukeMTMC-VideoReID. In particular, mAP reaches 87.7% on MARS, significantly outperforming the state of the art by a large margin of more than 11.6%.

Problem

Video person ReID: given a probe video (RGB), rank the videos in the gallery.

Method

The paper proposes a new network, STA, to tackle video-based person re-identification. The claimed contributions are:
1) A weight is assigned to each spatial region, which achieves discriminative part mining and frame selection at the same time.
Compared with the Region-based Quality Estimation Network (AAAI 2018), this improves on part-based attention and is indeed more in line with the part-level features now common in ReID. It might be worth studying how the part features themselves are selected, e.g. with deformable or local attention; however, part of this is already handled during feature extraction, so there may not be much to gain.
2) An inter-frame regularization is proposed to constrain how much the attention of different frames is allowed to differ.
3) A new feature-fusion scheme.
Overall the method is not complicated; an overview:
(figure: overview of the STA framework)
Details:
1. Feature extraction:
The stride of the last stage of ResNet-50 needs to be adjusted.
(equation image: frame-level spatial attention map)
There is a problem with the formula here: the text says to first take the squared norm of the feature vector at each spatial point and then apply L2 normalization over the spatial dimension. The subsequent L1 normalization does not match the formula either; the formula only computes the L1 norm of each spatial block.
(equation image: block-wise attention scores)
After obtaining the attention score of each spatial block, the scores of the same spatial region are L1-normalized across frames.
(equation image: cross-frame L1 normalization of attention scores)
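Putting the normalization steps together, a minimal NumPy sketch of the score computation could look as follows (the function name, the block count K=4, and the exact normalization order are my reading of the text, not the paper's released code):

```python
import numpy as np

def sta_attention_scores(feats, num_blocks=4):
    """Sketch of STA attention scores for one clip.
    feats: (N, C, H, W) frame-level feature maps of N frames."""
    N, C, H, W = feats.shape
    # squared channel-wise L2 norm at every spatial location
    g = (feats ** 2).sum(axis=1)                            # (N, H, W)
    # L2-normalize over the spatial dimension of each frame
    g = g / np.linalg.norm(g.reshape(N, -1), axis=1).reshape(N, 1, 1)
    # split each frame into K horizontal blocks; score = L1 mass of the block
    blocks = g.reshape(N, num_blocks, H // num_blocks, W)
    s = blocks.sum(axis=(2, 3))                             # (N, K), nonnegative
    # L1-normalize the scores of the same spatial region across frames
    s = s / s.sum(axis=0, keepdims=True)
    return s
```

Each column of the returned (N, K) matrix sums to 1, so the scores of one spatial region can be used directly as weights over the N frames.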
2. The regularization formula, applied to the attention matrices; note that the regularization term is computed from only two randomly sampled frames. The formula follows below.

(equation image: inter-frame regularization term)
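As a rough sketch of the term described above (hypothetical function name; the use of the Frobenius norm of the difference between two randomly sampled frames' attention maps follows the description here):

```python
import numpy as np

def inter_frame_reg(attn_maps, rng=None):
    """Inter-frame regularization sketch: Frobenius norm of the difference
    between the attention maps of two randomly sampled frames.
    attn_maps: (N, H, W) per-frame attention maps of one clip."""
    rng = np.random.default_rng() if rng is None else rng
    # sample two distinct frames from the clip
    i, j = rng.choice(len(attn_maps), size=2, replace=False)
    return np.linalg.norm(attn_maps[i] - attn_maps[j], ord='fro')
```

This scalar would be added to the training loss as a regularization term alongside the identification losses.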
3. Feature fusion:
(figure: feature fusion)
The first feature map is assembled from the blocks with the highest scores; the second feature map is obtained by weighting the blocks with the attention scores; global average pooling and a fully connected layer then produce the final feature.
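A minimal NumPy sketch of this fusion step, under my own assumptions (horizontal blocks, channel-wise concatenation of the two maps, and the final fc layer omitted):

```python
import numpy as np

def sta_fusion(feats, scores, num_blocks=4):
    """Fuse frame features into a clip feature (sketch).
    feats: (N, C, H, W) frame feature maps; scores: (N, K) attention scores."""
    N, C, H, W = feats.shape
    hb = H // num_blocks
    blocks = feats.reshape(N, C, num_blocks, hb, W)       # (N, C, K, hb, W)
    # map 1: for each region, take the block of the highest-scoring frame
    best = scores.argmax(axis=0)                          # (K,)
    f1 = np.concatenate([blocks[best[k], :, k]
                         for k in range(num_blocks)], axis=1)   # (C, H, W)
    # map 2: score-weighted sum of the same region across frames
    f2 = np.concatenate([np.einsum('n,nchw->chw', scores[:, k], blocks[:, :, k])
                         for k in range(num_blocks)], axis=1)   # (C, H, W)
    # concatenate the two maps (channel-wise here, an assumption) and pool
    fused = np.concatenate([f1, f2], axis=0)              # (2C, H, W)
    return fused.mean(axis=(1, 2))                        # GAP -> (2C,)
```

In the paper the pooled vector would additionally pass through a fully connected layer before the loss; that layer is left out of this sketch.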
4. Algorithm table:
(algorithm table)
The algorithm table is clearer, but it contains a few small errors.

Results

(table: ablation study)
The ablation study shows that each of the proposed components (STA, Fusion, Reg) improves the results.