This paper was a CVPR 2019 oral. It brings graph learning into the person search task, and it is a very good paper, so here are my study notes~

1. Introduction

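This paper digs further into the context information in an image and uses it to support the person search decision. The basic idea is to look for people who appear in both the query image and the gallery image, not only the target person. Take the wedding scene in Figure 3: we want to decide whether the groom (red box) in the two photos is the same person, and the bride and the flower girl (green boxes), who appear in both photos, clearly help us decide. This is what the paper calls context information. In many cases, especially when several cameras capture a scene at the same time or the frames are shot at nearby moments, the two scenes contain several pairs of the same identities. Since these pedestrians are present in both images, they can help us make a better decision, and they are particularly helpful when the features of the target identity itself are not discriminative enough. The downside is that when the time gap is large and the target person is in a different environment, especially in outdoor surveillance video, this context information helps much less.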

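Given a probe-gallery image pair and a target person (presumably the target person annotated in the query image?), the pipeline has three steps: ① the contextual instance expansion module looks for context information, concretely by collecting all pedestrians in the scene as context candidates; ② the relative attention module filters them: it considers all context candidates in the probe and gallery images and outputs matched pairs as the informative context; ③ a context graph is built and a graph learning framework computes the similarity of the target pair in the query and gallery, where the graph nodes consist of the target pair and the context pairs, and every context node is connected to the target node. My own more direct reading is: ① detect pedestrians; ② coarsely match the pedestrians detected in the query and the gallery to get matched pairs; ③ build a graph from the target pair and the context pairs and judge the similarity of the target pair jointly.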

The whole framework is built on top of the framework from the first deep-learning-based person search paper, as shown in Figure 1.

[Paper notes] CVPR2019_Learning Context Graph for Person Search

3. Methodology

3.1 Overview

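The core idea is to extend the expressive power of the instance feature: features are no longer extracted from the target pedestrian alone, and the surrounding pedestrians are learned into the representation as well.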

Instance Detection and Feature Learning

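This part mainly modifies the Faster R-CNN framework to do pedestrian detection and feature learning jointly. It also brings in the part-based feature learning framework from person re-ID to make the learned features more discriminative.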

Contextual Instance Expansion

All instance pairs in the query and gallery images are taken as context candidates. A relative attention layer measures the visual similarity of each context pair, and the instance pairs with high enough confidence are kept as informative contexts.
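To make the filtering step concrete, here is a minimal sketch of how matched context pairs could be picked from a similarity matrix. The greedy one-to-one matching, the function name `select_context_pairs` and the `threshold` value are my own assumptions for illustration; the notes only say that pairs with high enough confidence are kept.

```python
def select_context_pairs(sim, threshold=0.5):
    """Greedy one-to-one matching of context candidates (illustrative only).

    sim[i][j] is the similarity between query candidate i and gallery
    candidate j. Returns (i, j, score) tuples, highest score first.
    """
    # Flatten the matrix and visit candidate pairs from most to least similar.
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row)),
        reverse=True,
    )
    used_query, used_gallery, pairs = set(), set(), []
    for s, i, j in candidates:
        if s < threshold:
            break  # every remaining pair is below the confidence bar
        if i in used_query or j in used_gallery:
            continue  # each person is matched at most once
        used_query.add(i)
        used_gallery.add(j)
        pairs.append((i, j, s))
    return pairs


# Toy similarity matrix: 3 query candidates x 3 gallery candidates.
sim = [
    [0.91, 0.20, 0.15],
    [0.30, 0.78, 0.42],
    [0.10, 0.35, 0.40],
]
pairs = select_context_pairs(sim, threshold=0.5)
print(pairs)
```

With this toy matrix only two pairs clear the threshold, so the third person in each image is simply not used as context.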

Contextual Graph Representation Learning

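For a probe-gallery image pair, a graph is built to compute the similarity of the target pair. The graph nodes hold the target persons and the associated context pairs, and graph edges link them. A graph convolutional network then learns the similarity of the probe-gallery pair.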
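The graph structure described here can be sketched in a few lines. Node 0 stands for the target pair and the other nodes for context pairs, each connected only to the target node. The symmetric normalization with self-loops and the single ReLU layer follow the common GCN formulation; the node features and the weight matrix below are made-up toy values, not the paper's actual design.

```python
import math


def star_adjacency(num_context):
    """Adjacency of a star graph: node 0 (target pair) linked to every context node."""
    n = num_context + 1
    adj = [[0.0] * n for _ in range(n)]
    for k in range(1, n):
        adj[0][k] = adj[k][0] = 1.0
    return adj


def normalize(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the usual GCN normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(a_norm, feats, weight):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


adj = star_adjacency(num_context=2)          # 1 target node + 2 context nodes
a_norm = normalize(adj)
feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # toy 2-d node features (assumed)
weight = [[1.0, -1.0], [0.5, 1.0]]            # toy layer weights (assumed)
hidden = gcn_layer(a_norm, feats, weight)
print(hidden[0])  # updated feature of the target node
```

Because the context nodes are not linked to each other, information from a context pair reaches the target node in one step, while two context pairs can only influence each other through the target.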

3.2. Instance Detection and Feature Learning

3.2.1 Pedestrian Detection

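The basic framework of this part is shown in Figure 2. ResNet-50 is the backbone, split into two parts. The image first goes through conv1 to conv4_3 for feature extraction, which outputs a feature map with 1024 channels at 1/16 of the input size; a PPN (in effect an RPN) followed by NMS then produces the region proposals. Like the RPN, the PPN is trained with two losses: a binary softmax for person/non-person classification and a linear layer for bbox regression. All proposals go through ROI Pooling and are fed into the second part of ResNet-50, conv4_4 to conv5_3. Average pooling then gives a 2048-d feature, which feeds two FC branches: a binary softmax layer for person/non-person classification, and a 256-d FC layer (the id-feat of the OIM paper) whose output is further L2-normalized and used as the feature representation at inference.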
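The reason for L2-normalizing the id-feat is easy to check numerically: once both vectors have unit length, a plain inner product equals their cosine similarity. The sketch below uses 4-d toy vectors in place of the 256-d features, and the helper names are my own.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


feat_a = [3.0, 4.0, 0.0, 0.0]   # toy stand-in for a 256-d id-feat
feat_b = [1.0, 2.0, 2.0, 0.0]

unit_a = l2_normalize(feat_a)
unit_b = l2_normalize(feat_b)

# After normalization the inner product is the cosine similarity.
print(dot(unit_a, unit_b), cosine(feat_a, feat_b))
```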

The authors presumably feed both images through the network to get proposals and produce the outputs for each image separately.

3.2.2 Region-based Feature Learning

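Besides the 256-d feature obtained from global average pooling, the authors also borrow the idea of part-based re-ID: the $7\times7$ feature map is divided into three overlapping $7\times3$ regions, and an FC layer outputs a 256-d feature for each region. This gives four 256-d features in total (the lower part of Figure 2), and all four are used to compute the OIM loss.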
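One way to get three overlapping regions of width 3 out of a width-7 map is a sliding window with stride 2, which gives the starting offsets 0, 2 and 4. The notes do not state the exact offsets or which axis is split, so both are assumptions here, and the sketch uses a single-channel map instead of the real 2048-channel one.

```python
def split_into_stripes(fmap, width=3, stride=2):
    """Cut a 2-D map (list of rows) into overlapping stripes along the column axis.

    The window width and stride are assumed values: 3 and 2 are one choice
    that yields exactly three overlapping stripes on a width-7 map.
    """
    n_cols = len(fmap[0])
    starts = range(0, n_cols - width + 1, stride)
    return [[row[s:s + width] for row in fmap] for s in starts]


def average_pool(region):
    """Average all values of a 2-D region into a single number."""
    values = [v for row in region for v in row]
    return sum(values) / len(values)


# Toy 7x7 single-channel map with distinct values 0..48.
fmap = [[r * 7 + c for c in range(7)] for r in range(7)]

stripes = split_into_stripes(fmap)
global_feat = average_pool(fmap)                  # the global branch
part_feats = [average_pool(s) for s in stripes]   # the three part branches
print(global_feat, part_feats)
```

In the real network each pooled region would still pass through its own FC layer to become a 256-d feature; the sketch stops at the pooling step.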


3.3. Contextual Instance Expansion

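Once the four 256-d features have been obtained for each of the two images, they are combined and fed into the Relative Attention Block, which outputs 4 weights.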


The similarity between two proposals $i$ and $j$ can be computed as a weighted sum of part-wise cosine similarities:

$$s(i,j)=\sum_{r=1}^{R} w_r \cos\left(x^r_i, x^r_j\right)$$
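As a small worked example of this part-based similarity, the sketch below computes a weighted sum of per-part cosine similarities for $R=4$ parts. The 2-d toy features stand in for the 256-d ones, and the weights are made-up numbers playing the role of the Relative Attention Block output; that they sum to 1 is my assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def part_similarity(parts_i, parts_j, weights):
    """Weighted sum over parts r of cos(x_i^r, x_j^r)."""
    return sum(w * cosine(xi, xj) for w, xi, xj in zip(weights, parts_i, parts_j))


# Four parts per instance (global + three regions), toy 2-d features.
parts_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
parts_j = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
weights = [0.4, 0.1, 0.3, 0.2]  # hypothetical attention weights

score = part_similarity(parts_i, parts_j, weights)
print(round(score, 6))
```

Parts 1 and 3 match perfectly and parts 2 and 4 are orthogonal, so the score is just the sum of the first and third weights. A part that is occluded in one image can be down-weighted this way instead of dragging the whole similarity down.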

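Note that $x^r_i, x^r_j$ here are the pair of 256-d features for part $r$ out of the four parts of each image ($R=4$), and $w_r$ is the weight output above, so this gives a part-based similarity. A loss $L_{veri}$ is used to train the Relative Attention Block:

[Equation image: definition of $L_{veri}$]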

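At this point inference is already possible: feed in the query and gallery images, get the id-feat of the query person and of all proposals, compute the cosine similarity weighted by the output weights, and rank. The authors, however, add context information to further support the decision.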
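The context-free inference described here boils down to scoring every gallery proposal against the query and sorting. A minimal sketch, assuming the part features are already L2-normalized (so a dot product is the cosine similarity) and that the per-pair attention weights are given; the proposal ids and all numbers are invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(query_parts, gallery):
    """Rank gallery proposals by weighted part similarity to the query.

    gallery is a list of (proposal_id, parts, weights) tuples, where
    weights are the per-part attention weights for that query/proposal pair.
    Features are assumed to be unit-length already.
    """
    scored = []
    for proposal_id, parts, weights in gallery:
        score = sum(w * dot(q, g) for w, q, g in zip(weights, query_parts, parts))
        scored.append((score, proposal_id))
    scored.sort(reverse=True)  # highest similarity first
    return [(proposal_id, score) for score, proposal_id in scored]


query_parts = [[1.0, 0.0]] * 4     # four unit-length part features
uniform = [0.25, 0.25, 0.25, 0.25]  # hypothetical attention weights
gallery = [
    ("g1", [[0.0, 1.0]] * 4, uniform),                                  # no part matches
    ("g0", [[1.0, 0.0]] * 4, uniform),                                  # all parts match
    ("g2", [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], uniform),  # half match
]

ranking = rank_gallery(query_parts, gallery)
print([proposal_id for proposal_id, _ in ranking])
```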
