Paper notes: Dynamic Multimodal Instance Segmentation Guided by Natural Language Queries

2018-09-18 09:58:50

Paper: http://openaccess.thecvf.com/content_ECCV_2018/papers/Edgar_Margffoy-Tuay_Dynamic_Multimodal_Instance_ECCV_2018_paper.pdf

GitHub: https://github.com/BCV-Uniandes/query-objseg (PyTorch)

　　Code: https://github.com/chenxi116/TF-phrasecut-public (Tensorflow)

2. Segmentation from Natural Language Expressions　　ECCV 2016

Given a natural language expression, the task is to segment the object it refers to in the image. The proposed model is shown below:

[Figure: overview of the proposed model]

1. Visual Module (VM):

The paper uses Dual Path Network 92 (DPN92) to extract visual features.

2. Language Module (LM):

The paper uses the SRU, a recently proposed, fast recurrent architecture for sequences. The SRU is defined as:

[Equation: definition of the SRU]
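To make the recurrence concrete, here is a minimal NumPy sketch of a single SRU layer, written from my understanding of the SRU formulation (a forget gate, a reset gate, and a highway connection to the input). The function name `sru_layer`, the weight names, and the choice of tanh as the activation are assumptions for illustration, not the authors' code. The reset gate is called `reset` in the code so that it is not confused with the word representation r_t, and the highway term requires the input and hidden sizes to be equal.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(x, W, W_f, b_f, W_r, b_r):
    """Run one SRU layer over a sequence x of shape (T, d); returns h of shape (T, d)."""
    T, d = x.shape
    # All matrix products depend only on the inputs, so they can be computed
    # for every time step at once; this is what makes the SRU fast.
    x_tilde = x @ W                      # candidate values
    forget = sigmoid(x @ W_f + b_f)      # forget gate
    reset = sigmoid(x @ W_r + b_r)       # reset (highway) gate
    c = np.zeros(d)                      # internal state c_0 = 0
    h = np.zeros((T, d))
    for t in range(T):
        # Only cheap element-wise operations remain inside the time loop.
        c = forget[t] * c + (1.0 - forget[t]) * x_tilde[t]
        h[t] = reset[t] * np.tanh(c) + (1.0 - reset[t]) * x[t]
    return h
```

Unlike an LSTM, no matrix product here depends on the previous hidden state, so the heavy computation can be parallelized across time steps.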

The word embedding and the hidden state are concatenated to obtain a representation of each word in the text, denoted r_t. Based on r_t, a set of dynamic filters f_k,t is then computed, defined as:

[Equation: definition of the dynamic filters f_k,t]
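The two steps in this paragraph can be sketched as follows. This is only an illustration under assumptions: I assume each filter is a sigmoid applied to an affine function of r_t, and that each filter has one weight per channel of the feature map it is later applied to. The names `dynamic_filters`, `W`, `b`, and the sizes K and C are hypothetical.

```python
import numpy as np

def dynamic_filters(e_t, h_t, W, b):
    """Compute K dynamic filters for one word.

    e_t: word embedding, shape (d_e,)
    h_t: hidden state, shape (d_h,)
    W:   weights, shape (K, C, d_e + d_h)
    b:   biases, shape (K, C)
    Returns an array of shape (K, C): one C-channel filter per k.
    """
    r_t = np.concatenate([e_t, h_t])        # word representation r_t
    logits = W @ r_t + b                     # (K, C, D) @ (D,) -> (K, C)
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid keeps values in (0, 1)
```

Because the filters are a function of r_t, they change with every word of the query, which is what makes them "dynamic".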

In this way, given the text w, we obtain both its feature representation and the corresponding dynamic filters:

[Equation: text representation and dynamic filters]

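3. Synthesis Module (SM):

The SM is the core of the framework and fuses the information from the different modalities. As shown in Figure 5 of the paper, I_N is first concatenated with a representation of the spatial location (LOC), and the result is convolved with the dynamic filters to obtain a response map, RESP, with K channels. Next, I_N, LOC, and F_t are concatenated along the channel dimension to give a representation I'. Finally, a 1x1 convolution fuses all of this information. Each time step produces one output, M_t, which is expressed as:

[Equation: definition of M_t]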
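The fusion described above can be sketched for a single time step as follows. This is a simplified illustration, not the authors' implementation: I assume the dynamic filters are 1x1 (one weight per input channel), that F_t denotes the K-channel response map RESP, and I leave out any nonlinearity. The names `synthesis_step`, `W_mix`, and `b_mix` are hypothetical.

```python
import numpy as np

def synthesis_step(I_N, LOC, filters, W_mix, b_mix):
    """Fuse visual, spatial, and language information for one time step.

    I_N:     visual features, shape (C, H, W)
    LOC:     spatial location features, shape (L, H, W)
    filters: dynamic filters for this word, shape (K, C + L)
    W_mix:   weights of the fusing 1x1 convolution, shape (M, C + L + K)
    b_mix:   bias of the fusing 1x1 convolution, shape (M,)
    Returns M_t with shape (M, H, W).
    """
    feats = np.concatenate([I_N, LOC], axis=0)            # (C + L, H, W)
    # A 1x1 filter is a per-pixel dot product over the channel dimension.
    resp = np.einsum("kc,chw->khw", filters, feats)        # (K, H, W)
    I_prime = np.concatenate([I_N, LOC, resp], axis=0)     # (C + L + K, H, W)
    # The fusing 1x1 convolution is likewise a per-pixel linear map.
    return np.einsum("mc,chw->mhw", W_mix, I_prime) + b_mix[:, None, None]
```

Calling this once per word gives a sequence M_1, ..., M_T of fused maps, one for each time step of the query.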

Next, an mSRU is used to produce a 3D tensor.

4. Upsampling Module (UM):

Finally, the output is upsampled to produce the final segmentation map.

===== A few questions:

1. Why do the authors also feed the spatial location (LOC) information into the network?

The same operation also appears in the reference papers:

1. Segmentation from Natural Language Expressions ECCV 2016

2. Recurrent Multimodal Interaction for Referring Image Segmentation ICCV 2017

In the paper "Segmentation from Natural Language Expressions", I found the following passage, which explains why spatial location information should be used and concatenated with the image feature maps.

[Figure: excerpt from "Segmentation from Natural Language Expressions"]
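To illustrate what such a spatial location map can look like, here is a small NumPy sketch that builds per-cell coordinate features normalized to [-1, 1]. The 8-channel layout (min, max, and center coordinates in x and y, plus 1/width and 1/height) is my assumption about how LOC is built and should be checked against the papers and their code; the function name `spatial_features` is hypothetical.

```python
import numpy as np

def spatial_features(height, width):
    """Return an (8, height, width) array of location features."""
    # Left/top and right/bottom edges of every cell, rescaled to [-1, 1].
    x_min = np.arange(width) / width * 2 - 1
    x_max = (np.arange(width) + 1) / width * 2 - 1
    y_min = np.arange(height) / height * 2 - 1
    y_max = (np.arange(height) + 1) / height * 2 - 1
    x_ctr = (x_min + x_max) / 2
    y_ctr = (y_min + y_max) / 2
    ones = np.ones((height, width))

    def along_x(v):
        return v[None, :] * ones   # value depends on the column only

    def along_y(v):
        return v[:, None] * ones   # value depends on the row only

    return np.stack([
        along_x(x_min), along_y(y_min),
        along_x(x_max), along_y(y_max),
        along_x(x_ctr), along_y(y_ctr),
        ones / width, ones / height,
    ])
```

Convolutional features are largely translation invariant, so without such channels the network has no direct way to ground words like "left" or "top"; concatenating the coordinates gives every pixel access to its own position.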

2. Running the code successfully.

 wangxiao@AHU:/DMS$ python3 -u -m dmn_pytorch.train --backend dpn92 --num-filters 10 --lang-layers 3 --mix-we --accum-iters 1

 /usr/local/lib/python3.6/site-packages/torch/utils/cpp_extension.py:118: UserWarning: 

                                !! WARNING !!

 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

 Your compiler (c++) may be ABI-incompatible with PyTorch!

 Please use a compiler that is ABI-compatible with GCC 4.9 and above.

 See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

 See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6

 for instructions on how to install GCC 4.9 or higher.

 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                               !! WARNING !!

   warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))

 Argument list to program

 --data /DMS/referit_data

 --split_root /DMS/referit_data/referit/splits/referit

 --save_folder weights/

 --snapshot weights/qseg_weights.pth

 --num_workers 2

 --dataset unc

 --split train

 --val None

 --eval_first False

 --workers 4

 --no_cuda False

 --log_interval 200

 --backup_iters 10000

 --batch_size 1

 --epochs 40

 --lr 1e-05

 --patience 2

 --seed 1111

 --iou_loss False

 --start_epoch 1

 --optim_snapshot weights/qsegnet_optim.pth

 --accum_iters 1

 --pin_memory False

 --size 512

 --time -1

 --emb_size 1000

 --hid_size 1000

 --vis_size 2688

 --num_filters 10

 --mixed_size 1000

 --hid_mixed_size 1005

 --lang_layers 3

 --mixed_layers 3

 --backend dpn92

 --mix_we True

 --lstm False

 --high_res False

 --upsamp_mode bilinear

 --upsamp_size 3

 --upsamp_amplification 32

 --dmn_freeze False

 --visdom None

 --env DMN-train

 Processing unc: train set

 loading dataset refcoco into memory...

 creating index...

 index created.

 DONE (t=5.78s)

 Saving dataset corpus dictionary...

 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 42404/42404 [10:18<00:00, 68.56it/s]

 Processing unc: val set

 loading dataset refcoco into memory...

 creating index...

 index created.

 DONE (t=21.52s)

 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3811/3811 [00:53<00:00, 71.45it/s]

 Processing unc: trainval set

 loading dataset refcoco into memory...

 creating index...

 index created.

 DONE (t=4.97s)

 0it [00:00, ?it/s]

 Processing unc: testA set

 loading dataset refcoco into memory...

 creating index...

 index created.

 DONE (t=5.24s)

 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1975/1975 [00:27<00:00, 72.62it/s]

 Processing unc: testB set

 loading dataset refcoco into memory...

 creating index...

 index created.

 DONE (t=5.06s)

 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1810/1810 [00:31<00:00, 57.91it/s]

 Train begins...

 [ 1] ( 0/120624) | ms/batch 456.690311 | loss 3.530792 | lr 0.0000100

 [ 1] ( 200/120624) | ms/batch 273.972313 | loss 1.487153 | lr 0.0000100

 [ 1] ( 400/120624) | ms/batch 257.813077 | loss 1.036689 | lr 0.0000100

 [ 1] ( 600/120624) | ms/batch 251.565860 | loss 1.047311 | lr 0.0000100

 [ 1] ( 800/120624) | ms/batch 249.070073 | loss 1.657688 | lr 0.0000100

 [ 1] ( 1000/120624) | ms/batch 246.906650 | loss 1.815347 | lr 0.0000100

 [ 1] ( 1200/120624) | ms/batch 245.645234 | loss 2.601908 | lr 0.0000100

 [ 1] ( 1400/120624) | ms/batch 245.039105 | loss 1.495383 | lr 0.0000100

 [ 1] ( 1600/120624) | ms/batch 244.460579 | loss 1.441855 | lr 0.0000100
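To keep an eye on training, lines in the format above can be parsed with a few lines of standard-library Python. This is a sketch that assumes the log format stays exactly as printed above; the names `parse_log_line` and `hours_per_epoch` are my own helpers, not part of the repository.

```python
import re

# Matches lines such as:
# [ 1] ( 1600/120624) | ms/batch 244.460579 | loss 1.441855 | lr 0.0000100
LOG_LINE = re.compile(
    r"\[\s*(\d+)\]\s*\(\s*(\d+)/(\d+)\)\s*\|\s*ms/batch\s+([\d.]+)"
    r"\s*\|\s*loss\s+([\d.]+)\s*\|\s*lr\s+([\d.eE+-]+)"
)

def parse_log_line(line):
    """Return a dict for one training log line, or None if it does not match."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    epoch, it, total, ms, loss, lr = m.groups()
    return {
        "epoch": int(epoch), "iter": int(it), "total": int(total),
        "ms_per_batch": float(ms), "loss": float(loss), "lr": float(lr),
    }

def hours_per_epoch(record):
    """Estimate wall-clock hours for one epoch from a parsed record."""
    return record["total"] * record["ms_per_batch"] / 1000.0 / 3600.0
```

At the last logged speed (about 244 ms per batch over 120624 iterations), this estimate comes out at roughly 8 hours per epoch.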