Vision and Language Both Flourish! Google Proposes a New Bridge Between Vision and Language (Part 2)

Controllable Image Captioning

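The mouse-trace signal focuses the model's attention on specific regions of the image. In the figure, (a) shows standard image captioning, while (b) and (c) both show controllable image captioning based on localized narratives. Panels (b) and (c) differ only in their mouse traces: the order in which the mouse moves across the image changes, and the generated text description changes with it.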
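To make the idea of trace order concrete, here is a minimal sketch in Python. It is not the captioning model described in the article; it only illustrates, under simplifying assumptions, how two traces over the same image can visit the same regions in a different order. The region names, box coordinates, and the `visit_order` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]                # (x, y) in normalized image coordinates
Box = Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)


def visit_order(trace: List[Point], regions: Dict[str, Box]) -> List[str]:
    """Return region names in the order the trace first enters them."""
    order: List[str] = []
    for x, y in trace:
        for name, (x0, y0, x1, y1) in regions.items():
            inside = x0 <= x <= x1 and y0 <= y <= y1
            if inside and name not in order:
                order.append(name)
    return order


# Hypothetical regions and traces, made up for illustration only.
regions = {
    "dog": (0.0, 0.5, 0.4, 1.0),
    "frisbee": (0.6, 0.0, 1.0, 0.4),
}
trace_b = [(0.1, 0.8), (0.3, 0.6), (0.8, 0.2)]   # starts on the dog
trace_c = [(0.9, 0.1), (0.7, 0.3), (0.2, 0.9)]   # starts on the frisbee

order_b = visit_order(trace_b, regions)
order_c = visit_order(trace_c, regions)
print(order_b)
print(order_c)
```

A captioning model conditioned on the trace would be expected to mention content roughly in the order the trace visits it, which is why the two traces lead to different descriptions.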
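Ablation Study
Summary
This paper proposes a new way to annotate image descriptions in which every word receives a fairly accurate visual grounding through the mouse trace. The annotations even model and represent some of the relationships between objects.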
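The per-word grounding described above can be pictured as a simple data structure. The sketch below assumes that each word carries a start and end time and that each trace point carries a timestamp; a word is then grounded on the trace points recorded while it was spoken. The class names, field names, and the `ground_words` helper are hypothetical, and this section does not describe the alignment procedure actually used.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TracePoint:
    x: float
    y: float
    t: float  # seconds since the start of the recording


@dataclass
class TimedWord:
    text: str
    start: float  # seconds
    end: float    # seconds


def ground_words(
    words: List[TimedWord], trace: List[TracePoint]
) -> List[Tuple[str, List[TracePoint]]]:
    """Pair each word with the trace points recorded during its time span."""
    grounded = []
    for w in words:
        segment = [p for p in trace if w.start <= p.t <= w.end]
        grounded.append((w.text, segment))
    return grounded


# Hypothetical data, made up for illustration only.
trace = [
    TracePoint(0.1, 0.8, 0.0),
    TracePoint(0.3, 0.6, 0.5),
    TracePoint(0.7, 0.3, 1.0),
    TracePoint(0.8, 0.2, 1.5),
]
words = [TimedWord("dog", 0.0, 0.6), TimedWord("frisbee", 0.9, 1.6)]

grounded = ground_words(words, trace)
for text, segment in grounded:
    print(text, [(p.x, p.y) for p in segment])
```

Keeping the raw trace segment per word, rather than a single bounding box, preserves the order and shape of the pointing gesture.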