Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation. (arXiv:2305.04474v3 [cs.CV] UPDATED)

By: Chaoya Jiang, Wei Ye, Haiyang Xu, Ming Yan, Shikun Zhang, Jie Zhang, Fei Huang. Posted: June 23, 2023

Cross-modal contrastive learning in vision-language pre-training (VLP) faces
the challenge of (partial) false negatives. In this paper, we study this
problem from the perspective of Mutual Information (MI) optimization. It is
well known that the InfoNCE loss used in contrastive learning maximizes a
lower bound of the MI between anchors and their positives, while we
theoretically prove that the MI involving negatives also matters when such
noise is common. Guided by a more general lower-bound form for optimization,
we propose a contrastive learning strategy regulated by progressively refined
cross-modal similarity, which more accurately optimizes the MI between an
image/text anchor and its negative texts/images instead of improperly
minimizing it. Our method performs competitively on four downstream
cross-modal tasks and systematically balances the beneficial and harmful
effects of (partial) false negative samples under theoretical guidance.
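
To make the core idea concrete, here is a minimal PyTorch sketch of an InfoNCE-style image-text contrastive loss in which negative terms are downweighted by an estimated cross-modal similarity, so that likely (partial) false negatives are not pushed away with full strength. The function name, the `sim_estimates` input, the `(1 - similarity)` weighting, and the temperature value are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def regulated_infonce(image_emb, text_emb, sim_estimates=None, temperature=0.07):
    """InfoNCE-style image-text contrastive loss with similarity-regulated negatives.

    image_emb, text_emb: (B, D) embeddings of B paired images and texts.
    sim_estimates: optional (B, B) matrix of estimated cross-modal similarity
        in [0, 1] between every image and every text (e.g. produced by a
        separate, progressively refined scorer). This input and the weighting
        below are assumptions of this sketch, not the paper's method.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarity logits
    labels = torch.arange(logits.size(0), device=logits.device)

    if sim_estimates is None:
        # Plain InfoNCE: every off-diagonal pair is treated as a true negative.
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    # Downweight negatives that the similarity estimates flag as (partial)
    # false negatives; the estimates only shape the loss, so detach them.
    neg_weight = (1.0 - sim_estimates).detach()                # (B, B) in [0, 1]
    neg_weight.fill_diagonal_(1.0)                             # keep positives at full weight

    # Weighted softmax denominator: add log-weights before logsumexp.
    log_prob_i2t = logits - torch.logsumexp(
        logits + torch.log(neg_weight + 1e-6), dim=1, keepdim=True)
    log_prob_t2i = logits.t() - torch.logsumexp(
        logits.t() + torch.log(neg_weight.t() + 1e-6), dim=1, keepdim=True)

    loss_i2t = -log_prob_i2t.diagonal().mean()
    loss_t2i = -log_prob_t2i.diagonal().mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Weighting the denominator rather than discarding suspected false negatives keeps those pairs contributing in proportion to how dissimilar they are estimated to be, which is one simple way to interpret "regulating" the MI involving negatives instead of minimizing it outright.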


DoctorMorDi

Moderator and Editor