CVPR 2015 Open Access Repository
From Captions to Visual Concepts and Back

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1473-1482
Abstract

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.

Related Material

[pdf] http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Fang_From_Captions_to_2015_CVPR_paper.pdf

[bibtex]
@InProceedings{Fang_2015_CVPR,
  author = {Fang, Hao and Gupta, Saurabh and Iandola, Forrest and Srivastava, Rupesh K. and Deng, Li and Dollar, Piotr and Gao, Jianfeng and He, Xiaodong and Mitchell, Margaret and Platt, John C. and Lawrence Zitnick, C. and Zweig, Geoffrey},
  title = {From Captions to Visual Concepts and Back},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2015}
}
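The first stage the abstract describes is weakly supervised: each training image is a "bag" of candidate regions, labeled positive for a word only at the image level (the word appears in one of its captions), and per-region scores are combined with a noisy-OR so that the image contains the word if any region does. Below is a minimal numpy sketch of that multiple-instance-learning objective. It is illustrative only: the logistic region scorer, toy features, dimensions, and training loop are assumptions standing in for the paper's CNN-based detectors.

```python
# A sketch of noisy-OR multiple instance learning for word detection.
# Everything below (feature dimensions, planted concept, learning rate)
# is made up for the demo; the paper itself scores regions with a CNN.
import numpy as np

rng = np.random.default_rng(0)

def noisy_or(region_probs):
    # An image contains the word if ANY of its regions does:
    # P(word | image) = 1 - prod_j (1 - P(word | region_j))
    return 1.0 - np.prod(1.0 - region_probs)

def loss_and_grad(w, bags, labels):
    # Bag-level negative log-likelihood and its exact gradient.
    # For a logistic region scorer s_j = sigmoid(x_j . w), the gradient of
    # the noisy-OR cross-entropy reduces to ((p - y) / p) * sum_j s_j x_j.
    eps, total, grad = 1e-12, 0.0, np.zeros_like(w)
    for feats, y in zip(bags, labels):
        s = 1.0 / (1.0 + np.exp(-feats @ w))      # per-region word probabilities
        p = np.clip(noisy_or(s), eps, 1.0 - eps)  # image-level probability
        total -= y * np.log(p) + (1.0 - y) * np.log(1.0 - p)
        grad += ((p - y) / p) * (s @ feats)
    return total / len(bags), grad / len(bags)

# Toy data: 60 images, each a bag of 10 region feature vectors (8 dims plus
# a constant bias feature). Positive bags hide one region that expresses the
# "word"; no region-level labels are ever given.
D, J = 8, 10
concept = np.zeros(D)
concept[0] = 1.0
bags, labels = [], []
for i in range(60):
    feats = rng.normal(size=(J, D))
    y = float(i % 2 == 0)
    if y:
        feats[rng.integers(J)] += 4.0 * concept
    bags.append(np.hstack([feats, np.ones((J, 1))]))
    labels.append(y)
labels = np.asarray(labels)

# Plain gradient descent on the bag-level objective: the detector learns
# which regions carry the word from image-level supervision alone.
w = np.zeros(D + 1)
for step in range(500):
    loss, grad = loss_and_grad(w, bags, labels)
    w -= 0.2 * grad
print(f"final bag-level NLL: {loss:.3f}")
```

In the full pipeline the abstract outlines, the detected words then condition the maximum-entropy language model during generation, and candidate captions are re-ranked by sentence-level features and the deep multimodal similarity model.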