<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model</title> <!--Generated on Thu Oct 17 03:24:45 2024 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content="Machine Learning, ICML" lang="en" name="keywords"/> <base href="/html/2406.18572v2/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S1" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Related work</span></a> <ol 
class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2.SS1" title="In 2 Related work ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.1 </span>Street Views</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2.SS2" title="In 2 Related work ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.2 </span>Image-based Geo-localization</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2.SS3" title="In 2 Related work ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.3 </span>Vision-Language Models</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>GeoReasoner</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.SS1" title="In 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>Locatability-Enhanced Data Curation</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" 
href="https://arxiv.org/html/2406.18572v2#S3.SS2" title="In 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>Geo-localization with Reasoning</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Experiments</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1" title="In 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Experiments on Locatability-Enhanced Dataset</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1.SSS1" title="In 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1.1 </span>Qualitative Comparison</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1.SSS2" title="In 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1.2 </span>Quantitative Comparison</span></a></li> </ol> </li> <li class="ltx_tocentry 
ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2" title="In 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2 </span>Experiments on Geo-localization with Reasoning</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS1" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2.1 </span>Qualitative Comparison with SOTA</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS2" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2.2 </span>Quantitative Comparison with SOTA</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS3" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2.3 </span>Ablation Experiments</span></a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS4" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language 
Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.2.4 </span>Generalizability Evaluation</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S5" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Discussion</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S6" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Conclusion</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A1" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A </span>Implementation Details</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A2" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B </span>Additional Qualitative Results</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_pruned_first"> <h1 class="ltx_title ltx_title_document">GeoReasoner: Geo-localization with Reasoning in Street Views <br class="ltx_break"/>using a Large Vision-Language Model</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Ling Li </span></span> <span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author"> <span 
class="ltx_personname">Yu Ye </span></span> <span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Bingchuan Jiang </span></span> <span class="ltx_author_before"> </span><span class="ltx_creator ltx_role_author"> <span class="ltx_personname">Wei Zeng </span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id1.id1">This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train <em class="ltx_emph ltx_font_italic" id="id1.id1.1">GeoReasoner</em>, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that <em class="ltx_emph ltx_font_italic" id="id1.id1.2">GeoReasoner</em> outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. 
The data and code are available at <a class="ltx_ref ltx_href" href="https://github.com/lingli1996/GeoReasoner" title="">https://github.com/lingli1996/GeoReasoner</a>.</p> </div> <div class="ltx_keywords">Machine Learning, ICML </div> <div class="ltx_para" id="p2"> <br class="ltx_break"/> </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2> <div class="ltx_para" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Street-view geo-localization seeks to predict geographical locations for the given street-view images. The significance of street-view geo-localization is evident in a variety of applications, spanning social studies <cite class="ltx_cite ltx_citemacro_citep">(Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib36" title="">2019b</a>)</cite>, urban planning <cite class="ltx_cite ltx_citemacro_citep">(Shen et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib27" title="">2018</a>)</cite>, and navigation <cite class="ltx_cite ltx_citemacro_citep">(Chalvatzaras et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib5" title="">2022</a>)</cite>. As shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S1.F1" title="Figure 1 ‣ 1 Introduction ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">1</span></a> (left), existing frameworks for street-view geo-localization can be mainly divided into two categories: <em class="ltx_emph ltx_font_italic" id="S1.p1.1.1">retrieval-based</em> and <em class="ltx_emph ltx_font_italic" id="S1.p1.1.2">classification-based</em>. 
<em class="ltx_emph ltx_font_italic" id="S1.p1.1.3">Retrieval-based</em> approaches entail identifying the most similar image within a geo-tagged image gallery and returning the corresponding geographical location <cite class="ltx_cite ltx_citemacro_citep">(Zhu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib41" title="">2022</a>; Lin et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib16" title="">2022</a>; Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib40" title="">2023b</a>)</cite>. However, the methods rely on the diversity and comprehensiveness of the geo-tagged image gallery, which can be challenging to curate. Alternatively, <em class="ltx_emph ltx_font_italic" id="S1.p1.1.4">classification-based</em> approaches partition the Earth’s surface into distinct regions and assign the input image to a specific region <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>; Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>; Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>; Seo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>; Weyand et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>. While these methods leverage shared visual features within a single region, they may neglect valuable semantic information (<em class="ltx_emph ltx_font_italic" id="S1.p1.1.5">e.g.</em>, signboard texts) crucial for geo-localization. 
More importantly, these classification methods often operate as black-box models that offer no interpretable reasoning to users.</p> </div> <figure class="ltx_figure" id="S1.F1"> <p class="ltx_p ltx_align_center ltx_align_center" id="S1.F1.1.1"><span class="ltx_text" id="S1.F1.1.1.1"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="439" id="S1.F1.1.1.1.g1" src="x1.png" width="822"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Paradigms of existing and proposed geo-localization approaches: retrieval-based (left-top), classification-based (left-bottom), and our LVLM-based (right).</figcaption> </figure> <div class="ltx_para" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">Achieving street-view geo-localization with reasoning capability poses a considerable challenge. This study introduces a new paradigm that facilitates geo-localization with reasoning capability for street-view images, as depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S1.F1" title="Figure 1 ‣ 1 Introduction ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">1</span></a> (right). The paradigm leverages an LVLM for its excellent capability in handling multi-modal visual and textual inputs and incorporates external knowledge learned from various online games for the reasoning procedure. Specifically, we introduce the concept of <em class="ltx_emph ltx_font_italic" id="S1.p2.1.1">locatability</em> as a metric that quantifies the degree to which a street-view image is locatable. On this basis, we devise a CLIP-based visual-text pairing network to match large-scale Google Street View (GSV) images with 3K finely reasoned text-image pairs from online games, addressing the absence of a high-quality street-view dataset. 
The process yields over 70K geo-tagged GSV images, all of which exhibit a high degree of locatability.</p> </div> <div class="ltx_para" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">Next, we construct an LVLM, named <em class="ltx_emph ltx_font_italic" id="S1.p3.1.1">GeoReasoner</em>, to overcome the difficulty of integrating reasoning capability in geo-localization. The training procedure of <em class="ltx_emph ltx_font_italic" id="S1.p3.1.2">GeoReasoner</em> is divided into two stages: reasoning tuning and location tuning. In the first stage, we utilize the 3K reasoned text-image pairs, which encapsulate human inference knowledge, to fine-tune a well-trained LVLM with LoRA <cite class="ltx_cite ltx_citemacro_citep">(Hu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib13" title="">2022</a>)</cite> for reasoning adaptation. In the second stage, we leverage the curated dataset of 70K high-locatability GSV images to further fine-tune the LVLM with another LoRA stacked on the first one for location tuning. We assess <em class="ltx_emph ltx_font_italic" id="S1.p3.1.3">GeoReasoner</em> in terms of accuracy for both country-level (<em class="ltx_emph ltx_font_italic" id="S1.p3.1.4">i.e.</em>, predicting the country in which a street view is located) and city-level (<em class="ltx_emph ltx_font_italic" id="S1.p3.1.5">i.e.</em>, predicting the city in which a street view is located) geo-localization. The results demonstrate that <em class="ltx_emph ltx_font_italic" id="S1.p3.1.6">GeoReasoner</em> outperforms counterpart LVLMs by more than <em class="ltx_emph ltx_font_italic" id="S1.p3.1.7">25%</em> at country-level geo-localization and <em class="ltx_emph ltx_font_italic" id="S1.p3.1.8">38%</em> at city-level geo-localization with reasoning on our test dataset. 
Notably, <em class="ltx_emph ltx_font_italic" id="S1.p3.1.9">GeoReasoner</em> performs slightly better than StreetCLIP <cite class="ltx_cite ltx_citemacro_citep">(Haas et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib11" title="">2023</a>)</cite>, which was trained on a substantially larger dataset of 1.1 million geo-tagged street-view images. We also evaluate <em class="ltx_emph ltx_font_italic" id="S1.p3.1.10">GeoReasoner</em> against state-of-the-art models for geo-localization using open benchmark datasets. The results show that <em class="ltx_emph ltx_font_italic" id="S1.p3.1.11">GeoReasoner</em> achieves comparable performance with only 10k Flickr images used for training. The main contributions of our work are:</p> <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i1.p1"> <p class="ltx_p" id="S1.I1.i1.p1.1">We present a new paradigm that leverages an LVLM and external knowledge of human inference for geo-localization with reasoning from street-view images.</p> </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i2.p1"> <p class="ltx_p" id="S1.I1.i2.p1.1">We introduce the concept of <em class="ltx_emph ltx_font_italic" id="S1.I1.i2.p1.1.1">locatability</em> and devise a CLIP-based network to quantify the degree of locatability in street-view images.</p> </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S1.I1.i3.p1"> <p class="ltx_p" id="S1.I1.i3.p1.1">We propose <em class="ltx_emph ltx_font_italic" id="S1.I1.i3.p1.1.1">GeoReasoner</em>, an LVLM that outperforms existing geo-localization models and provides detailed reasoning for the inferred results.</p> </div> </li> </ul> </div> </section> <section 
class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2 </span>Related work</h2> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.1 </span>Street Views</h3> <div class="ltx_para" id="S2.SS1.p1"> <p class="ltx_p" id="S2.SS1.p1.1">Street views, as the realm of physical environments routinely accessed and engaged with in daily life, bear significant relevance to human perception <cite class="ltx_cite ltx_citemacro_citep">(Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib36" title="">2019b</a>)</cite> and urban design <cite class="ltx_cite ltx_citemacro_citep">(Shen et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib27" title="">2018</a>)</cite>. Analyses of street views contribute to decision-making support <cite class="ltx_cite ltx_citemacro_citep">(Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib35" title="">2019a</a>)</cite>, improved understanding of urban social and economic structures <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib3" title="">2023b</a>)</cite>, and traffic asset monitoring and maintenance <cite class="ltx_cite ltx_citemacro_citep">(Campbell et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib4" title="">2019</a>; Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib15" title="">2021</a>)</cite>. This study places an emphasis on geo-localization based on street views. Specifically, drawing motivation from <cite class="ltx_cite ltx_citemacro_citet">Zhang et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib38" title="">2018</a>)</cite>, we delineate the distribution of scene elements to quantify the degree of <em class="ltx_emph ltx_font_italic" id="S2.SS1.p1.1.1">locatability</em> in street views. Highly locatable street-view images are curated to train an LVLM that surpasses existing geo-localization models. </p> </div> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.2 </span>Image-based Geo-localization</h3> <div class="ltx_para" id="S2.SS2.p1"> <p class="ltx_p" id="S2.SS2.p1.1">Geo-localization entails determining spatial coordinates on the Earth’s surface, with broad applications in practical scenarios, including tracking individual trajectories <cite class="ltx_cite ltx_citemacro_citep">(Cheng et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib7" title="">2022</a>)</cite> and positioning autonomous vehicles <cite class="ltx_cite ltx_citemacro_citep">(Chalvatzaras et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib5" title="">2022</a>)</cite>. This study focuses on image-based geo-localization, utilizing image data as input. 
Research on image-based geo-localization can be primarily classified into two approaches: retrieval-based <cite class="ltx_cite ltx_citemacro_citep">(Zhu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib41" title="">2022</a>; Lin et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib16" title="">2022</a>; Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib40" title="">2023b</a>)</cite> and classification-based <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>; Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>; Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>; Seo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>; Weyand et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>.</p> </div> <div class="ltx_para" id="S2.SS2.p2"> <p class="ltx_p" id="S2.SS2.p2.1">The retrieval-based approach matches a single image sequentially against a gallery of overhead views, each labeled with geographical coordinates, and takes the best-matching result as the location. However, this method is of limited use because it requires additional reference datasets. The classification-based approach, exemplified by <cite class="ltx_cite ltx_citemacro_citet">Weyand et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>, involves subdividing the Earth’s surface into thousands of geographical cells and predicting the geographical unit to which an image belongs. 
The prediction effectiveness can be boosted with a dataset comprising millions of street views, whilst the granularity is determined by the number of subdivided geographical cells. As such, many studies have been devoted to learning multi-level features at different granularities <cite class="ltx_cite ltx_citemacro_citep">(Vo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib30" title="">2017</a>)</cite>, or multi-pair features for different tasks <cite class="ltx_cite ltx_citemacro_citep">(Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>; Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>; Vivanco Cepeda et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib29" title="">2024</a>)</cite>.</p> </div> <div class="ltx_para" id="S2.SS2.p3"> <p class="ltx_p" id="S2.SS2.p3.1">We approach image-based geo-localization with a novel paradigm. Specifically, we integrate semantic visual concepts that offer locatable features <cite class="ltx_cite ltx_citemacro_citep">(Luo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib18" title="">2022</a>; Theiner et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib28" title="">2022</a>)</cite>, and incorporate human reasoning knowledge learned from geo-localization games using an LVLM.</p> </div> </section> <section class="ltx_subsection" id="S2.SS3"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">2.3 </span>Vision-Language Models</h3> <div class="ltx_para" id="S2.SS3.p1"> <p class="ltx_p" id="S2.SS3.p1.1">The emergence of Large Language Models (LLMs) has significantly impacted various tasks related to natural language processing. 
These models exhibit remarkable performance in tasks such as text generation <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib39" title="">2023a</a>)</cite> and text-based question answering <cite class="ltx_cite ltx_citemacro_citep">(Shao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib26" title="">2023</a>)</cite>, owing to their robust and versatile capabilities. As a result, research attention has shifted towards exploring prompt engineering techniques to enhance the performance of LLMs in downstream tasks <cite class="ltx_cite ltx_citemacro_citep">(Wei et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib31" title="">2022</a>; Yao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib34" title="">2024</a>; Dai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib9" title="">2023</a>; Xu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib33" title="">2023</a>; Ying et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib37" title="">2024</a>)</cite>.</p> </div> <div class="ltx_para" id="S2.SS3.p2"> <p class="ltx_p" id="S2.SS3.p2.1">Large vision-language models (LVLMs) integrate visual encoders with LLMs, exhibiting remarkable effectiveness in visual question-answering tasks <cite class="ltx_cite ltx_citemacro_citep">(Liu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib17" title="">2024</a>; Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>; Rao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib23" title="">2023</a>)</cite>. This study harnesses the capabilities of LVLMs to address geo-localization of street views. 
However, the optimal utilization of LVLMs remains a challenging issue, particularly due to the absence of high-quality training data and a lack of reasoning capabilities. We overcome these challenges through an innovative paradigm and the thoughtful design of model architecture, contributing to a more effective utilization of LVLMs in this domain.</p> </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3 </span>GeoReasoner</h2> <div class="ltx_para" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">This section outlines our approach to addressing two challenges: 1) the absence of a high-quality street-view geo-localization dataset (discussed in Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.SS1" title="3.1 Locatability-Enhanced Data Curation ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">3.1</span></a>), and 2) the difficulty of integrating reasoning in geo-localization (discussed in Sect. 
<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.SS2" title="3.2 Geo-localization with Reasoning ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">3.2</span></a>), when constructing <em class="ltx_emph ltx_font_italic" id="S3.p1.1.1">GeoReasoner</em>.</p> </div> <figure class="ltx_figure" id="S3.F2"> <p class="ltx_p ltx_align_center ltx_align_center" id="S3.F2.1"><span class="ltx_text" id="S3.F2.1.1"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="394" id="S3.F2.1.1.g1" src="x2.png" width="822"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>The locatability quantization network devises a CLIP-based visual-text pairing approach to predict the locatability metric.</figcaption> </figure> <figure class="ltx_figure" id="S3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="260" id="S3.F3.g1" src="x3.png" width="822"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>The architecture of <em class="ltx_emph ltx_font_italic" id="S3.F3.5.1">GeoReasoner</em> consists of three modules: <em class="ltx_emph ltx_font_italic" id="S3.F3.6.2">Vision Encoder</em>, <em class="ltx_emph ltx_font_italic" id="S3.F3.7.3">VL Adapter</em> and <em class="ltx_emph ltx_font_italic" id="S3.F3.8.4">Pre-trained LLM</em>. The model undergoes a two-fold supervised fine-tuning process: reasoning tuning and location tuning, to enable geo-localization with reasoning. 
</figcaption> </figure> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1 </span>Locatability-Enhanced Data Curation</h3> <div class="ltx_para" id="S3.SS1.p1"> <p class="ltx_p" id="S3.SS1.p1.1">While developing this work, we observed that street views vary considerably in their degree of locatability. For example, images featuring textual signboards or prominent landmarks (<em class="ltx_emph ltx_font_italic" id="S3.SS1.p1.1.1">e.g.</em>, the Eiffel Tower) are easily locatable, whereas those captured in a tunnel or obscured by a wall tend to be less locatable. Refer to Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F4" title="Figure 4 ‣ 3.2 Geo-localization with Reasoning ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4</span></a> for further illustration. Simply pooling all of these street-view images to train an LVLM is suboptimal, because poor-quality data can reduce the efficiency of training an LVLM <cite class="ltx_cite ltx_citemacro_citep">(Radford et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib22" title="">2021</a>)</cite>. To this end, we introduce <em class="ltx_emph ltx_font_italic" id="S3.SS1.p1.1.2">locatability</em>, a metric that quantifies how locatable a street-view image is. We then devise a CLIP-based visual-text pairing network to produce the locatability metric for an input street-view image, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F2" title="Figure 2 ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">2</span></a>. 
The network draws on two sources of data:</p> </div> <div class="ltx_para" id="S3.SS1.p2"> <ul class="ltx_itemize" id="S3.I1"> <li class="ltx_item" id="S3.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i1.p1"> <p class="ltx_p" id="S3.I1.i1.p1.3"><span class="ltx_text ltx_font_bold" id="S3.I1.i1.p1.3.1">Street-View Images.</span> We collected street-view images from Google Street View<span class="ltx_note ltx_role_footnote" id="footnote1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_tag ltx_tag_note">1</span>https://www.google.com/streetview</span></span></span> (GSV). To diversify the dataset, we first selected the top global cities according to the Globalization and World Cities Study Group and Network (GaWC) ranking. Next, we used the global OpenStreetMap<span class="ltx_note ltx_role_footnote" id="footnote2"><sup class="ltx_note_mark">2</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">2</sup><span class="ltx_tag ltx_tag_note">2</span>https://www.openstreetmap.org</span></span></span> (OSM) geographic database to obtain vector data for the road networks of these urban areas. The road networks were passed to ArcPy, a Python site package of ArcGIS, to automatically extract sampling points at 4000-meter intervals and to generate a CSV table describing these sampling points. We then used the GSV API to retrieve, for each sampling point, street-view images captured in four directions: front, back, left, and right. 
Considering the impact of data sparsity and image similarity, we randomly selected two of the four views from each data point, denoted as [<math alttext="\textbf{I}_{x},\textbf{I}_{y}" class="ltx_Math" display="inline" id="S3.I1.i1.p1.1.m1.2"><semantics id="S3.I1.i1.p1.1.m1.2a"><mrow id="S3.I1.i1.p1.1.m1.2.2.2" xref="S3.I1.i1.p1.1.m1.2.2.3.cmml"><msub id="S3.I1.i1.p1.1.m1.1.1.1.1" xref="S3.I1.i1.p1.1.m1.1.1.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.1.1.1.1.2" xref="S3.I1.i1.p1.1.m1.1.1.1.1.2a.cmml">I</mtext><mi id="S3.I1.i1.p1.1.m1.1.1.1.1.3" xref="S3.I1.i1.p1.1.m1.1.1.1.1.3.cmml">x</mi></msub><mo id="S3.I1.i1.p1.1.m1.2.2.2.3" xref="S3.I1.i1.p1.1.m1.2.2.3.cmml">,</mo><msub id="S3.I1.i1.p1.1.m1.2.2.2.2" xref="S3.I1.i1.p1.1.m1.2.2.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.2.2.2.2.2" xref="S3.I1.i1.p1.1.m1.2.2.2.2.2a.cmml">I</mtext><mi id="S3.I1.i1.p1.1.m1.2.2.2.2.3" xref="S3.I1.i1.p1.1.m1.2.2.2.2.3.cmml">y</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.I1.i1.p1.1.m1.2b"><list id="S3.I1.i1.p1.1.m1.2.2.3.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2"><apply id="S3.I1.i1.p1.1.m1.1.1.1.1.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.I1.i1.p1.1.m1.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1">subscript</csymbol><ci id="S3.I1.i1.p1.1.m1.1.1.1.1.2a.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.1.1.1.1.2.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1.2">I</mtext></ci><ci id="S3.I1.i1.p1.1.m1.1.1.1.1.3.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1.3">𝑥</ci></apply><apply id="S3.I1.i1.p1.1.m1.2.2.2.2.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.I1.i1.p1.1.m1.2.2.2.2.1.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2">subscript</csymbol><ci id="S3.I1.i1.p1.1.m1.2.2.2.2.2a.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2.2">I</mtext></ci><ci 
id="S3.I1.i1.p1.1.m1.2.2.2.2.3.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2.3">𝑦</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.I1.i1.p1.1.m1.2c">\textbf{I}_{x},\textbf{I}_{y}</annotation><annotation encoding="application/x-llamapun" id="S3.I1.i1.p1.1.m1.2d">I start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , I start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT</annotation></semantics></math>], where <math alttext="x\in(left,right)" class="ltx_Math" display="inline" id="S3.I1.i1.p1.2.m2.2"><semantics id="S3.I1.i1.p1.2.m2.2a"><mrow id="S3.I1.i1.p1.2.m2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.cmml"><mi id="S3.I1.i1.p1.2.m2.2.2.4" xref="S3.I1.i1.p1.2.m2.2.2.4.cmml">x</mi><mo id="S3.I1.i1.p1.2.m2.2.2.3" xref="S3.I1.i1.p1.2.m2.2.2.3.cmml">∈</mo><mrow id="S3.I1.i1.p1.2.m2.2.2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml"><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.3" stretchy="false" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml">(</mo><mrow id="S3.I1.i1.p1.2.m2.1.1.1.1.1" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.cmml"><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.2" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.2.cmml">l</mi><mo id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml"></mo><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.3" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.3.cmml">e</mi><mo id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1a" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml"></mo><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.4" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.4.cmml">f</mi><mo id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1b" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml"></mo><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.5" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.5.cmml">t</mi></mrow><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.4" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml">,</mo><mrow id="S3.I1.i1.p1.2.m2.2.2.2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.cmml"><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.2.cmml">r</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml"></mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.3" 
xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.3.cmml">i</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1a" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml"></mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.4" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.4.cmml">g</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1b" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml"></mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.5" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.5.cmml">h</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1c" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml"></mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.6" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.6.cmml">t</mi></mrow><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.5" stretchy="false" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.I1.i1.p1.2.m2.2b"><apply id="S3.I1.i1.p1.2.m2.2.2.cmml" xref="S3.I1.i1.p1.2.m2.2.2"><in id="S3.I1.i1.p1.2.m2.2.2.3.cmml" xref="S3.I1.i1.p1.2.m2.2.2.3"></in><ci id="S3.I1.i1.p1.2.m2.2.2.4.cmml" xref="S3.I1.i1.p1.2.m2.2.2.4">𝑥</ci><interval closure="open" id="S3.I1.i1.p1.2.m2.2.2.2.3.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2"><apply id="S3.I1.i1.p1.2.m2.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1"><times id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1"></times><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.2.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.2">𝑙</ci><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.3.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.3">𝑒</ci><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.4.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.4">𝑓</ci><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.5.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.5">𝑡</ci></apply><apply id="S3.I1.i1.p1.2.m2.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2"><times id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1"></times><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.2">𝑟</ci><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.3.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.3">𝑖</ci><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.4.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.4">𝑔</ci><ci 
id="S3.I1.i1.p1.2.m2.2.2.2.2.2.5.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.5">ℎ</ci><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.6.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.6">𝑡</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.I1.i1.p1.2.m2.2c">x\in(left,right)</annotation><annotation encoding="application/x-llamapun" id="S3.I1.i1.p1.2.m2.2d">italic_x ∈ ( italic_l italic_e italic_f italic_t , italic_r italic_i italic_g italic_h italic_t )</annotation></semantics></math>, <math alttext="y\in(front,back)" class="ltx_Math" display="inline" id="S3.I1.i1.p1.3.m3.2"><semantics id="S3.I1.i1.p1.3.m3.2a"><mrow id="S3.I1.i1.p1.3.m3.2.2" xref="S3.I1.i1.p1.3.m3.2.2.cmml"><mi id="S3.I1.i1.p1.3.m3.2.2.4" xref="S3.I1.i1.p1.3.m3.2.2.4.cmml">y</mi><mo id="S3.I1.i1.p1.3.m3.2.2.3" xref="S3.I1.i1.p1.3.m3.2.2.3.cmml">∈</mo><mrow id="S3.I1.i1.p1.3.m3.2.2.2.2" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml"><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.3" stretchy="false" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml">(</mo><mrow id="S3.I1.i1.p1.3.m3.1.1.1.1.1" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.cmml"><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.2" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.2.cmml">f</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml"></mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.3" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.3.cmml">r</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1a" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml"></mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.4" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.4.cmml">o</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1b" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml"></mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.5" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.5.cmml">n</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1c" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml"></mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.6" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.6.cmml">t</mi></mrow><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.4" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml">,</mo><mrow id="S3.I1.i1.p1.3.m3.2.2.2.2.2" 
xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.cmml"><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.2" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.2.cmml">b</mi><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml"></mo><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.3" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.3.cmml">a</mi><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1a" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml"></mo><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.4" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.4.cmml">c</mi><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1b" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml"></mo><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.5" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.5.cmml">k</mi></mrow><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.5" stretchy="false" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.I1.i1.p1.3.m3.2b"><apply id="S3.I1.i1.p1.3.m3.2.2.cmml" xref="S3.I1.i1.p1.3.m3.2.2"><in id="S3.I1.i1.p1.3.m3.2.2.3.cmml" xref="S3.I1.i1.p1.3.m3.2.2.3"></in><ci id="S3.I1.i1.p1.3.m3.2.2.4.cmml" xref="S3.I1.i1.p1.3.m3.2.2.4">𝑦</ci><interval closure="open" id="S3.I1.i1.p1.3.m3.2.2.2.3.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2"><apply id="S3.I1.i1.p1.3.m3.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1"><times id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1"></times><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.2.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.2">𝑓</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.3.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.3">𝑟</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.4.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.4">𝑜</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.5.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.5">𝑛</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.6.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.6">𝑡</ci></apply><apply id="S3.I1.i1.p1.3.m3.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2"><times id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1"></times><ci id="S3.I1.i1.p1.3.m3.2.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.2">𝑏</ci><ci 
id="S3.I1.i1.p1.3.m3.2.2.2.2.2.3.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.3">𝑎</ci><ci id="S3.I1.i1.p1.3.m3.2.2.2.2.2.4.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.4">𝑐</ci><ci id="S3.I1.i1.p1.3.m3.2.2.2.2.2.5.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.5">𝑘</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.I1.i1.p1.3.m3.2c">y\in(front,back)</annotation><annotation encoding="application/x-llamapun" id="S3.I1.i1.p1.3.m3.2d">italic_y ∈ ( italic_f italic_r italic_o italic_n italic_t , italic_b italic_a italic_c italic_k )</annotation></semantics></math>. This process yielded over 130k geo-tagged street-view images collected from 72 cities in 48 countries.</p> </div> </li> <li class="ltx_item" id="S3.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S3.I1.i2.p1"> <p class="ltx_p" id="S3.I1.i2.p1.1"><span class="ltx_text ltx_font_bold" id="S3.I1.i2.p1.1.1">Textual Clues.</span> Textual clues often play a pivotal role in identifying the geographical locations of street-view images. Two prominent games, GeoGuessr<span class="ltx_note ltx_role_footnote" id="footnote3"><sup class="ltx_note_mark">3</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">3</sup><span class="ltx_tag ltx_tag_note">3</span>https://www.geoguessr.com</span></span></span> and Tuxun<span class="ltx_note ltx_role_footnote" id="footnote4"><sup class="ltx_note_mark">4</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">4</sup><span class="ltx_tag ltx_tag_note">4</span>https://tuxun.fun</span></span></span>, which focus on geo-localization through street views, offer a rich source of such clues. Their communities have collaboratively curated a well-organized collection of textual clues, used for pinpointing geographical locations across various countries and cities. 
These clues, maintained by both players and administrators, provide valuable domain knowledge that aids in identifying and evaluating key geographical features in street views. Although datasets built on such clues now exist <cite class="ltx_cite ltx_citemacro_citep">(Luo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib18" title="">2022</a>)</cite>, there are no readily available image-text data pairs specifically tailored for LVLM training. To bridge this gap, we gathered image-text pairs for geo-localization from these two open-source communities. Subsequently, we used a BERT-based Named Entity Recognition (NER) <cite class="ltx_cite ltx_citemacro_citep">(Kenton & Toutanova, <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib14" title="">2019</a>)</cite> model to filter out text that lacked specific geographical location information. In this way, we collected over 3K textual clues that encapsulate rich geo-localization information. One example is <em class="ltx_emph ltx_font_italic" id="S3.I1.i2.p1.1.2">“houses in central Chile are more likely to have terracotta tiled roofs”</em>. Each clue is paired with a corresponding street-view image.</p> </div> </li> </ul> </div> <div class="ltx_para" id="S3.SS1.p3"> <p class="ltx_p" id="S3.SS1.p3.1">With the GSV images and textual clues in hand, our next task is to select the GSV images with a high degree of locatability for training an LVLM. To achieve this, we design a CLIP-based visual-text pairing network. 
As depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F2" title="Figure 2 ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">2</span></a>, the GSV images undergo processing by an image encoder that deduces the image attributes.</p> </div> <div class="ltx_para" id="S3.SS1.p4"> <p class="ltx_p" id="S3.SS1.p4.14">Here, we first use MaskFormer <cite class="ltx_cite ltx_citemacro_citep">(Cheng et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib6" title="">2021</a>)</cite> to predict segmentation masks for various classes in GSV images, such as buildings, sky, and vehicles. We then compute an <math alttext="n" class="ltx_Math" display="inline" id="S3.SS1.p4.1.m1.1"><semantics id="S3.SS1.p4.1.m1.1a"><mi id="S3.SS1.p4.1.m1.1.1" xref="S3.SS1.p4.1.m1.1.1.cmml">n</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.1.m1.1b"><ci id="S3.SS1.p4.1.m1.1.1.cmml" xref="S3.SS1.p4.1.m1.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.1.m1.1c">n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.1.m1.1d">italic_n</annotation></semantics></math>-length vector <math alttext="\textbf{I}_{seg}" class="ltx_Math" display="inline" id="S3.SS1.p4.2.m2.1"><semantics id="S3.SS1.p4.2.m2.1a"><msub id="S3.SS1.p4.2.m2.1.1" xref="S3.SS1.p4.2.m2.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.2.m2.1.1.2" xref="S3.SS1.p4.2.m2.1.1.2a.cmml">I</mtext><mrow id="S3.SS1.p4.2.m2.1.1.3" xref="S3.SS1.p4.2.m2.1.1.3.cmml"><mi id="S3.SS1.p4.2.m2.1.1.3.2" xref="S3.SS1.p4.2.m2.1.1.3.2.cmml">s</mi><mo id="S3.SS1.p4.2.m2.1.1.3.1" xref="S3.SS1.p4.2.m2.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.2.m2.1.1.3.3" xref="S3.SS1.p4.2.m2.1.1.3.3.cmml">e</mi><mo id="S3.SS1.p4.2.m2.1.1.3.1a" xref="S3.SS1.p4.2.m2.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.2.m2.1.1.3.4" 
xref="S3.SS1.p4.2.m2.1.1.3.4.cmml">g</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.2.m2.1b"><apply id="S3.SS1.p4.2.m2.1.1.cmml" xref="S3.SS1.p4.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.2.m2.1.1.1.cmml" xref="S3.SS1.p4.2.m2.1.1">subscript</csymbol><ci id="S3.SS1.p4.2.m2.1.1.2a.cmml" xref="S3.SS1.p4.2.m2.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.2.m2.1.1.2.cmml" xref="S3.SS1.p4.2.m2.1.1.2">I</mtext></ci><apply id="S3.SS1.p4.2.m2.1.1.3.cmml" xref="S3.SS1.p4.2.m2.1.1.3"><times id="S3.SS1.p4.2.m2.1.1.3.1.cmml" xref="S3.SS1.p4.2.m2.1.1.3.1"></times><ci id="S3.SS1.p4.2.m2.1.1.3.2.cmml" xref="S3.SS1.p4.2.m2.1.1.3.2">𝑠</ci><ci id="S3.SS1.p4.2.m2.1.1.3.3.cmml" xref="S3.SS1.p4.2.m2.1.1.3.3">𝑒</ci><ci id="S3.SS1.p4.2.m2.1.1.3.4.cmml" xref="S3.SS1.p4.2.m2.1.1.3.4">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.2.m2.1c">\textbf{I}_{seg}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.2.m2.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT</annotation></semantics></math>, which quantifies the area ratio of each mask class, where <math alttext="n" class="ltx_Math" display="inline" id="S3.SS1.p4.3.m3.1"><semantics id="S3.SS1.p4.3.m3.1a"><mi id="S3.SS1.p4.3.m3.1.1" xref="S3.SS1.p4.3.m3.1.1.cmml">n</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.3.m3.1b"><ci id="S3.SS1.p4.3.m3.1.1.cmml" xref="S3.SS1.p4.3.m3.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.3.m3.1c">n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.3.m3.1d">italic_n</annotation></semantics></math> represents the number of classes. 
Subsequently, we utilize Sentence-BERT <cite class="ltx_cite ltx_citemacro_citep">(Reimers & Gurevych, <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib24" title="">2019</a>)</cite> to measure the similarity between textual clues and semantic segmentation labels, yielding an <math alttext="m\times n" class="ltx_Math" display="inline" id="S3.SS1.p4.4.m4.1"><semantics id="S3.SS1.p4.4.m4.1a"><mrow id="S3.SS1.p4.4.m4.1.1" xref="S3.SS1.p4.4.m4.1.1.cmml"><mi id="S3.SS1.p4.4.m4.1.1.2" xref="S3.SS1.p4.4.m4.1.1.2.cmml">m</mi><mo id="S3.SS1.p4.4.m4.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.SS1.p4.4.m4.1.1.1.cmml">×</mo><mi id="S3.SS1.p4.4.m4.1.1.3" xref="S3.SS1.p4.4.m4.1.1.3.cmml">n</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.4.m4.1b"><apply id="S3.SS1.p4.4.m4.1.1.cmml" xref="S3.SS1.p4.4.m4.1.1"><times id="S3.SS1.p4.4.m4.1.1.1.cmml" xref="S3.SS1.p4.4.m4.1.1.1"></times><ci id="S3.SS1.p4.4.m4.1.1.2.cmml" xref="S3.SS1.p4.4.m4.1.1.2">𝑚</ci><ci id="S3.SS1.p4.4.m4.1.1.3.cmml" xref="S3.SS1.p4.4.m4.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.4.m4.1c">m\times n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.4.m4.1d">italic_m × italic_n</annotation></semantics></math> matrix <math alttext="M" class="ltx_Math" display="inline" id="S3.SS1.p4.5.m5.1"><semantics id="S3.SS1.p4.5.m5.1a"><mi id="S3.SS1.p4.5.m5.1.1" xref="S3.SS1.p4.5.m5.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.5.m5.1b"><ci id="S3.SS1.p4.5.m5.1.1.cmml" xref="S3.SS1.p4.5.m5.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.5.m5.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.5.m5.1d">italic_M</annotation></semantics></math>, where <math alttext="m" class="ltx_Math" display="inline" id="S3.SS1.p4.6.m6.1"><semantics id="S3.SS1.p4.6.m6.1a"><mi id="S3.SS1.p4.6.m6.1.1" 
xref="S3.SS1.p4.6.m6.1.1.cmml">m</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.6.m6.1b"><ci id="S3.SS1.p4.6.m6.1.1.cmml" xref="S3.SS1.p4.6.m6.1.1">𝑚</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.6.m6.1c">m</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.6.m6.1d">italic_m</annotation></semantics></math> is the number of textual clues. After that, we normalize <math alttext="M" class="ltx_Math" display="inline" id="S3.SS1.p4.7.m7.1"><semantics id="S3.SS1.p4.7.m7.1a"><mi id="S3.SS1.p4.7.m7.1.1" xref="S3.SS1.p4.7.m7.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.7.m7.1b"><ci id="S3.SS1.p4.7.m7.1.1.cmml" xref="S3.SS1.p4.7.m7.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.7.m7.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.7.m7.1d">italic_M</annotation></semantics></math> using min-max normalization, and set values lower than the threshold to zero, resulting in another <math alttext="m\times n" class="ltx_Math" display="inline" id="S3.SS1.p4.8.m8.1"><semantics id="S3.SS1.p4.8.m8.1a"><mrow id="S3.SS1.p4.8.m8.1.1" xref="S3.SS1.p4.8.m8.1.1.cmml"><mi id="S3.SS1.p4.8.m8.1.1.2" xref="S3.SS1.p4.8.m8.1.1.2.cmml">m</mi><mo id="S3.SS1.p4.8.m8.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.SS1.p4.8.m8.1.1.1.cmml">×</mo><mi id="S3.SS1.p4.8.m8.1.1.3" xref="S3.SS1.p4.8.m8.1.1.3.cmml">n</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.8.m8.1b"><apply id="S3.SS1.p4.8.m8.1.1.cmml" xref="S3.SS1.p4.8.m8.1.1"><times id="S3.SS1.p4.8.m8.1.1.1.cmml" xref="S3.SS1.p4.8.m8.1.1.1"></times><ci id="S3.SS1.p4.8.m8.1.1.2.cmml" xref="S3.SS1.p4.8.m8.1.1.2">𝑚</ci><ci id="S3.SS1.p4.8.m8.1.1.3.cmml" xref="S3.SS1.p4.8.m8.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.8.m8.1c">m\times n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.8.m8.1d">italic_m × 
italic_n</annotation></semantics></math> matrix <math alttext="\hat{M}" class="ltx_Math" display="inline" id="S3.SS1.p4.9.m9.1"><semantics id="S3.SS1.p4.9.m9.1a"><mover accent="true" id="S3.SS1.p4.9.m9.1.1" xref="S3.SS1.p4.9.m9.1.1.cmml"><mi id="S3.SS1.p4.9.m9.1.1.2" xref="S3.SS1.p4.9.m9.1.1.2.cmml">M</mi><mo id="S3.SS1.p4.9.m9.1.1.1" xref="S3.SS1.p4.9.m9.1.1.1.cmml">^</mo></mover><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.9.m9.1b"><apply id="S3.SS1.p4.9.m9.1.1.cmml" xref="S3.SS1.p4.9.m9.1.1"><ci id="S3.SS1.p4.9.m9.1.1.1.cmml" xref="S3.SS1.p4.9.m9.1.1.1">^</ci><ci id="S3.SS1.p4.9.m9.1.1.2.cmml" xref="S3.SS1.p4.9.m9.1.1.2">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.9.m9.1c">\hat{M}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.9.m9.1d">over^ start_ARG italic_M end_ARG</annotation></semantics></math>. We reduce <math alttext="\hat{M}" class="ltx_Math" display="inline" id="S3.SS1.p4.10.m10.1"><semantics id="S3.SS1.p4.10.m10.1a"><mover accent="true" id="S3.SS1.p4.10.m10.1.1" xref="S3.SS1.p4.10.m10.1.1.cmml"><mi id="S3.SS1.p4.10.m10.1.1.2" xref="S3.SS1.p4.10.m10.1.1.2.cmml">M</mi><mo id="S3.SS1.p4.10.m10.1.1.1" xref="S3.SS1.p4.10.m10.1.1.1.cmml">^</mo></mover><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.10.m10.1b"><apply id="S3.SS1.p4.10.m10.1.1.cmml" xref="S3.SS1.p4.10.m10.1.1"><ci id="S3.SS1.p4.10.m10.1.1.1.cmml" xref="S3.SS1.p4.10.m10.1.1.1">^</ci><ci id="S3.SS1.p4.10.m10.1.1.2.cmml" xref="S3.SS1.p4.10.m10.1.1.2">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.10.m10.1c">\hat{M}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.10.m10.1d">over^ start_ARG italic_M end_ARG</annotation></semantics></math> to an <math alttext="n" class="ltx_Math" display="inline" id="S3.SS1.p4.11.m11.1"><semantics id="S3.SS1.p4.11.m11.1a"><mi id="S3.SS1.p4.11.m11.1.1" xref="S3.SS1.p4.11.m11.1.1.cmml">n</mi><annotation-xml 
encoding="MathML-Content" id="S3.SS1.p4.11.m11.1b"><ci id="S3.SS1.p4.11.m11.1.1.cmml" xref="S3.SS1.p4.11.m11.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.11.m11.1c">n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.11.m11.1d">italic_n</annotation></semantics></math>-length vector by calculating the mean across its rows, and then normalize it to obtain <math alttext="\textbf{w}_{loc}" class="ltx_Math" display="inline" id="S3.SS1.p4.12.m12.1"><semantics id="S3.SS1.p4.12.m12.1a"><msub id="S3.SS1.p4.12.m12.1.1" xref="S3.SS1.p4.12.m12.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.12.m12.1.1.2" xref="S3.SS1.p4.12.m12.1.1.2a.cmml">w</mtext><mrow id="S3.SS1.p4.12.m12.1.1.3" xref="S3.SS1.p4.12.m12.1.1.3.cmml"><mi id="S3.SS1.p4.12.m12.1.1.3.2" xref="S3.SS1.p4.12.m12.1.1.3.2.cmml">l</mi><mo id="S3.SS1.p4.12.m12.1.1.3.1" xref="S3.SS1.p4.12.m12.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.12.m12.1.1.3.3" xref="S3.SS1.p4.12.m12.1.1.3.3.cmml">o</mi><mo id="S3.SS1.p4.12.m12.1.1.3.1a" xref="S3.SS1.p4.12.m12.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.12.m12.1.1.3.4" xref="S3.SS1.p4.12.m12.1.1.3.4.cmml">c</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.12.m12.1b"><apply id="S3.SS1.p4.12.m12.1.1.cmml" xref="S3.SS1.p4.12.m12.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.12.m12.1.1.1.cmml" xref="S3.SS1.p4.12.m12.1.1">subscript</csymbol><ci id="S3.SS1.p4.12.m12.1.1.2a.cmml" xref="S3.SS1.p4.12.m12.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.12.m12.1.1.2.cmml" xref="S3.SS1.p4.12.m12.1.1.2">w</mtext></ci><apply id="S3.SS1.p4.12.m12.1.1.3.cmml" xref="S3.SS1.p4.12.m12.1.1.3"><times id="S3.SS1.p4.12.m12.1.1.3.1.cmml" xref="S3.SS1.p4.12.m12.1.1.3.1"></times><ci id="S3.SS1.p4.12.m12.1.1.3.2.cmml" xref="S3.SS1.p4.12.m12.1.1.3.2">𝑙</ci><ci id="S3.SS1.p4.12.m12.1.1.3.3.cmml" xref="S3.SS1.p4.12.m12.1.1.3.3">𝑜</ci><ci id="S3.SS1.p4.12.m12.1.1.3.4.cmml" 
xref="S3.SS1.p4.12.m12.1.1.3.4">𝑐</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.12.m12.1c">\textbf{w}_{loc}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.12.m12.1d">w start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT</annotation></semantics></math>. This vector represents the importance of each semantic segmentation label for geo-localization. With the segmentation mask ratio <math alttext="\textbf{I}_{seg}" class="ltx_Math" display="inline" id="S3.SS1.p4.13.m13.1"><semantics id="S3.SS1.p4.13.m13.1a"><msub id="S3.SS1.p4.13.m13.1.1" xref="S3.SS1.p4.13.m13.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.13.m13.1.1.2" xref="S3.SS1.p4.13.m13.1.1.2a.cmml">I</mtext><mrow id="S3.SS1.p4.13.m13.1.1.3" xref="S3.SS1.p4.13.m13.1.1.3.cmml"><mi id="S3.SS1.p4.13.m13.1.1.3.2" xref="S3.SS1.p4.13.m13.1.1.3.2.cmml">s</mi><mo id="S3.SS1.p4.13.m13.1.1.3.1" xref="S3.SS1.p4.13.m13.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.13.m13.1.1.3.3" xref="S3.SS1.p4.13.m13.1.1.3.3.cmml">e</mi><mo id="S3.SS1.p4.13.m13.1.1.3.1a" xref="S3.SS1.p4.13.m13.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.13.m13.1.1.3.4" xref="S3.SS1.p4.13.m13.1.1.3.4.cmml">g</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.13.m13.1b"><apply id="S3.SS1.p4.13.m13.1.1.cmml" xref="S3.SS1.p4.13.m13.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.13.m13.1.1.1.cmml" xref="S3.SS1.p4.13.m13.1.1">subscript</csymbol><ci id="S3.SS1.p4.13.m13.1.1.2a.cmml" xref="S3.SS1.p4.13.m13.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.13.m13.1.1.2.cmml" xref="S3.SS1.p4.13.m13.1.1.2">I</mtext></ci><apply id="S3.SS1.p4.13.m13.1.1.3.cmml" xref="S3.SS1.p4.13.m13.1.1.3"><times id="S3.SS1.p4.13.m13.1.1.3.1.cmml" xref="S3.SS1.p4.13.m13.1.1.3.1"></times><ci id="S3.SS1.p4.13.m13.1.1.3.2.cmml" xref="S3.SS1.p4.13.m13.1.1.3.2">𝑠</ci><ci id="S3.SS1.p4.13.m13.1.1.3.3.cmml" xref="S3.SS1.p4.13.m13.1.1.3.3">𝑒</ci><ci id="S3.SS1.p4.13.m13.1.1.3.4.cmml" 
xref="S3.SS1.p4.13.m13.1.1.3.4">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.13.m13.1c">\textbf{I}_{seg}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.13.m13.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT</annotation></semantics></math> and the corresponding weight <math alttext="\textbf{w}_{loc}" class="ltx_Math" display="inline" id="S3.SS1.p4.14.m14.1"><semantics id="S3.SS1.p4.14.m14.1a"><msub id="S3.SS1.p4.14.m14.1.1" xref="S3.SS1.p4.14.m14.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.14.m14.1.1.2" xref="S3.SS1.p4.14.m14.1.1.2a.cmml">w</mtext><mrow id="S3.SS1.p4.14.m14.1.1.3" xref="S3.SS1.p4.14.m14.1.1.3.cmml"><mi id="S3.SS1.p4.14.m14.1.1.3.2" xref="S3.SS1.p4.14.m14.1.1.3.2.cmml">l</mi><mo id="S3.SS1.p4.14.m14.1.1.3.1" xref="S3.SS1.p4.14.m14.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.14.m14.1.1.3.3" xref="S3.SS1.p4.14.m14.1.1.3.3.cmml">o</mi><mo id="S3.SS1.p4.14.m14.1.1.3.1a" xref="S3.SS1.p4.14.m14.1.1.3.1.cmml"></mo><mi id="S3.SS1.p4.14.m14.1.1.3.4" xref="S3.SS1.p4.14.m14.1.1.3.4.cmml">c</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.14.m14.1b"><apply id="S3.SS1.p4.14.m14.1.1.cmml" xref="S3.SS1.p4.14.m14.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.14.m14.1.1.1.cmml" xref="S3.SS1.p4.14.m14.1.1">subscript</csymbol><ci id="S3.SS1.p4.14.m14.1.1.2a.cmml" xref="S3.SS1.p4.14.m14.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.14.m14.1.1.2.cmml" xref="S3.SS1.p4.14.m14.1.1.2">w</mtext></ci><apply id="S3.SS1.p4.14.m14.1.1.3.cmml" xref="S3.SS1.p4.14.m14.1.1.3"><times id="S3.SS1.p4.14.m14.1.1.3.1.cmml" xref="S3.SS1.p4.14.m14.1.1.3.1"></times><ci id="S3.SS1.p4.14.m14.1.1.3.2.cmml" xref="S3.SS1.p4.14.m14.1.1.3.2">𝑙</ci><ci id="S3.SS1.p4.14.m14.1.1.3.3.cmml" xref="S3.SS1.p4.14.m14.1.1.3.3">𝑜</ci><ci id="S3.SS1.p4.14.m14.1.1.3.4.cmml" xref="S3.SS1.p4.14.m14.1.1.3.4">𝑐</ci></apply></apply></annotation-xml><annotation 
encoding="application/x-tex" id="S3.SS1.p4.14.m14.1c">\textbf{w}_{loc}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.14.m14.1d">w start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT</annotation></semantics></math>, the locatability metric of a GSV image is computed through the multiplication and accumulation of the respective values, as follows:</p> <table class="ltx_equation ltx_eqn_table" id="S3.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="locatability(\textbf{I}_{seg},\textbf{w}_{loc})=\sum_{k=1}^{n}\textbf{I}_{seg}% (k)\cdot\textbf{w}^{k}_{loc},\vspace{-3mm}" class="ltx_Math" display="block" id="S3.E1.m1.2"><semantics id="S3.E1.m1.2a"><mrow id="S3.E1.m1.2.2.1" xref="S3.E1.m1.2.2.1.1.cmml"><mrow id="S3.E1.m1.2.2.1.1" xref="S3.E1.m1.2.2.1.1.cmml"><mrow id="S3.E1.m1.2.2.1.1.2" xref="S3.E1.m1.2.2.1.1.2.cmml"><mi id="S3.E1.m1.2.2.1.1.2.4" xref="S3.E1.m1.2.2.1.1.2.4.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.2.3" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.5" xref="S3.E1.m1.2.2.1.1.2.5.cmml">o</mi><mo id="S3.E1.m1.2.2.1.1.2.3a" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.6" xref="S3.E1.m1.2.2.1.1.2.6.cmml">c</mi><mo id="S3.E1.m1.2.2.1.1.2.3b" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.7" xref="S3.E1.m1.2.2.1.1.2.7.cmml">a</mi><mo id="S3.E1.m1.2.2.1.1.2.3c" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.8" xref="S3.E1.m1.2.2.1.1.2.8.cmml">t</mi><mo id="S3.E1.m1.2.2.1.1.2.3d" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.9" xref="S3.E1.m1.2.2.1.1.2.9.cmml">a</mi><mo id="S3.E1.m1.2.2.1.1.2.3e" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.10" xref="S3.E1.m1.2.2.1.1.2.10.cmml">b</mi><mo id="S3.E1.m1.2.2.1.1.2.3f" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.11" 
xref="S3.E1.m1.2.2.1.1.2.11.cmml">i</mi><mo id="S3.E1.m1.2.2.1.1.2.3g" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.12" xref="S3.E1.m1.2.2.1.1.2.12.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.2.3h" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.13" xref="S3.E1.m1.2.2.1.1.2.13.cmml">i</mi><mo id="S3.E1.m1.2.2.1.1.2.3i" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.14" xref="S3.E1.m1.2.2.1.1.2.14.cmml">t</mi><mo id="S3.E1.m1.2.2.1.1.2.3j" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.15" xref="S3.E1.m1.2.2.1.1.2.15.cmml">y</mi><mo id="S3.E1.m1.2.2.1.1.2.3k" xref="S3.E1.m1.2.2.1.1.2.3.cmml"></mo><mrow id="S3.E1.m1.2.2.1.1.2.2.2" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml"><mo id="S3.E1.m1.2.2.1.1.2.2.2.3" stretchy="false" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml">(</mo><msub id="S3.E1.m1.2.2.1.1.1.1.1.1" xref="S3.E1.m1.2.2.1.1.1.1.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.1.1.1.1.2" xref="S3.E1.m1.2.2.1.1.1.1.1.1.2a.cmml">I</mtext><mrow id="S3.E1.m1.2.2.1.1.1.1.1.1.3" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.cmml"><mi id="S3.E1.m1.2.2.1.1.1.1.1.1.3.2" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.2.cmml">s</mi><mo id="S3.E1.m1.2.2.1.1.1.1.1.1.3.1" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.1.1.1.1.3.3" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.3.cmml">e</mi><mo id="S3.E1.m1.2.2.1.1.1.1.1.1.3.1a" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.1.1.1.1.3.4" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.4.cmml">g</mi></mrow></msub><mo id="S3.E1.m1.2.2.1.1.2.2.2.4" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml">,</mo><msub id="S3.E1.m1.2.2.1.1.2.2.2.2" xref="S3.E1.m1.2.2.1.1.2.2.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.2.2.2.2.2" xref="S3.E1.m1.2.2.1.1.2.2.2.2.2a.cmml">w</mtext><mrow id="S3.E1.m1.2.2.1.1.2.2.2.2.3" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.cmml"><mi id="S3.E1.m1.2.2.1.1.2.2.2.2.3.2" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.2.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.2.2.2.2.3.1" 
xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.2.2.2.3.3" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.3.cmml">o</mi><mo id="S3.E1.m1.2.2.1.1.2.2.2.2.3.1a" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.2.2.2.2.3.4" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.4.cmml">c</mi></mrow></msub><mo id="S3.E1.m1.2.2.1.1.2.2.2.5" stretchy="false" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.2.2.1.1.3" rspace="0.111em" xref="S3.E1.m1.2.2.1.1.3.cmml">=</mo><mrow id="S3.E1.m1.2.2.1.1.4" xref="S3.E1.m1.2.2.1.1.4.cmml"><munderover id="S3.E1.m1.2.2.1.1.4.1" xref="S3.E1.m1.2.2.1.1.4.1.cmml"><mo id="S3.E1.m1.2.2.1.1.4.1.2.2" movablelimits="false" xref="S3.E1.m1.2.2.1.1.4.1.2.2.cmml">∑</mo><mrow id="S3.E1.m1.2.2.1.1.4.1.2.3" xref="S3.E1.m1.2.2.1.1.4.1.2.3.cmml"><mi id="S3.E1.m1.2.2.1.1.4.1.2.3.2" xref="S3.E1.m1.2.2.1.1.4.1.2.3.2.cmml">k</mi><mo id="S3.E1.m1.2.2.1.1.4.1.2.3.1" xref="S3.E1.m1.2.2.1.1.4.1.2.3.1.cmml">=</mo><mn id="S3.E1.m1.2.2.1.1.4.1.2.3.3" xref="S3.E1.m1.2.2.1.1.4.1.2.3.3.cmml">1</mn></mrow><mi id="S3.E1.m1.2.2.1.1.4.1.3" xref="S3.E1.m1.2.2.1.1.4.1.3.cmml">n</mi></munderover><mrow id="S3.E1.m1.2.2.1.1.4.2" xref="S3.E1.m1.2.2.1.1.4.2.cmml"><mrow id="S3.E1.m1.2.2.1.1.4.2.2" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml"><msub id="S3.E1.m1.2.2.1.1.4.2.2.2" xref="S3.E1.m1.2.2.1.1.4.2.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.2.2.2" xref="S3.E1.m1.2.2.1.1.4.2.2.2.2a.cmml">I</mtext><mrow id="S3.E1.m1.2.2.1.1.4.2.2.2.3" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.cmml"><mi id="S3.E1.m1.2.2.1.1.4.2.2.2.3.2" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.2.cmml">s</mi><mo id="S3.E1.m1.2.2.1.1.4.2.2.2.3.1" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.4.2.2.2.3.3" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.3.cmml">e</mi><mo id="S3.E1.m1.2.2.1.1.4.2.2.2.3.1a" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.4.2.2.2.3.4" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.4.cmml">g</mi></mrow></msub><mo 
id="S3.E1.m1.2.2.1.1.4.2.2.1" xref="S3.E1.m1.2.2.1.1.4.2.2.1.cmml"></mo><mrow id="S3.E1.m1.2.2.1.1.4.2.2.3.2" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml"><mo id="S3.E1.m1.2.2.1.1.4.2.2.3.2.1" stretchy="false" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml">(</mo><mi id="S3.E1.m1.1.1" xref="S3.E1.m1.1.1.cmml">k</mi><mo id="S3.E1.m1.2.2.1.1.4.2.2.3.2.2" rspace="0.055em" stretchy="false" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.2.2.1.1.4.2.1" rspace="0.222em" xref="S3.E1.m1.2.2.1.1.4.2.1.cmml">⋅</mo><msubsup id="S3.E1.m1.2.2.1.1.4.2.3" xref="S3.E1.m1.2.2.1.1.4.2.3.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.3.2.2" xref="S3.E1.m1.2.2.1.1.4.2.3.2.2a.cmml">w</mtext><mrow id="S3.E1.m1.2.2.1.1.4.2.3.3" xref="S3.E1.m1.2.2.1.1.4.2.3.3.cmml"><mi id="S3.E1.m1.2.2.1.1.4.2.3.3.2" xref="S3.E1.m1.2.2.1.1.4.2.3.3.2.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.4.2.3.3.1" xref="S3.E1.m1.2.2.1.1.4.2.3.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.4.2.3.3.3" xref="S3.E1.m1.2.2.1.1.4.2.3.3.3.cmml">o</mi><mo id="S3.E1.m1.2.2.1.1.4.2.3.3.1a" xref="S3.E1.m1.2.2.1.1.4.2.3.3.1.cmml"></mo><mi id="S3.E1.m1.2.2.1.1.4.2.3.3.4" xref="S3.E1.m1.2.2.1.1.4.2.3.3.4.cmml">c</mi></mrow><mi id="S3.E1.m1.2.2.1.1.4.2.3.2.3" xref="S3.E1.m1.2.2.1.1.4.2.3.2.3.cmml">k</mi></msubsup></mrow></mrow></mrow><mo id="S3.E1.m1.2.2.1.2" xref="S3.E1.m1.2.2.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E1.m1.2b"><apply id="S3.E1.m1.2.2.1.1.cmml" xref="S3.E1.m1.2.2.1"><eq id="S3.E1.m1.2.2.1.1.3.cmml" xref="S3.E1.m1.2.2.1.1.3"></eq><apply id="S3.E1.m1.2.2.1.1.2.cmml" xref="S3.E1.m1.2.2.1.1.2"><times id="S3.E1.m1.2.2.1.1.2.3.cmml" xref="S3.E1.m1.2.2.1.1.2.3"></times><ci id="S3.E1.m1.2.2.1.1.2.4.cmml" xref="S3.E1.m1.2.2.1.1.2.4">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.2.5.cmml" xref="S3.E1.m1.2.2.1.1.2.5">𝑜</ci><ci id="S3.E1.m1.2.2.1.1.2.6.cmml" xref="S3.E1.m1.2.2.1.1.2.6">𝑐</ci><ci id="S3.E1.m1.2.2.1.1.2.7.cmml" xref="S3.E1.m1.2.2.1.1.2.7">𝑎</ci><ci id="S3.E1.m1.2.2.1.1.2.8.cmml" 
xref="S3.E1.m1.2.2.1.1.2.8">𝑡</ci><ci id="S3.E1.m1.2.2.1.1.2.9.cmml" xref="S3.E1.m1.2.2.1.1.2.9">𝑎</ci><ci id="S3.E1.m1.2.2.1.1.2.10.cmml" xref="S3.E1.m1.2.2.1.1.2.10">𝑏</ci><ci id="S3.E1.m1.2.2.1.1.2.11.cmml" xref="S3.E1.m1.2.2.1.1.2.11">𝑖</ci><ci id="S3.E1.m1.2.2.1.1.2.12.cmml" xref="S3.E1.m1.2.2.1.1.2.12">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.2.13.cmml" xref="S3.E1.m1.2.2.1.1.2.13">𝑖</ci><ci id="S3.E1.m1.2.2.1.1.2.14.cmml" xref="S3.E1.m1.2.2.1.1.2.14">𝑡</ci><ci id="S3.E1.m1.2.2.1.1.2.15.cmml" xref="S3.E1.m1.2.2.1.1.2.15">𝑦</ci><interval closure="open" id="S3.E1.m1.2.2.1.1.2.2.3.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2"><apply id="S3.E1.m1.2.2.1.1.1.1.1.1.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.2a.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.2">I</mtext></ci><apply id="S3.E1.m1.2.2.1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3"><times id="S3.E1.m1.2.2.1.1.1.1.1.1.3.1.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.1"></times><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.3.2.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.2">𝑠</ci><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.3.3.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.3">𝑒</ci><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.3.4.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.4">𝑔</ci></apply></apply><apply id="S3.E1.m1.2.2.1.1.2.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.2.2.2.2.1.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2">subscript</csymbol><ci id="S3.E1.m1.2.2.1.1.2.2.2.2.2a.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.2.2.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.2">w</mtext></ci><apply id="S3.E1.m1.2.2.1.1.2.2.2.2.3.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3"><times id="S3.E1.m1.2.2.1.1.2.2.2.2.3.1.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.1"></times><ci 
id="S3.E1.m1.2.2.1.1.2.2.2.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.2">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.2.2.2.2.3.3.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.3">𝑜</ci><ci id="S3.E1.m1.2.2.1.1.2.2.2.2.3.4.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.4">𝑐</ci></apply></apply></interval></apply><apply id="S3.E1.m1.2.2.1.1.4.cmml" xref="S3.E1.m1.2.2.1.1.4"><apply id="S3.E1.m1.2.2.1.1.4.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.1.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1">superscript</csymbol><apply id="S3.E1.m1.2.2.1.1.4.1.2.cmml" xref="S3.E1.m1.2.2.1.1.4.1"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.1.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1">subscript</csymbol><sum id="S3.E1.m1.2.2.1.1.4.1.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.2"></sum><apply id="S3.E1.m1.2.2.1.1.4.1.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.3"><eq id="S3.E1.m1.2.2.1.1.4.1.2.3.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.3.1"></eq><ci id="S3.E1.m1.2.2.1.1.4.1.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.3.2">𝑘</ci><cn id="S3.E1.m1.2.2.1.1.4.1.2.3.3.cmml" type="integer" xref="S3.E1.m1.2.2.1.1.4.1.2.3.3">1</cn></apply></apply><ci id="S3.E1.m1.2.2.1.1.4.1.3.cmml" xref="S3.E1.m1.2.2.1.1.4.1.3">𝑛</ci></apply><apply id="S3.E1.m1.2.2.1.1.4.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2"><ci id="S3.E1.m1.2.2.1.1.4.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.1">⋅</ci><apply id="S3.E1.m1.2.2.1.1.4.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2"><times id="S3.E1.m1.2.2.1.1.4.2.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.1"></times><apply id="S3.E1.m1.2.2.1.1.4.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.2.2.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2">subscript</csymbol><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.2a.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.2">I</mtext></ci><apply id="S3.E1.m1.2.2.1.1.4.2.2.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3"><times id="S3.E1.m1.2.2.1.1.4.2.2.2.3.1.cmml" 
xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.1"></times><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.2">𝑠</ci><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.3.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.3">𝑒</ci><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.3.4.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.4">𝑔</ci></apply></apply><ci id="S3.E1.m1.1.1.cmml" xref="S3.E1.m1.1.1">𝑘</ci></apply><apply id="S3.E1.m1.2.2.1.1.4.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.2.3.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3">subscript</csymbol><apply id="S3.E1.m1.2.2.1.1.4.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.2.3.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3">superscript</csymbol><ci id="S3.E1.m1.2.2.1.1.4.2.3.2.2a.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.2.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.3.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.2.2">w</mtext></ci><ci id="S3.E1.m1.2.2.1.1.4.2.3.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.2.3">𝑘</ci></apply><apply id="S3.E1.m1.2.2.1.1.4.2.3.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3"><times id="S3.E1.m1.2.2.1.1.4.2.3.3.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.1"></times><ci id="S3.E1.m1.2.2.1.1.4.2.3.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.2">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.4.2.3.3.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.3">𝑜</ci><ci id="S3.E1.m1.2.2.1.1.4.2.3.3.4.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.4">𝑐</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E1.m1.2c">locatability(\textbf{I}_{seg},\textbf{w}_{loc})=\sum_{k=1}^{n}\textbf{I}_{seg}% (k)\cdot\textbf{w}^{k}_{loc},\vspace{-3mm}</annotation><annotation encoding="application/x-llamapun" id="S3.E1.m1.2d">italic_l italic_o italic_c italic_a italic_t italic_a italic_b italic_i italic_l italic_i italic_t italic_y ( I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT , w start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT ) = ∑ 
start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT ( italic_k ) ⋅ w start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td> </tr></tbody> </table> </div> <div class="ltx_para" id="S3.SS1.p5"> <p class="ltx_p" id="S3.SS1.p5.3">where <math alttext="\textbf{I}_{seg}(k)" class="ltx_Math" display="inline" id="S3.SS1.p5.1.m1.1"><semantics id="S3.SS1.p5.1.m1.1a"><mrow id="S3.SS1.p5.1.m1.1.2" xref="S3.SS1.p5.1.m1.1.2.cmml"><msub id="S3.SS1.p5.1.m1.1.2.2" xref="S3.SS1.p5.1.m1.1.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p5.1.m1.1.2.2.2" xref="S3.SS1.p5.1.m1.1.2.2.2a.cmml">I</mtext><mrow id="S3.SS1.p5.1.m1.1.2.2.3" xref="S3.SS1.p5.1.m1.1.2.2.3.cmml"><mi id="S3.SS1.p5.1.m1.1.2.2.3.2" xref="S3.SS1.p5.1.m1.1.2.2.3.2.cmml">s</mi><mo id="S3.SS1.p5.1.m1.1.2.2.3.1" xref="S3.SS1.p5.1.m1.1.2.2.3.1.cmml"></mo><mi id="S3.SS1.p5.1.m1.1.2.2.3.3" xref="S3.SS1.p5.1.m1.1.2.2.3.3.cmml">e</mi><mo id="S3.SS1.p5.1.m1.1.2.2.3.1a" xref="S3.SS1.p5.1.m1.1.2.2.3.1.cmml"></mo><mi id="S3.SS1.p5.1.m1.1.2.2.3.4" xref="S3.SS1.p5.1.m1.1.2.2.3.4.cmml">g</mi></mrow></msub><mo id="S3.SS1.p5.1.m1.1.2.1" xref="S3.SS1.p5.1.m1.1.2.1.cmml"></mo><mrow id="S3.SS1.p5.1.m1.1.2.3.2" xref="S3.SS1.p5.1.m1.1.2.cmml"><mo id="S3.SS1.p5.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS1.p5.1.m1.1.2.cmml">(</mo><mi id="S3.SS1.p5.1.m1.1.1" xref="S3.SS1.p5.1.m1.1.1.cmml">k</mi><mo id="S3.SS1.p5.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS1.p5.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p5.1.m1.1b"><apply id="S3.SS1.p5.1.m1.1.2.cmml" 
xref="S3.SS1.p5.1.m1.1.2"><times id="S3.SS1.p5.1.m1.1.2.1.cmml" xref="S3.SS1.p5.1.m1.1.2.1"></times><apply id="S3.SS1.p5.1.m1.1.2.2.cmml" xref="S3.SS1.p5.1.m1.1.2.2"><csymbol cd="ambiguous" id="S3.SS1.p5.1.m1.1.2.2.1.cmml" xref="S3.SS1.p5.1.m1.1.2.2">subscript</csymbol><ci id="S3.SS1.p5.1.m1.1.2.2.2a.cmml" xref="S3.SS1.p5.1.m1.1.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p5.1.m1.1.2.2.2.cmml" xref="S3.SS1.p5.1.m1.1.2.2.2">I</mtext></ci><apply id="S3.SS1.p5.1.m1.1.2.2.3.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3"><times id="S3.SS1.p5.1.m1.1.2.2.3.1.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.1"></times><ci id="S3.SS1.p5.1.m1.1.2.2.3.2.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.2">𝑠</ci><ci id="S3.SS1.p5.1.m1.1.2.2.3.3.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.3">𝑒</ci><ci id="S3.SS1.p5.1.m1.1.2.2.3.4.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.4">𝑔</ci></apply></apply><ci id="S3.SS1.p5.1.m1.1.1.cmml" xref="S3.SS1.p5.1.m1.1.1">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p5.1.m1.1c">\textbf{I}_{seg}(k)</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p5.1.m1.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT ( italic_k )</annotation></semantics></math> denotes pixel ratio of the <math alttext="k" class="ltx_Math" display="inline" id="S3.SS1.p5.2.m2.1"><semantics id="S3.SS1.p5.2.m2.1a"><mi id="S3.SS1.p5.2.m2.1.1" xref="S3.SS1.p5.2.m2.1.1.cmml">k</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p5.2.m2.1b"><ci id="S3.SS1.p5.2.m2.1.1.cmml" xref="S3.SS1.p5.2.m2.1.1">𝑘</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p5.2.m2.1c">k</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p5.2.m2.1d">italic_k</annotation></semantics></math>-th class in the segmentation mask <math alttext="\textbf{I}_{seg}" class="ltx_Math" display="inline" id="S3.SS1.p5.3.m3.1"><semantics id="S3.SS1.p5.3.m3.1a"><msub id="S3.SS1.p5.3.m3.1.1" xref="S3.SS1.p5.3.m3.1.1.cmml"><mtext 
class="ltx_mathvariant_bold" id="S3.SS1.p5.3.m3.1.1.2" xref="S3.SS1.p5.3.m3.1.1.2a.cmml">I</mtext><mrow id="S3.SS1.p5.3.m3.1.1.3" xref="S3.SS1.p5.3.m3.1.1.3.cmml"><mi id="S3.SS1.p5.3.m3.1.1.3.2" xref="S3.SS1.p5.3.m3.1.1.3.2.cmml">s</mi><mo id="S3.SS1.p5.3.m3.1.1.3.1" xref="S3.SS1.p5.3.m3.1.1.3.1.cmml"></mo><mi id="S3.SS1.p5.3.m3.1.1.3.3" xref="S3.SS1.p5.3.m3.1.1.3.3.cmml">e</mi><mo id="S3.SS1.p5.3.m3.1.1.3.1a" xref="S3.SS1.p5.3.m3.1.1.3.1.cmml"></mo><mi id="S3.SS1.p5.3.m3.1.1.3.4" xref="S3.SS1.p5.3.m3.1.1.3.4.cmml">g</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p5.3.m3.1b"><apply id="S3.SS1.p5.3.m3.1.1.cmml" xref="S3.SS1.p5.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS1.p5.3.m3.1.1.1.cmml" xref="S3.SS1.p5.3.m3.1.1">subscript</csymbol><ci id="S3.SS1.p5.3.m3.1.1.2a.cmml" xref="S3.SS1.p5.3.m3.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p5.3.m3.1.1.2.cmml" xref="S3.SS1.p5.3.m3.1.1.2">I</mtext></ci><apply id="S3.SS1.p5.3.m3.1.1.3.cmml" xref="S3.SS1.p5.3.m3.1.1.3"><times id="S3.SS1.p5.3.m3.1.1.3.1.cmml" xref="S3.SS1.p5.3.m3.1.1.3.1"></times><ci id="S3.SS1.p5.3.m3.1.1.3.2.cmml" xref="S3.SS1.p5.3.m3.1.1.3.2">𝑠</ci><ci id="S3.SS1.p5.3.m3.1.1.3.3.cmml" xref="S3.SS1.p5.3.m3.1.1.3.3">𝑒</ci><ci id="S3.SS1.p5.3.m3.1.1.3.4.cmml" xref="S3.SS1.p5.3.m3.1.1.3.4">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p5.3.m3.1c">\textbf{I}_{seg}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p5.3.m3.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT</annotation></semantics></math>.</p> </div> <div class="ltx_para" id="S3.SS1.p6"> <p class="ltx_p" id="S3.SS1.p6.1">A higher <math alttext="locatability" class="ltx_Math" display="inline" id="S3.SS1.p6.1.m1.1"><semantics id="S3.SS1.p6.1.m1.1a"><mrow id="S3.SS1.p6.1.m1.1.1" xref="S3.SS1.p6.1.m1.1.1.cmml"><mi id="S3.SS1.p6.1.m1.1.1.2" xref="S3.SS1.p6.1.m1.1.1.2.cmml">l</mi><mo id="S3.SS1.p6.1.m1.1.1.1" 
xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.3" xref="S3.SS1.p6.1.m1.1.1.3.cmml">o</mi><mo id="S3.SS1.p6.1.m1.1.1.1a" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.4" xref="S3.SS1.p6.1.m1.1.1.4.cmml">c</mi><mo id="S3.SS1.p6.1.m1.1.1.1b" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.5" xref="S3.SS1.p6.1.m1.1.1.5.cmml">a</mi><mo id="S3.SS1.p6.1.m1.1.1.1c" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.6" xref="S3.SS1.p6.1.m1.1.1.6.cmml">t</mi><mo id="S3.SS1.p6.1.m1.1.1.1d" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.7" xref="S3.SS1.p6.1.m1.1.1.7.cmml">a</mi><mo id="S3.SS1.p6.1.m1.1.1.1e" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.8" xref="S3.SS1.p6.1.m1.1.1.8.cmml">b</mi><mo id="S3.SS1.p6.1.m1.1.1.1f" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.9" xref="S3.SS1.p6.1.m1.1.1.9.cmml">i</mi><mo id="S3.SS1.p6.1.m1.1.1.1g" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.10" xref="S3.SS1.p6.1.m1.1.1.10.cmml">l</mi><mo id="S3.SS1.p6.1.m1.1.1.1h" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.11" xref="S3.SS1.p6.1.m1.1.1.11.cmml">i</mi><mo id="S3.SS1.p6.1.m1.1.1.1i" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.12" xref="S3.SS1.p6.1.m1.1.1.12.cmml">t</mi><mo id="S3.SS1.p6.1.m1.1.1.1j" xref="S3.SS1.p6.1.m1.1.1.1.cmml"></mo><mi id="S3.SS1.p6.1.m1.1.1.13" xref="S3.SS1.p6.1.m1.1.1.13.cmml">y</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p6.1.m1.1b"><apply id="S3.SS1.p6.1.m1.1.1.cmml" xref="S3.SS1.p6.1.m1.1.1"><times id="S3.SS1.p6.1.m1.1.1.1.cmml" xref="S3.SS1.p6.1.m1.1.1.1"></times><ci id="S3.SS1.p6.1.m1.1.1.2.cmml" xref="S3.SS1.p6.1.m1.1.1.2">𝑙</ci><ci id="S3.SS1.p6.1.m1.1.1.3.cmml" xref="S3.SS1.p6.1.m1.1.1.3">𝑜</ci><ci id="S3.SS1.p6.1.m1.1.1.4.cmml" xref="S3.SS1.p6.1.m1.1.1.4">𝑐</ci><ci id="S3.SS1.p6.1.m1.1.1.5.cmml" xref="S3.SS1.p6.1.m1.1.1.5">𝑎</ci><ci 
id="S3.SS1.p6.1.m1.1.1.6.cmml" xref="S3.SS1.p6.1.m1.1.1.6">𝑡</ci><ci id="S3.SS1.p6.1.m1.1.1.7.cmml" xref="S3.SS1.p6.1.m1.1.1.7">𝑎</ci><ci id="S3.SS1.p6.1.m1.1.1.8.cmml" xref="S3.SS1.p6.1.m1.1.1.8">𝑏</ci><ci id="S3.SS1.p6.1.m1.1.1.9.cmml" xref="S3.SS1.p6.1.m1.1.1.9">𝑖</ci><ci id="S3.SS1.p6.1.m1.1.1.10.cmml" xref="S3.SS1.p6.1.m1.1.1.10">𝑙</ci><ci id="S3.SS1.p6.1.m1.1.1.11.cmml" xref="S3.SS1.p6.1.m1.1.1.11">𝑖</ci><ci id="S3.SS1.p6.1.m1.1.1.12.cmml" xref="S3.SS1.p6.1.m1.1.1.12">𝑡</ci><ci id="S3.SS1.p6.1.m1.1.1.13.cmml" xref="S3.SS1.p6.1.m1.1.1.13">𝑦</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p6.1.m1.1c">locatability</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p6.1.m1.1d">italic_l italic_o italic_c italic_a italic_t italic_a italic_b italic_i italic_l italic_i italic_t italic_y</annotation></semantics></math> value indicates a higher degree of visual clues exhibited in a GSV image for geo-localization, while a lower value suggests the opposite. Empirically, we selected a threshold value of 0.4 for filtering locatable GSV images. This resulted in over 70k highly locatable images with geo-tags passing to the next stage for training an LVLM.</p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2 </span>Geo-localization with Reasoning</h3> <div class="ltx_para" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1">While many models (<em class="ltx_emph ltx_font_italic" id="S3.SS2.p1.1.1">e.g.</em>, <cite class="ltx_cite ltx_citemacro_citet">Clark et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>); Pramanick et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>); Müller-Budack et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>); Seo et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>); Weyand et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>) exist for image-based geo-localization, these models typically predict locations without exposing the inference process. This introduces several limitations. First, the models operate as black boxes, which makes their predictions hard for users to interpret and impedes further refinement of the geo-localization model. More importantly, studies have demonstrated that integrating the reasoning process can enhance the capabilities of LLMs <cite class="ltx_cite ltx_citemacro_citep">(Qiao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib21" title="">2023</a>)</cite>. Therefore, our objective is to construct an LVLM for image-based geo-localization with reasoning capability.</p> </div> <div class="ltx_para" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p2.1.1">Model Architecture.</span> Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F3" title="Figure 3 ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">3</span></a> illustrates the architecture of the proposed model <em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.2">GeoReasoner</em>, which is based on Qwen-VL <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>)</cite>. 
<em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.3">GeoReasoner</em> consists of three modules: <em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.4">Vision Encoder</em>, <em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.5">Vision-Language (VL) Adapter</em> and <em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.6">Pre-trained LLM</em>. Specifically, the <em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.7">Vision Encoder</em> module employs the Vision Transformer (ViT) <cite class="ltx_cite ltx_citemacro_citep">(Dosovitskiy et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib10" title="">2021</a>)</cite> architecture. The input street-view images are resized to a specific resolution and then divided into a set of image patches. To convert the image patches into sequential representations compatible with an LLM, the <em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.8">VL Adapter</em> is introduced. In the VL Adapter, the sequence of visual features is first condensed to a fixed length to address the efficiency challenges posed by long visual feature sequences. Subsequently, the processed visual features are integrated with the LLM using cross-attention mechanisms. Following this, the compressed visual feature sequence and the text sequence are passed to the <em class="ltx_emph ltx_font_italic" id="S3.SS2.p2.1.9">Pre-trained LLM</em> module, which functions as a decoder for generating the answer.</p> </div> <div class="ltx_para" id="S3.SS2.p3"> <p class="ltx_p" id="S3.SS2.p3.1"><span class="ltx_text ltx_font_bold" id="S3.SS2.p3.1.1">Supervised Fine-tuning.</span> The overall model undergoes a two-stage fine-tuning process: reasoning tuning followed by location tuning. In the first stage, our objective is to enhance the model’s reasoning capability by utilizing textual clues paired with street-view images collected from geo-localization games. 
The input street-view image and question, together with the output answer, are formatted as prompts in the following manner:</p> </div> <div class="ltx_para" id="S3.SS2.p4"> <p class="ltx_p ltx_align_center" id="S3.SS2.p4.1"><span class="ltx_text" id="S3.SS2.p4.1.1"><img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="172" id="S3.SS2.p4.1.1.g1" src="x4.png" width="747"/></span></p> </div> <div class="ltx_para" id="S3.SS2.p5"> <p class="ltx_p" id="S3.SS2.p5.1">Here, we can only provide reasoning at the country level due to the granularity of the image-text pairs. Nevertheless, this reasoning procedure is sufficient to facilitate the second stage of location tuning. In the second stage, we combine the prior knowledge of country information with highly locatable, geo-tagged GSV images to infer fine-grained city-level locations. We use a prompt format similar to that of the first stage, but without a reasoning requirement. Both stages fine-tune the pre-trained Qwen-VL with LoRA, which improves the performance of Qwen-VL in both the reasoning and location tuning stages and allows the model to better capture complex relationships within the image-text pairs.</p> </div> <figure class="ltx_figure" id="S3.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="607" id="S3.F4.1.g1" src="x5.png" width="789"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>Locatability examples. Top row: the street views are highly locatable thanks to signboards, architectural styles, and landmarks. 
Bottom row: no visual clues for locating the street views.</figcaption> </figure> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4 </span>Experiments</h2> <div class="ltx_para" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">We conduct a series of experiments to evaluate the effectiveness of the locatability-enhanced geo-localization dataset (Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1" title="4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4.1</span></a>) and the model <em class="ltx_emph ltx_font_italic" id="S4.p1.1.1">GeoReasoner</em> for geo-localization with reasoning (Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2" title="4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4.2</span></a>).</p> </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1 </span>Experiments on Locatability-Enhanced Dataset</h3> <figure class="ltx_figure" id="S4.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="278" id="S4.F5.1.g1" src="extracted/5933322/imgs/fig_boxplot_v2.png" width="527"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span>The relationship between building proportion and the degree of locatability in street views. 
The locatability metric peaks when the building proportion is approximately 0.2.</figcaption> </figure> <section class="ltx_subsubsection" id="S4.SS1.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">4.1.1 </span>Qualitative Comparison</h4> <div class="ltx_para" id="S4.SS1.SSS1.p1"> <p class="ltx_p" id="S4.SS1.SSS1.p1.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F4" title="Figure 4 ‣ 3.2 Geo-localization with Reasoning ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4</span></a> presents examples of the predicted locatability degrees of different street-view images by our locatability quantization network. The top row showcases street views distinguished by prominent localizable attributes. The left image features the Korean language on a signboard, the middle image captures the distinctive <em class="ltx_emph ltx_font_italic" id="S4.SS1.SSS1.p1.1.1">Art Nouveau</em> architectural style commonly found in Switzerland, and the right image shows an art & design museum in India. In contrast, street views in the bottom row display lower locatability degrees. The left image resembles a tunnel, lacking additional discernible information for accurate localization. 
Similarly, the middle image is occluded by a wall, and the right image faces common vegetation found worldwide.</p> </div> <figure class="ltx_figure" id="S4.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="282" id="S4.F6.1.g1" src="extracted/5933322/imgs/fig5_acc_improve.png" width="568"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6: </span>Quantitative comparison of country- and city-level geo-localization accuracy by different models trained on mixed datasets with varying proportions of high-locatability GSV images.</figcaption> </figure> <figure class="ltx_figure" id="S4.F7"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="549" id="S4.F7.1.g1" src="x6.png" width="813"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 7: </span>Examples of LVLM-based approaches in geo-localization with reasoning. Prediction results matching the ground truth are highlighted in <span class="ltx_text" id="S4.F7.4.1" style="color:#00FF00;">green</span>, while reasons offering valid information are marked in <span class="ltx_text" id="S4.F7.5.2" style="color:#0000FF;">blue</span>.</figcaption> </figure> <div class="ltx_para" id="S4.SS1.SSS1.p2"> <p class="ltx_p" id="S4.SS1.SSS1.p2.1">For the proposed locatability metric in Equation <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.E1" title="Equation 1 ‣ 3.1 Locatability-Enhanced Data Curation ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">1</span></a>, we also evaluated the relationship between building proportion and the degree of locatability of street views. 
The results are shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F5" title="Figure 5 ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">5</span></a>. The locatability metric slightly increases as the building proportion ranges from 0 to 0.2, but decreases as the building proportion continues to increase. The results indicate that buildings are not the sole determinant of locatability. As the proportion of buildings increases, the street-view images transition from panoramic to close-up views, leading to reduced information availability and consequently diminishing the degree of locatability. </p> </div> <div class="ltx_para" id="S4.SS1.SSS1.p3"> <p class="ltx_p" id="S4.SS1.SSS1.p3.1">The qualitative analysis indicates the effectiveness of the locatability quantization network in predicting locatability degrees of street-view images. Furthermore, the prediction aligns with human inference knowledge harvested from real geo-localization games, providing the ground truths for fine-tuning the reasoning component in <em class="ltx_emph ltx_font_italic" id="S4.SS1.SSS1.p3.1.1">GeoReasoner</em>.</p> </div> </section> <section class="ltx_subsubsection" id="S4.SS1.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">4.1.2 </span>Quantitative Comparison</h4> <div class="ltx_para" id="S4.SS1.SSS2.p1"> <p class="ltx_p" id="S4.SS1.SSS2.p1.1">We conducted quantitative experiments to investigate the importance of using high-locatability GSV images in training the location component in <em class="ltx_emph ltx_font_italic" id="S4.SS1.SSS2.p1.1.1">GeoReasoner</em>. Various datasets were prepared, featuring different proportions of high-locatability GSV images, ranging from 0% (only low-locatability GSV images) to 100% (only high-locatability GSV images). 
To ensure fairness, each experimental group used the same total of 10K GSV images, with only the proportion of high-locatability images varying. Subsequently, models were trained for each dataset, and their accuracy in country- and city-level geo-localization was evaluated on a randomly sampled set of 1K GSV images. </p> </div> <div class="ltx_para" id="S4.SS1.SSS2.p2"> <p class="ltx_p" id="S4.SS1.SSS2.p2.1">The experimental results are presented in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F6" title="Figure 6 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">6</span></a>. Overall, the results reveal that as the proportion of high-locatability GSV images in the training dataset increases, the performance of the fine-tuned location component improves in both country- and city-level geo-localization. Specifically, the country- and city-level geo-localization accuracy increases from 0.63 & 0.47 for 0% high-locatability GSV images, to 0.72 & 0.51 for 100% high-locatability GSV images. Notably, the experiments only utilize 10K GSV images instead of all the curated 70K high-locatability GSV images due to training complexity. Nevertheless, the results demonstrate that high-locatability GSV images offer more meaningful insights and less extraneous noise, making them highly valuable in the geo-localization task.</p> </div> <figure class="ltx_table" id="S4.T1"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_table">Table 1: </span>Comparison of Accuracy, Recall and F1 scores in country-level and city-level geo-localization. 
* represents the model trained on high-locatability GSV images.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T1.6.6"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T1.6.6.7.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S4.T1.6.6.7.1.1" rowspan="2"><span class="ltx_text" id="S4.T1.6.6.7.1.1.1" style="font-size:80%;">Model</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="3" id="S4.T1.6.6.7.1.2"><span class="ltx_text" id="S4.T1.6.6.7.1.2.1" style="font-size:80%;">Country</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="3" id="S4.T1.6.6.7.1.3"><span class="ltx_text" id="S4.T1.6.6.7.1.3.1" style="font-size:80%;">City</span></th> </tr> <tr class="ltx_tr" id="S4.T1.6.6.6"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.1.1.1.1"> <span class="ltx_text" id="S4.T1.1.1.1.1.1" style="font-size:80%;">Accuracy</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.1.1.1.1.m1.1"><semantics id="S4.T1.1.1.1.1.m1.1a"><mo id="S4.T1.1.1.1.1.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.1.1.1.1.m1.1b"><ci id="S4.T1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.2.2.2.2"> <span class="ltx_text" id="S4.T1.2.2.2.2.1" style="font-size:80%;">Recall</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.2.2.2.2.m1.1"><semantics id="S4.T1.2.2.2.2.m1.1a"><mo id="S4.T1.2.2.2.2.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml 
encoding="MathML-Content" id="S4.T1.2.2.2.2.m1.1b"><ci id="S4.T1.2.2.2.2.m1.1.1.cmml" xref="S4.T1.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.3.3.3.3"> <span class="ltx_text" id="S4.T1.3.3.3.3.1" style="font-size:80%;">F1</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.3.3.3.3.m1.1"><semantics id="S4.T1.3.3.3.3.m1.1a"><mo id="S4.T1.3.3.3.3.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.3.3.3.3.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.3.3.3.3.m1.1b"><ci id="S4.T1.3.3.3.3.m1.1.1.cmml" xref="S4.T1.3.3.3.3.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.3.3.3.3.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.3.3.3.3.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.4.4.4.4"> <span class="ltx_text" id="S4.T1.4.4.4.4.1" style="font-size:80%;">Accuracy</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.4.4.4.4.m1.1"><semantics id="S4.T1.4.4.4.4.m1.1a"><mo id="S4.T1.4.4.4.4.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.4.4.4.4.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.4.4.4.4.m1.1b"><ci id="S4.T1.4.4.4.4.m1.1.1.cmml" xref="S4.T1.4.4.4.4.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.4.4.4.4.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.4.4.4.4.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.5.5.5.5"> <span class="ltx_text" id="S4.T1.5.5.5.5.1" style="font-size:80%;">Recall</span><math alttext="\uparrow" class="ltx_Math" 
display="inline" id="S4.T1.5.5.5.5.m1.1"><semantics id="S4.T1.5.5.5.5.m1.1a"><mo id="S4.T1.5.5.5.5.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.5.5.5.5.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.5.5.5.5.m1.1b"><ci id="S4.T1.5.5.5.5.m1.1.1.cmml" xref="S4.T1.5.5.5.5.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.5.5.5.5.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.5.5.5.5.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.6.6.6.6"> <span class="ltx_text" id="S4.T1.6.6.6.6.1" style="font-size:80%;">F1</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.6.6.6.6.m1.1"><semantics id="S4.T1.6.6.6.6.m1.1a"><mo id="S4.T1.6.6.6.6.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.6.6.6.6.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.6.6.6.6.m1.1b"><ci id="S4.T1.6.6.6.6.m1.1.1.cmml" xref="S4.T1.6.6.6.6.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.6.6.6.6.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.6.6.6.6.m1.1d">↑</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T1.6.6.8.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T1.6.6.8.1.1"> <span class="ltx_text" id="S4.T1.6.6.8.1.1.1" style="font-size:80%;">StreetCLIP </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T1.6.6.8.1.1.2.1" style="font-size:80%;">(</span>Haas et al.<span class="ltx_text" id="S4.T1.6.6.8.1.1.3.2.1.1" style="font-size:80%;">, </span><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib11" title="">2023</a><span class="ltx_text" id="S4.T1.6.6.8.1.1.4.3" style="font-size:80%;">)</span></cite> </th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.2"><span class="ltx_text" 
id="S4.T1.6.6.8.1.2.1" style="font-size:80%;">0.7943</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.3"><span class="ltx_text" id="S4.T1.6.6.8.1.3.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.4"><span class="ltx_text" id="S4.T1.6.6.8.1.4.1" style="font-size:80%;">0.8854</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.5"><span class="ltx_text" id="S4.T1.6.6.8.1.5.1" style="font-size:80%;">0.7457</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.6"><span class="ltx_text" id="S4.T1.6.6.8.1.6.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.7"><span class="ltx_text" id="S4.T1.6.6.8.1.7.1" style="font-size:80%;">0.8543</span></td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.9.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.9.2.1"> <span class="ltx_text" id="S4.T1.6.6.9.2.1.1" style="font-size:80%;">LLaVA </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T1.6.6.9.2.1.2.1" style="font-size:80%;">(</span>Liu et al.<span class="ltx_text" id="S4.T1.6.6.9.2.1.3.2.1.1" style="font-size:80%;">, </span><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib17" title="">2024</a><span class="ltx_text" id="S4.T1.6.6.9.2.1.4.3" style="font-size:80%;">)</span></cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.2"><span class="ltx_text" id="S4.T1.6.6.9.2.2.1" style="font-size:80%;">0.4029</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.3"><span class="ltx_text" id="S4.T1.6.6.9.2.3.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.4"><span class="ltx_text" id="S4.T1.6.6.9.2.4.1" style="font-size:80%;">0.5744</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.5"><span class="ltx_text" id="S4.T1.6.6.9.2.5.1" 
style="font-size:80%;">0.2400</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.6"><span class="ltx_text" id="S4.T1.6.6.9.2.6.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.7"><span class="ltx_text" id="S4.T1.6.6.9.2.7.1" style="font-size:80%;">0.3871</span></td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.10.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.10.3.1"> <span class="ltx_text" id="S4.T1.6.6.10.3.1.1" style="font-size:80%;">Qwen-VL (Qwen-7B) </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T1.6.6.10.3.1.2.1" style="font-size:80%;">(</span>Bai et al.<span class="ltx_text" id="S4.T1.6.6.10.3.1.3.2.1.1" style="font-size:80%;">, </span><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a><span class="ltx_text" id="S4.T1.6.6.10.3.1.4.3" style="font-size:80%;">)</span></cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.2"><span class="ltx_text" id="S4.T1.6.6.10.3.2.1" style="font-size:80%;">0.5829</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.3"><span class="ltx_text" id="S4.T1.6.6.10.3.3.1" style="font-size:80%;">0.95</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.4"><span class="ltx_text" id="S4.T1.6.6.10.3.4.1" style="font-size:80%;">0.7225</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.5"><span class="ltx_text" id="S4.T1.6.6.10.3.5.1" style="font-size:80%;">0.3743</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.6"><span class="ltx_text" id="S4.T1.6.6.10.3.6.1" style="font-size:80%;">0.89</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.7"><span class="ltx_text" id="S4.T1.6.6.10.3.7.1" style="font-size:80%;">0.5270</span></td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.11.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.11.4.1"> <span class="ltx_text" id="S4.T1.6.6.11.4.1.1" 
style="font-size:80%;">GPT-4V </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T1.6.6.11.4.1.2.1" style="font-size:80%;">(</span>Achiam et al.<span class="ltx_text" id="S4.T1.6.6.11.4.1.3.2.1.1" style="font-size:80%;">, </span><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib1" title="">2023</a><span class="ltx_text" id="S4.T1.6.6.11.4.1.4.3" style="font-size:80%;">)</span></cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.2"><span class="ltx_text" id="S4.T1.6.6.11.4.2.1" style="font-size:80%;">0.8917</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.3"><span class="ltx_text" id="S4.T1.6.6.11.4.3.1" style="font-size:80%;">0.34</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.4"><span class="ltx_text" id="S4.T1.6.6.11.4.4.1" style="font-size:80%;">0.4923</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.5"><span class="ltx_text" id="S4.T1.6.6.11.4.5.1" style="font-size:80%;">0.5083</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.6"><span class="ltx_text" id="S4.T1.6.6.11.4.6.1" style="font-size:80%;">0.31</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.7"><span class="ltx_text" id="S4.T1.6.6.11.4.7.1" style="font-size:80%;">0.3851</span></td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.12.5"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.12.5.1"> <span class="ltx_text" id="S4.T1.6.6.12.5.1.1" style="font-size:80%;">ViT* </span><cite class="ltx_cite ltx_citemacro_citep"><span class="ltx_text" id="S4.T1.6.6.12.5.1.2.1" style="font-size:80%;">(</span>Dosovitskiy et al.<span class="ltx_text" id="S4.T1.6.6.12.5.1.3.2.1.1" style="font-size:80%;">, </span><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib10" title="">2021</a><span class="ltx_text" id="S4.T1.6.6.12.5.1.4.3" style="font-size:80%;">)</span></cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.2"><span 
class="ltx_text" id="S4.T1.6.6.12.5.2.1" style="font-size:80%;">0.7100</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.3"><span class="ltx_text" id="S4.T1.6.6.12.5.3.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.4"><span class="ltx_text" id="S4.T1.6.6.12.5.4.1" style="font-size:80%;">0.8304</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.5"><span class="ltx_text" id="S4.T1.6.6.12.5.5.1" style="font-size:80%;">0.6762</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.6"><span class="ltx_text" id="S4.T1.6.6.12.5.6.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.7"><span class="ltx_text" id="S4.T1.6.6.12.5.7.1" style="font-size:80%;">0.8068</span></td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.13.6"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T1.6.6.13.6.1"><span class="ltx_text" id="S4.T1.6.6.13.6.1.1" style="font-size:80%;">GeoReasoner*</span></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.2"><span class="ltx_text" id="S4.T1.6.6.13.6.2.1" style="font-size:80%;">0.8237</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.3"><span class="ltx_text" id="S4.T1.6.6.13.6.3.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.4"><span class="ltx_text ltx_font_bold" id="S4.T1.6.6.13.6.4.1" style="font-size:80%;">0.9033</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.5"><span class="ltx_text" id="S4.T1.6.6.13.6.5.1" style="font-size:80%;">0.7521</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.6"><span class="ltx_text" id="S4.T1.6.6.13.6.6.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.7"><span class="ltx_text ltx_font_bold" id="S4.T1.6.6.13.6.7.1" 
style="font-size:80%;">0.8585</span></td> </tr> </tbody> </table> </figure> </section> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.2 </span>Experiments on Geo-localization with Reasoning</h3> <section class="ltx_subsubsection" id="S4.SS2.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">4.2.1 </span>Qualitative Comparison with SOTA</h4> <div class="ltx_para" id="S4.SS2.SSS1.p1"> <p class="ltx_p" id="S4.SS2.SSS1.p1.1">To assess the efficacy of <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p1.1.1">GeoReasoner</em> in terms of geo-localization with reasoning, we conduct a qualitative comparison with state-of-the-art LVLM-based approaches, including LLaVA <cite class="ltx_cite ltx_citemacro_citep">(Liu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib17" title="">2024</a>)</cite>, Qwen-VL (Qwen-7B) <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>)</cite>, and GPT-4V <cite class="ltx_cite ltx_citemacro_citep">(Achiam et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib1" title="">2023</a>)</cite>. In the experimental phase, we presented the same input street-view images, reasoning process, and result formats to these models. Specifically, a consistent prompt is used, as below:</p> </div> <div class="ltx_para" id="S4.SS2.SSS1.p2"> <p class="ltx_p" id="S4.SS2.SSS1.p2.1"><em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p2.1.1">According to the content of the image, please think step by step and deduce in which country and city the image is most likely located and offer possible explanations. 
Output in JSON format, e.g., <span class="ltx_text ltx_font_upright" id="S4.SS2.SSS1.p2.1.1.1">{</span>‘country’: ‘’, ‘city’: ‘’, ‘reasons’:‘’<span class="ltx_text ltx_font_upright" id="S4.SS2.SSS1.p2.1.1.2">}</span></em>.</p> </div> <div class="ltx_para" id="S4.SS2.SSS1.p3"> <p class="ltx_p" id="S4.SS2.SSS1.p3.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">7</span></a> illustrates the inference results of counterpart models and <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p3.1.1">GeoReasoner</em> on three diverse street views from different countries and cities—namely, Singapore-Singapore (top), United States-Las Vegas (middle) and China-Lhasa (bottom). Overall, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p3.1.2">GeoReasoner</em> not only outperforms existing models in the accuracy of country or city-level predictions but also provides coherent explanations with insightful reasoning for the inference results.</p> </div> <div class="ltx_para" id="S4.SS2.SSS1.p4"> <p class="ltx_p" id="S4.SS2.SSS1.p4.1">In Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">7</span></a> (top), <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p4.1.1">GeoReasoner</em> identifies the word ‘COMFORT’ on the taxi in the image. 
Drawing from prior knowledge, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p4.1.2">‘the ComfortDelGro taxi is a distinctive symbol of Singapore’s public transportation system’</em> in the text-image pairs, the model deduces that the area is likely to be in <span class="ltx_text ltx_font_italic" id="S4.SS2.SSS1.p4.1.3">Singapore</span>. GPT-4V predicts the same geo-location with accurate reasoning, yet the other two models fail: LLaVA does not recognize the taxi, and Qwen-VL makes an incorrect inference about the city.</p> </div> <div class="ltx_para" id="S4.SS2.SSS1.p5"> <p class="ltx_p" id="S4.SS2.SSS1.p5.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">7</span></a> (middle) presents a scene of the Las Vegas Strip. A conspicuous ‘NEW YORK’ sign is visible in the upper-left corner of the image. This sign misleads LLaVA into a reasoning error. Although Qwen-VL generates accurate predictions of <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p5.1.1">Las Vegas-United States</em>, the most essential factor, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p5.1.2">i.e.</em>, ‘Las Vegas Strip’, is not considered in the reasoning process. 
In contrast, both <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p5.1.3">GeoReasoner</em> and GPT-4V provide the correct geo-location along with accurate inference.</p> </div> <div class="ltx_para" id="S4.SS2.SSS1.p6"> <p class="ltx_p" id="S4.SS2.SSS1.p6.1">Based on the depiction of Chinese characters and traditional clothing in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">7</span></a> (bottom), all models make accurate predictions regarding the country, identifying it as <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p6.1.1">China</em>. However, LLaVA makes an incorrect prediction of the city, specifying <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p6.1.2">Beijing</em>. In contrast, the other models successfully predict the city as <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS1.p6.1.3">Lhasa</em>, providing sensible and justifiable reasons for their inferences.</p> </div> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">4.2.2 </span>Quantitative Comparison with SOTA</h4> <div class="ltx_para" id="S4.SS2.SSS2.p1"> <p class="ltx_p" id="S4.SS2.SSS2.p1.1">We further conduct quantitative experiments to compare with counterpart LVLMs. In addition, we choose StreetCLIP <cite class="ltx_cite ltx_citemacro_citep">(Haas et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib11" title="">2023</a>)</cite> as the state-of-the-art classification-based approach and omit retrieval-based approaches relying on a geo-tagged image gallery that is not available. 
Note that LVLM-based approaches are not guaranteed to return a relevant answer for every query. We therefore include the <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p1.1.1">Recall</em> rate to measure the proportion of effective answers among each model’s responses. When calculating the <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p1.1.2">Accuracy</em> rate, only these effective answers are taken into account. We additionally compute <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p1.1.3">F1</em> values, which combine the <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p1.1.4">Accuracy</em> and <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p1.1.5">Recall</em> metrics as their harmonic mean.</p> </div> <div class="ltx_para" id="S4.SS2.SSS2.p2"> <p class="ltx_p" id="S4.SS2.SSS2.p2.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T1" title="Table 1 ‣ 4.1.2 Quantitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">1</span></a> presents the prediction results by the counterparts and <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p2.1.1">GeoReasoner</em>. Overall, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p2.1.2">GeoReasoner</em> achieves the highest F1 value among all the counterparts, with the largest margins over the LVLM-based approaches. Taking the best-performing LVLM counterpart, Qwen-VL, as an example, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p2.1.3">GeoReasoner</em> outperforms it by 25.02% on country-level geo-localization and 38.61% on city-level geo-localization, in terms of F1 value. Surprisingly, the recall of GPT-4V on the geo-localization task was notably low. 
Most of the responses were mainly: <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p2.1.4">‘I’m sorry, I can’t provide assistance with that request.’</em> or <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p2.1.5">‘I’m sorry, but I am unable to provide the exact location, such as the country and city, for the image you have provided. My capabilities do not include analyzing specific details to determine the geographical location of the image content.’</em></p> </div> <figure class="ltx_table" id="S4.T2"> <figcaption class="ltx_caption" style="font-size:70%;"><span class="ltx_tag ltx_tag_table">Table 2: </span>Results of the ablation experiments using baseline Qwen-VL (Qwen-7B), GeoReasoner w/o location tuning, GeoReasoner w/o reasoning tuning, and the full GeoReasoner models.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T2.8.8"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T2.8.8.9.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S4.T2.8.8.9.1.1" rowspan="3"><span class="ltx_text" id="S4.T2.8.8.9.1.1.1" style="font-size:80%;">Model</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="2" id="S4.T2.8.8.9.1.2"><span class="ltx_text" id="S4.T2.8.8.9.1.2.1" style="font-size:80%;">Training</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="6" id="S4.T2.8.8.9.1.3"><span class="ltx_text" id="S4.T2.8.8.9.1.3.1" style="font-size:80%;">Performance</span></th> </tr> <tr class="ltx_tr" id="S4.T2.8.8.10.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.8.8.10.2.1" rowspan="2"><span class="ltx_text" id="S4.T2.8.8.10.2.1.1" style="font-size:80%;">Reasoning</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.8.8.10.2.2" rowspan="2"><span class="ltx_text" id="S4.T2.8.8.10.2.2.1" style="font-size:80%;">Location</span></th> <th class="ltx_td 
ltx_align_center ltx_th ltx_th_column" colspan="3" id="S4.T2.8.8.10.2.3"><span class="ltx_text" id="S4.T2.8.8.10.2.3.1" style="font-size:80%;">Country</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" colspan="3" id="S4.T2.8.8.10.2.4"><span class="ltx_text" id="S4.T2.8.8.10.2.4.1" style="font-size:80%;">City</span></th> </tr> <tr class="ltx_tr" id="S4.T2.6.6.6"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.1.1.1.1"> <span class="ltx_text" id="S4.T2.1.1.1.1.1" style="font-size:80%;">Accuracy</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.1.1.1.1.m1.1"><semantics id="S4.T2.1.1.1.1.m1.1a"><mo id="S4.T2.1.1.1.1.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.1.1.1.1.m1.1b"><ci id="S4.T2.1.1.1.1.m1.1.1.cmml" xref="S4.T2.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.2.2.2.2"> <span class="ltx_text" id="S4.T2.2.2.2.2.1" style="font-size:80%;">Recall</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.2.2.2.2.m1.1"><semantics id="S4.T2.2.2.2.2.m1.1a"><mo id="S4.T2.2.2.2.2.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.2.2.2.2.m1.1b"><ci id="S4.T2.2.2.2.2.m1.1.1.cmml" xref="S4.T2.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.3.3.3.3"> <span class="ltx_text" id="S4.T2.3.3.3.3.1" 
style="font-size:80%;">F1</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.3.3.3.3.m1.1"><semantics id="S4.T2.3.3.3.3.m1.1a"><mo id="S4.T2.3.3.3.3.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.3.3.3.3.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.3.3.3.3.m1.1b"><ci id="S4.T2.3.3.3.3.m1.1.1.cmml" xref="S4.T2.3.3.3.3.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.3.3.3.3.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.3.3.3.3.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.4.4.4.4"> <span class="ltx_text" id="S4.T2.4.4.4.4.1" style="font-size:80%;">Accuracy</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.4.4.4.4.m1.1"><semantics id="S4.T2.4.4.4.4.m1.1a"><mo id="S4.T2.4.4.4.4.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.4.4.4.4.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.4.4.4.4.m1.1b"><ci id="S4.T2.4.4.4.4.m1.1.1.cmml" xref="S4.T2.4.4.4.4.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.4.4.4.4.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.4.4.4.4.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.5.5.5.5"> <span class="ltx_text" id="S4.T2.5.5.5.5.1" style="font-size:80%;">Recall</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.5.5.5.5.m1.1"><semantics id="S4.T2.5.5.5.5.m1.1a"><mo id="S4.T2.5.5.5.5.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.5.5.5.5.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.5.5.5.5.m1.1b"><ci id="S4.T2.5.5.5.5.m1.1.1.cmml" xref="S4.T2.5.5.5.5.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.5.5.5.5.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" 
id="S4.T2.5.5.5.5.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.6.6.6.6"> <span class="ltx_text" id="S4.T2.6.6.6.6.1" style="font-size:80%;">F1</span><math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.6.6.6.6.m1.1"><semantics id="S4.T2.6.6.6.6.m1.1a"><mo id="S4.T2.6.6.6.6.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.6.6.6.6.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.6.6.6.6.m1.1b"><ci id="S4.T2.6.6.6.6.m1.1.1.cmml" xref="S4.T2.6.6.6.6.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.6.6.6.6.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.6.6.6.6.m1.1d">↑</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T2.8.8.11.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T2.8.8.11.1.1"><span class="ltx_text" id="S4.T2.8.8.11.1.1.1" style="font-size:80%;">Qwen-VL (Qwen-7B)</span></th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.2"><span class="ltx_text" id="S4.T2.8.8.11.1.2.1" style="font-size:80%;">-</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.3"><span class="ltx_text" id="S4.T2.8.8.11.1.3.1" style="font-size:80%;">-</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.4"><span class="ltx_text" id="S4.T2.8.8.11.1.4.1" style="font-size:80%;">0.5829</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.5"><span class="ltx_text" id="S4.T2.8.8.11.1.5.1" style="font-size:80%;">0.95</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.6"><span class="ltx_text" id="S4.T2.8.8.11.1.6.1" style="font-size:80%;">0.7225</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.7"><span class="ltx_text" id="S4.T2.8.8.11.1.7.1" 
style="font-size:80%;">0.3743</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.8"><span class="ltx_text" id="S4.T2.8.8.11.1.8.1" style="font-size:80%;">0.89</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.9"><span class="ltx_text" id="S4.T2.8.8.11.1.9.1" style="font-size:80%;">0.5270</span></td> </tr> <tr class="ltx_tr" id="S4.T2.7.7.7"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T2.7.7.7.2"><span class="ltx_text" id="S4.T2.7.7.7.2.1" style="font-size:80%;">GeoReasoner w/o location tuning</span></th> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.3"><span class="ltx_text" id="S4.T2.7.7.7.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T2.7.7.7.1.m1.1"><semantics id="S4.T2.7.7.7.1.m1.1a"><mo id="S4.T2.7.7.7.1.m1.1.1" mathsize="80%" xref="S4.T2.7.7.7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T2.7.7.7.1.m1.1b"><times id="S4.T2.7.7.7.1.m1.1.1.cmml" xref="S4.T2.7.7.7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.7.7.7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T2.7.7.7.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.4"><span class="ltx_text" id="S4.T2.7.7.7.4.1" style="font-size:80%;">0.6971</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.5"><span class="ltx_text" id="S4.T2.7.7.7.5.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.6"><span class="ltx_text" id="S4.T2.7.7.7.6.1" style="font-size:80%;">0.8215</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.7"><span class="ltx_text" id="S4.T2.7.7.7.7.1" style="font-size:80%;">0.4114</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.8"><span class="ltx_text" id="S4.T2.7.7.7.8.1" 
style="font-size:80%;">0.99</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.9"><span class="ltx_text" id="S4.T2.7.7.7.9.1" style="font-size:80%;">0.5813</span></td> </tr> <tr class="ltx_tr" id="S4.T2.8.8.8"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T2.8.8.8.2"><span class="ltx_text" id="S4.T2.8.8.8.2.1" style="font-size:80%;">GeoReasoner w/o reasoning tuning</span></th> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T2.8.8.8.1.m1.1"><semantics id="S4.T2.8.8.8.1.m1.1a"><mo id="S4.T2.8.8.8.1.m1.1.1" mathsize="80%" xref="S4.T2.8.8.8.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T2.8.8.8.1.m1.1b"><times id="S4.T2.8.8.8.1.m1.1.1.cmml" xref="S4.T2.8.8.8.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.8.8.8.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T2.8.8.8.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.3"><span class="ltx_text" id="S4.T2.8.8.8.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.4"><span class="ltx_text" id="S4.T2.8.8.8.4.1" style="font-size:80%;">0.7803</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.5"><span class="ltx_text" id="S4.T2.8.8.8.5.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.6"><span class="ltx_text" id="S4.T2.8.8.8.6.1" style="font-size:80%;">0.8766</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.7"><span class="ltx_text" id="S4.T2.8.8.8.7.1" style="font-size:80%;">0.7029</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.8"><span class="ltx_text" id="S4.T2.8.8.8.8.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.9"><span class="ltx_text" id="S4.T2.8.8.8.9.1" style="font-size:80%;">0.8255</span></td> </tr> 
<tr class="ltx_tr" id="S4.T2.8.8.12.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T2.8.8.12.2.1"><span class="ltx_text" id="S4.T2.8.8.12.2.1.1" style="font-size:80%;">GeoReasoner</span></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.2"><span class="ltx_text" id="S4.T2.8.8.12.2.2.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.3"><span class="ltx_text" id="S4.T2.8.8.12.2.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.4"><span class="ltx_text ltx_font_bold" id="S4.T2.8.8.12.2.4.1" style="font-size:80%;">0.8237</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.5"><span class="ltx_text" id="S4.T2.8.8.12.2.5.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.6"><span class="ltx_text ltx_font_bold" id="S4.T2.8.8.12.2.6.1" style="font-size:80%;">0.9033</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.7"><span class="ltx_text ltx_font_bold" id="S4.T2.8.8.12.2.7.1" style="font-size:80%;">0.7521</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.8"><span class="ltx_text" id="S4.T2.8.8.12.2.8.1" style="font-size:80%;">1.00</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.9"><span class="ltx_text ltx_font_bold" id="S4.T2.8.8.12.2.9.1" style="font-size:80%;">0.8584</span></td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="S4.SS2.SSS2.p3"> <p class="ltx_p" id="S4.SS2.SSS2.p3.1">We speculate that GPT-4V has been subjected to extensive safety and privacy measures, which may explain its reluctance or outright refusal to perform geo-localization.</p> </div> <div class="ltx_para" id="S4.SS2.SSS2.p4"> <p class="ltx_p" id="S4.SS2.SSS2.p4.1">Compared with StreetCLIP, which is 
specialized for geo-localization, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p4.1.1">GeoReasoner</em> shows only a slight advantage. Nevertheless, it is important to note that StreetCLIP was trained on a significantly larger dataset of over 1.1 million street-view images, while our <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p4.1.2">GeoReasoner</em> was trained with only 70K street views. Compared with a ViT trained on the same data, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p4.1.3">GeoReasoner</em> still exhibits superior geo-localization capability. Moreover, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS2.p4.1.4">GeoReasoner</em> offers reasoning capability, providing added value for various downstream tasks.</p> </div> <figure class="ltx_table" id="S4.T3"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_table">Table 3: </span>Comparison results on the Im2GPS dataset. The top five rows are taken from the results reported in the original papers, while the last four rows are from retesting on the filtered Im2GPS dataset, which includes only highly locatable data.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T3.13.13"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T3.13.13.14.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_tt" id="S4.T3.13.13.14.1.1" rowspan="2"><span class="ltx_text" id="S4.T3.13.13.14.1.1.1" style="font-size:80%;">Model</span></th> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="S4.T3.13.13.14.1.2"><span class="ltx_text" id="S4.T3.13.13.14.1.2.1" style="font-size:80%;">Dataset w/ Filter</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T3.13.13.14.1.3"><span class="ltx_text" id="S4.T3.13.13.14.1.3.1" style="font-size:80%;">Street</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T3.13.13.14.1.4"><span class="ltx_text" id="S4.T3.13.13.14.1.4.1" 
style="font-size:80%;">City</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T3.13.13.14.1.5"><span class="ltx_text" id="S4.T3.13.13.14.1.5.1" style="font-size:80%;">Country</span></td> </tr> <tr class="ltx_tr" id="S4.T3.13.13.15.2"> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.1"><span class="ltx_text" id="S4.T3.13.13.15.2.1.1" style="font-size:80%;">Train</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.2"><span class="ltx_text" id="S4.T3.13.13.15.2.2.1" style="font-size:80%;">Test</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.3"><span class="ltx_text" id="S4.T3.13.13.15.2.3.1" style="font-size:80%;">1km</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.4"><span class="ltx_text" id="S4.T3.13.13.15.2.4.1" style="font-size:80%;">25km</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.5"><span class="ltx_text" id="S4.T3.13.13.15.2.5.1" style="font-size:80%;">750km</span></td> </tr> <tr class="ltx_tr" id="S4.T3.2.2.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T3.2.2.2.3"><span class="ltx_text" id="S4.T3.2.2.2.3.1" style="font-size:80%;">PlaNet</span></th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.1.1.1.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.1.1.1.1.m1.1"><semantics id="S4.T3.1.1.1.1.m1.1a"><mo id="S4.T3.1.1.1.1.m1.1.1" mathsize="80%" xref="S4.T3.1.1.1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.1.1.1.1.m1.1b"><times id="S4.T3.1.1.1.1.m1.1.1.cmml" xref="S4.T3.1.1.1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.1.1.1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.1.1.1.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.2.2.2.2.m1.1"><semantics 
id="S4.T3.2.2.2.2.m1.1a"><mo id="S4.T3.2.2.2.2.m1.1.1" mathsize="80%" xref="S4.T3.2.2.2.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.2.2.2.2.m1.1b"><times id="S4.T3.2.2.2.2.m1.1.1.cmml" xref="S4.T3.2.2.2.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.2.2.2.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.2.2.2.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.4"><span class="ltx_text" id="S4.T3.2.2.2.4.1" style="font-size:80%;">0.08</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.5"><span class="ltx_text" id="S4.T3.2.2.2.5.1" style="font-size:80%;">0.25</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.6"><span class="ltx_text" id="S4.T3.2.2.2.6.1" style="font-size:80%;">0.54</span></td> </tr> <tr class="ltx_tr" id="S4.T3.4.4.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.4.4.4.3"><span class="ltx_text" id="S4.T3.4.4.4.3.1" style="font-size:80%;">CPlaNet</span></th> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.3.3.3.1.m1.1"><semantics id="S4.T3.3.3.3.1.m1.1a"><mo id="S4.T3.3.3.3.1.m1.1.1" mathsize="80%" xref="S4.T3.3.3.3.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.3.3.3.1.m1.1b"><times id="S4.T3.3.3.3.1.m1.1.1.cmml" xref="S4.T3.3.3.3.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.3.3.3.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.3.3.3.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.4.4.4.2.m1.1"><semantics id="S4.T3.4.4.4.2.m1.1a"><mo id="S4.T3.4.4.4.2.m1.1.1" mathsize="80%" xref="S4.T3.4.4.4.2.m1.1.1.cmml">×</mo><annotation-xml 
encoding="MathML-Content" id="S4.T3.4.4.4.2.m1.1b"><times id="S4.T3.4.4.4.2.m1.1.1.cmml" xref="S4.T3.4.4.4.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.4.4.4.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.4.4.4.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.4"><span class="ltx_text" id="S4.T3.4.4.4.4.1" style="font-size:80%;">0.17</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.5"><span class="ltx_text" id="S4.T3.4.4.4.5.1" style="font-size:80%;">0.37</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.6"><span class="ltx_text" id="S4.T3.4.4.4.6.1" style="font-size:80%;">0.62</span></td> </tr> <tr class="ltx_tr" id="S4.T3.6.6.6"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.6.6.6.3"><span class="ltx_text" id="S4.T3.6.6.6.3.1" style="font-size:80%;">ISNs</span></th> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.5.5.5.1.m1.1"><semantics id="S4.T3.5.5.5.1.m1.1a"><mo id="S4.T3.5.5.5.1.m1.1.1" mathsize="80%" xref="S4.T3.5.5.5.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.5.5.5.1.m1.1b"><times id="S4.T3.5.5.5.1.m1.1.1.cmml" xref="S4.T3.5.5.5.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.5.5.5.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.5.5.5.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.6.6.6.2.m1.1"><semantics id="S4.T3.6.6.6.2.m1.1a"><mo id="S4.T3.6.6.6.2.m1.1.1" mathsize="80%" xref="S4.T3.6.6.6.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.6.6.6.2.m1.1b"><times id="S4.T3.6.6.6.2.m1.1.1.cmml" xref="S4.T3.6.6.6.2.m1.1.1"></times></annotation-xml><annotation 
encoding="application/x-tex" id="S4.T3.6.6.6.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.6.6.6.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.4"><span class="ltx_text" id="S4.T3.6.6.6.4.1" style="font-size:80%;">0.17</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.5"><span class="ltx_text" id="S4.T3.6.6.6.5.1" style="font-size:80%;">0.43</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.6"><span class="ltx_text" id="S4.T3.6.6.6.6.1" style="font-size:80%;">0.67</span></td> </tr> <tr class="ltx_tr" id="S4.T3.8.8.8"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.8.8.8.3"><span class="ltx_text" id="S4.T3.8.8.8.3.1" style="font-size:80%;">Translocator</span></th> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.7.7.7.1.m1.1"><semantics id="S4.T3.7.7.7.1.m1.1a"><mo id="S4.T3.7.7.7.1.m1.1.1" mathsize="80%" xref="S4.T3.7.7.7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.7.7.7.1.m1.1b"><times id="S4.T3.7.7.7.1.m1.1.1.cmml" xref="S4.T3.7.7.7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.7.7.7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.7.7.7.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.8.8.8.2.m1.1"><semantics id="S4.T3.8.8.8.2.m1.1a"><mo id="S4.T3.8.8.8.2.m1.1.1" mathsize="80%" xref="S4.T3.8.8.8.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.8.8.8.2.m1.1b"><times id="S4.T3.8.8.8.2.m1.1.1.cmml" xref="S4.T3.8.8.8.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.8.8.8.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" 
id="S4.T3.8.8.8.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.4"><span class="ltx_text" id="S4.T3.8.8.8.4.1" style="font-size:80%;">0.20</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.5"><span class="ltx_text" id="S4.T3.8.8.8.5.1" style="font-size:80%;">0.48</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.6"><span class="ltx_text" id="S4.T3.8.8.8.6.1" style="font-size:80%;">0.76</span></td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.10"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.10.10.10.3"><span class="ltx_text" id="S4.T3.10.10.10.3.1" style="font-size:80%;">GeoDecoder</span></th> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.9.9.9.1.m1.1"><semantics id="S4.T3.9.9.9.1.m1.1a"><mo id="S4.T3.9.9.9.1.m1.1.1" mathsize="80%" xref="S4.T3.9.9.9.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.9.9.9.1.m1.1b"><times id="S4.T3.9.9.9.1.m1.1.1.cmml" xref="S4.T3.9.9.9.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.9.9.9.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.9.9.9.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.10.10.10.2.m1.1"><semantics id="S4.T3.10.10.10.2.m1.1a"><mo id="S4.T3.10.10.10.2.m1.1.1" mathsize="80%" xref="S4.T3.10.10.10.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.10.10.10.2.m1.1b"><times id="S4.T3.10.10.10.2.m1.1.1.cmml" xref="S4.T3.10.10.10.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.10.10.10.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.10.10.10.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" 
id="S4.T3.10.10.10.4"><span class="ltx_text" id="S4.T3.10.10.10.4.1" style="font-size:80%;">0.22</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.5"><span class="ltx_text" id="S4.T3.10.10.10.5.1" style="font-size:80%;">0.50</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.6"><span class="ltx_text" id="S4.T3.10.10.10.6.1" style="font-size:80%;">0.80</span></td> </tr> <tr class="ltx_tr" id="S4.T3.11.11.11"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T3.11.11.11.2"><span class="ltx_text" id="S4.T3.11.11.11.2.1" style="font-size:80%;">ISNs</span></th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.11.11.11.1.m1.1"><semantics id="S4.T3.11.11.11.1.m1.1a"><mo id="S4.T3.11.11.11.1.m1.1.1" mathsize="80%" xref="S4.T3.11.11.11.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.11.11.11.1.m1.1b"><times id="S4.T3.11.11.11.1.m1.1.1.cmml" xref="S4.T3.11.11.11.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.11.11.11.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.11.11.11.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.3"><span class="ltx_text" id="S4.T3.11.11.11.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.4"><span class="ltx_text" id="S4.T3.11.11.11.4.1" style="font-size:80%;">0.25</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.5"><span class="ltx_text" id="S4.T3.11.11.11.5.1" style="font-size:80%;">0.43</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.6"><span class="ltx_text" id="S4.T3.11.11.11.6.1" style="font-size:80%;">0.78</span></td> </tr> <tr class="ltx_tr" id="S4.T3.12.12.12"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" 
id="S4.T3.12.12.12.2"><span class="ltx_text" id="S4.T3.12.12.12.2.1" style="font-size:80%;">GeoCLIP</span></th> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.12.12.12.1.m1.1"><semantics id="S4.T3.12.12.12.1.m1.1a"><mo id="S4.T3.12.12.12.1.m1.1.1" mathsize="80%" xref="S4.T3.12.12.12.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.12.12.12.1.m1.1b"><times id="S4.T3.12.12.12.1.m1.1.1.cmml" xref="S4.T3.12.12.12.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.12.12.12.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.12.12.12.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.3"><span class="ltx_text" id="S4.T3.12.12.12.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.4"><span class="ltx_text" id="S4.T3.12.12.12.4.1" style="font-size:80%;">0.25</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.5"><span class="ltx_text" id="S4.T3.12.12.12.5.1" style="font-size:80%;">0.49</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.6"><span class="ltx_text" id="S4.T3.12.12.12.6.1" style="font-size:80%;">0.87</span></td> </tr> <tr class="ltx_tr" id="S4.T3.13.13.13"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.13.13.13.2"><span class="ltx_text" id="S4.T3.13.13.13.2.1" style="font-size:80%;">GeoReasoner</span></th> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.13.13.13.1.m1.1"><semantics id="S4.T3.13.13.13.1.m1.1a"><mo id="S4.T3.13.13.13.1.m1.1.1" mathsize="80%" xref="S4.T3.13.13.13.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.13.13.13.1.m1.1b"><times id="S4.T3.13.13.13.1.m1.1.1.cmml" xref="S4.T3.13.13.13.1.m1.1.1"></times></annotation-xml><annotation 
encoding="application/x-tex" id="S4.T3.13.13.13.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.13.13.13.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.3"><span class="ltx_text" id="S4.T3.13.13.13.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.4"><span class="ltx_text" id="S4.T3.13.13.13.4.1" style="font-size:80%;">0.10</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.5"><span class="ltx_text" id="S4.T3.13.13.13.5.1" style="font-size:80%;">0.41</span></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.6"><span class="ltx_text" id="S4.T3.13.13.13.6.1" style="font-size:80%;">0.82</span></td> </tr> <tr class="ltx_tr" id="S4.T3.13.13.16.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T3.13.13.16.3.1"><span class="ltx_text" id="S4.T3.13.13.16.3.1.1" style="font-size:80%;">GeoReasoner</span></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.2"><span class="ltx_text" id="S4.T3.13.13.16.3.2.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.3"><span class="ltx_text" id="S4.T3.13.13.16.3.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.4"><span class="ltx_text" id="S4.T3.13.13.16.3.4.1" style="font-size:80%;">0.13</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.5"><span class="ltx_text" id="S4.T3.13.13.16.3.5.1" style="font-size:80%;">0.44</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.6"><span class="ltx_text" id="S4.T3.13.13.16.3.6.1" style="font-size:80%;">0.86</span></td> </tr> </tbody> </table> </figure> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS3"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag 
ltx_tag_subsubsection">4.2.3 </span>Ablation Experiments</h4> <div class="ltx_para" id="S4.SS2.SSS3.p1"> <p class="ltx_p" id="S4.SS2.SSS3.p1.1">To assess the contributions of the location tuning and reasoning tuning components in <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p1.1.1">GeoReasoner</em>, we designed several ablation experiments using the Qwen-VL <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>)</cite> pre-trained model as the baseline. Next, we integrated the Qwen-VL pre-trained model with LoRA1 (<em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p1.1.2">GeoReasoner</em> without location tuning) and with LoRA2 (<em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p1.1.3">GeoReasoner</em> without reasoning tuning). The last experiment involved the full <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p1.1.4">GeoReasoner</em> model, which includes both the location tuning and reasoning tuning components. All models were given the same prompts as in the previous experiments.</p> </div> <div class="ltx_para" id="S4.SS2.SSS3.p2"> <p class="ltx_p" id="S4.SS2.SSS3.p2.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T2" title="Table 2 ‣ 4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">2</span></a> presents the quantitative results in terms of <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p2.1.1">accuracy</em>, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p2.1.2">recall</em>, and <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p2.1.3">F1</em>. Overall, the results indicate that both the location tuning and reasoning tuning components improve model performance. 
Specifically, the location tuning component is essential for geo-localization: <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p2.1.4">GeoReasoner</em> w/o reasoning tuning (row 3) achieves much higher accuracy than <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p2.1.5">GeoReasoner</em> w/o location tuning (row 2), especially for fine-grained city-level prediction (0.7029 vs. 0.4114). This result further strengthens the evidence that high-locatability GSV images are essential for geo-localization. The reasoning tuning component also contributes to the improvement, as the full <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS3.p2.1.6">GeoReasoner</em> (row 4) achieves the highest accuracy at both the country level (0.8237) and the city level (0.7521).</p> </div> <figure class="ltx_table" id="S4.T4"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_table">Table 4: </span>Comparison results on the Im2GPS3k dataset. The top six rows are taken from the results reported in the original papers, while the last four rows are from retesting on the filtered Im2GPS3k dataset, which includes only highly locatable data.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T4.15.15"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T4.15.15.16.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_tt" id="S4.T4.15.15.16.1.1" rowspan="2"><span class="ltx_text" id="S4.T4.15.15.16.1.1.1" style="font-size:80%;">Model</span></th> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="S4.T4.15.15.16.1.2"><span class="ltx_text" id="S4.T4.15.15.16.1.2.1" style="font-size:80%;">Dataset w/ Filter</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T4.15.15.16.1.3"><span class="ltx_text" id="S4.T4.15.15.16.1.3.1" style="font-size:80%;">Street</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T4.15.15.16.1.4"><span class="ltx_text" id="S4.T4.15.15.16.1.4.1" 
style="font-size:80%;">City</span></td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T4.15.15.16.1.5"><span class="ltx_text" id="S4.T4.15.15.16.1.5.1" style="font-size:80%;">Country</span></td> </tr> <tr class="ltx_tr" id="S4.T4.15.15.17.2"> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.1"><span class="ltx_text" id="S4.T4.15.15.17.2.1.1" style="font-size:80%;">Train</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.2"><span class="ltx_text" id="S4.T4.15.15.17.2.2.1" style="font-size:80%;">Test</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.3"><span class="ltx_text" id="S4.T4.15.15.17.2.3.1" style="font-size:80%;">1km</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.4"><span class="ltx_text" id="S4.T4.15.15.17.2.4.1" style="font-size:80%;">25km</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.5"><span class="ltx_text" id="S4.T4.15.15.17.2.5.1" style="font-size:80%;">750km</span></td> </tr> <tr class="ltx_tr" id="S4.T4.2.2.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T4.2.2.2.3"><span class="ltx_text" id="S4.T4.2.2.2.3.1" style="font-size:80%;">PlaNet</span></th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.1.1.1.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.1.1.1.1.m1.1"><semantics id="S4.T4.1.1.1.1.m1.1a"><mo id="S4.T4.1.1.1.1.m1.1.1" mathsize="80%" xref="S4.T4.1.1.1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.1.1.1.1.m1.1b"><times id="S4.T4.1.1.1.1.m1.1.1.cmml" xref="S4.T4.1.1.1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.1.1.1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.1.1.1.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.2.2.2.2.m1.1"><semantics 
id="S4.T4.2.2.2.2.m1.1a"><mo id="S4.T4.2.2.2.2.m1.1.1" mathsize="80%" xref="S4.T4.2.2.2.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.2.2.2.2.m1.1b"><times id="S4.T4.2.2.2.2.m1.1.1.cmml" xref="S4.T4.2.2.2.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.2.2.2.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.2.2.2.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.4"><span class="ltx_text" id="S4.T4.2.2.2.4.1" style="font-size:80%;">0.09</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.5"><span class="ltx_text" id="S4.T4.2.2.2.5.1" style="font-size:80%;">0.25</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.6"><span class="ltx_text" id="S4.T4.2.2.2.6.1" style="font-size:80%;">0.48</span></td> </tr> <tr class="ltx_tr" id="S4.T4.4.4.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.4.4.4.3"><span class="ltx_text" id="S4.T4.4.4.4.3.1" style="font-size:80%;">CPlaNet</span></th> <td class="ltx_td ltx_align_center" id="S4.T4.3.3.3.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.3.3.3.1.m1.1"><semantics id="S4.T4.3.3.3.1.m1.1a"><mo id="S4.T4.3.3.3.1.m1.1.1" mathsize="80%" xref="S4.T4.3.3.3.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.3.3.3.1.m1.1b"><times id="S4.T4.3.3.3.1.m1.1.1.cmml" xref="S4.T4.3.3.3.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.3.3.3.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.3.3.3.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.4.4.4.2.m1.1"><semantics id="S4.T4.4.4.4.2.m1.1a"><mo id="S4.T4.4.4.4.2.m1.1.1" mathsize="80%" xref="S4.T4.4.4.4.2.m1.1.1.cmml">×</mo><annotation-xml 
encoding="MathML-Content" id="S4.T4.4.4.4.2.m1.1b"><times id="S4.T4.4.4.4.2.m1.1.1.cmml" xref="S4.T4.4.4.4.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.4.4.4.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.4.4.4.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.4"><span class="ltx_text" id="S4.T4.4.4.4.4.1" style="font-size:80%;">0.10</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.5"><span class="ltx_text" id="S4.T4.4.4.4.5.1" style="font-size:80%;">0.27</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.6"><span class="ltx_text" id="S4.T4.4.4.4.6.1" style="font-size:80%;">0.49</span></td> </tr> <tr class="ltx_tr" id="S4.T4.6.6.6"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.6.6.6.3"><span class="ltx_text" id="S4.T4.6.6.6.3.1" style="font-size:80%;">ISNs</span></th> <td class="ltx_td ltx_align_center" id="S4.T4.5.5.5.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.5.5.5.1.m1.1"><semantics id="S4.T4.5.5.5.1.m1.1a"><mo id="S4.T4.5.5.5.1.m1.1.1" mathsize="80%" xref="S4.T4.5.5.5.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.5.5.5.1.m1.1b"><times id="S4.T4.5.5.5.1.m1.1.1.cmml" xref="S4.T4.5.5.5.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.5.5.5.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.5.5.5.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.6.6.6.2.m1.1"><semantics id="S4.T4.6.6.6.2.m1.1a"><mo id="S4.T4.6.6.6.2.m1.1.1" mathsize="80%" xref="S4.T4.6.6.6.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.6.6.6.2.m1.1b"><times id="S4.T4.6.6.6.2.m1.1.1.cmml" xref="S4.T4.6.6.6.2.m1.1.1"></times></annotation-xml><annotation 
encoding="application/x-tex" id="S4.T4.6.6.6.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.6.6.6.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.4"><span class="ltx_text" id="S4.T4.6.6.6.4.1" style="font-size:80%;">0.11</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.5"><span class="ltx_text" id="S4.T4.6.6.6.5.1" style="font-size:80%;">0.28</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.6"><span class="ltx_text" id="S4.T4.6.6.6.6.1" style="font-size:80%;">0.50</span></td> </tr> <tr class="ltx_tr" id="S4.T4.8.8.8"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.8.8.8.3"><span class="ltx_text" id="S4.T4.8.8.8.3.1" style="font-size:80%;">Translocator</span></th> <td class="ltx_td ltx_align_center" id="S4.T4.7.7.7.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.7.7.7.1.m1.1"><semantics id="S4.T4.7.7.7.1.m1.1a"><mo id="S4.T4.7.7.7.1.m1.1.1" mathsize="80%" xref="S4.T4.7.7.7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.7.7.7.1.m1.1b"><times id="S4.T4.7.7.7.1.m1.1.1.cmml" xref="S4.T4.7.7.7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.7.7.7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.7.7.7.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.8.8.8.2.m1.1"><semantics id="S4.T4.8.8.8.2.m1.1a"><mo id="S4.T4.8.8.8.2.m1.1.1" mathsize="80%" xref="S4.T4.8.8.8.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.8.8.8.2.m1.1b"><times id="S4.T4.8.8.8.2.m1.1.1.cmml" xref="S4.T4.8.8.8.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.8.8.8.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" 
id="S4.T4.8.8.8.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.4"><span class="ltx_text" id="S4.T4.8.8.8.4.1" style="font-size:80%;">0.12</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.5"><span class="ltx_text" id="S4.T4.8.8.8.5.1" style="font-size:80%;">0.31</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.6"><span class="ltx_text" id="S4.T4.8.8.8.6.1" style="font-size:80%;">0.59</span></td> </tr> <tr class="ltx_tr" id="S4.T4.10.10.10"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.10.10.10.3"><span class="ltx_text" id="S4.T4.10.10.10.3.1" style="font-size:80%;">GeoDecoder</span></th> <td class="ltx_td ltx_align_center" id="S4.T4.9.9.9.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.9.9.9.1.m1.1"><semantics id="S4.T4.9.9.9.1.m1.1a"><mo id="S4.T4.9.9.9.1.m1.1.1" mathsize="80%" xref="S4.T4.9.9.9.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.9.9.9.1.m1.1b"><times id="S4.T4.9.9.9.1.m1.1.1.cmml" xref="S4.T4.9.9.9.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.9.9.9.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.9.9.9.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.10.10.10.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.10.10.10.2.m1.1"><semantics id="S4.T4.10.10.10.2.m1.1a"><mo id="S4.T4.10.10.10.2.m1.1.1" mathsize="80%" xref="S4.T4.10.10.10.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.10.10.10.2.m1.1b"><times id="S4.T4.10.10.10.2.m1.1.1.cmml" xref="S4.T4.10.10.10.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.10.10.10.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.10.10.10.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" 
id="S4.T4.10.10.10.4"><span class="ltx_text" id="S4.T4.10.10.10.4.1" style="font-size:80%;">0.13</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.10.10.10.5"><span class="ltx_text" id="S4.T4.10.10.10.5.1" style="font-size:80%;">0.34</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.10.10.10.6"><span class="ltx_text" id="S4.T4.10.10.10.6.1" style="font-size:80%;">0.61</span></td> </tr> <tr class="ltx_tr" id="S4.T4.12.12.12"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.12.12.12.3"><span class="ltx_text" id="S4.T4.12.12.12.3.1" style="font-size:80%;">GeoCLIP</span></th> <td class="ltx_td ltx_align_center" id="S4.T4.11.11.11.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.11.11.11.1.m1.1"><semantics id="S4.T4.11.11.11.1.m1.1a"><mo id="S4.T4.11.11.11.1.m1.1.1" mathsize="80%" xref="S4.T4.11.11.11.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.11.11.11.1.m1.1b"><times id="S4.T4.11.11.11.1.m1.1.1.cmml" xref="S4.T4.11.11.11.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.11.11.11.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.11.11.11.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.12.12.12.2.m1.1"><semantics id="S4.T4.12.12.12.2.m1.1a"><mo id="S4.T4.12.12.12.2.m1.1.1" mathsize="80%" xref="S4.T4.12.12.12.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.12.12.12.2.m1.1b"><times id="S4.T4.12.12.12.2.m1.1.1.cmml" xref="S4.T4.12.12.12.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.12.12.12.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.12.12.12.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.4"><span class="ltx_text" id="S4.T4.12.12.12.4.1" 
style="font-size:80%;">0.14</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.5"><span class="ltx_text" id="S4.T4.12.12.12.5.1" style="font-size:80%;">0.34</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.6"><span class="ltx_text" id="S4.T4.12.12.12.6.1" style="font-size:80%;">0.70</span></td> </tr> <tr class="ltx_tr" id="S4.T4.13.13.13"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T4.13.13.13.2"><span class="ltx_text" id="S4.T4.13.13.13.2.1" style="font-size:80%;">ISNs</span></th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.13.13.13.1.m1.1"><semantics id="S4.T4.13.13.13.1.m1.1a"><mo id="S4.T4.13.13.13.1.m1.1.1" mathsize="80%" xref="S4.T4.13.13.13.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.13.13.13.1.m1.1b"><times id="S4.T4.13.13.13.1.m1.1.1.cmml" xref="S4.T4.13.13.13.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.13.13.13.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.13.13.13.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.3"><span class="ltx_text" id="S4.T4.13.13.13.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.4"><span class="ltx_text" id="S4.T4.13.13.13.4.1" style="font-size:80%;">0.10</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.5"><span class="ltx_text" id="S4.T4.13.13.13.5.1" style="font-size:80%;">0.29</span></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.6"><span class="ltx_text" id="S4.T4.13.13.13.6.1" style="font-size:80%;">0.59</span></td> </tr> <tr class="ltx_tr" id="S4.T4.14.14.14"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.14.14.14.2"><span class="ltx_text" id="S4.T4.14.14.14.2.1" 
style="font-size:80%;">GeoCLIP</span></th> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.14.14.14.1.m1.1"><semantics id="S4.T4.14.14.14.1.m1.1a"><mo id="S4.T4.14.14.14.1.m1.1.1" mathsize="80%" xref="S4.T4.14.14.14.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.14.14.14.1.m1.1b"><times id="S4.T4.14.14.14.1.m1.1.1.cmml" xref="S4.T4.14.14.14.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.14.14.14.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.14.14.14.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.3"><span class="ltx_text" id="S4.T4.14.14.14.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.4"><span class="ltx_text" id="S4.T4.14.14.14.4.1" style="font-size:80%;">0.12</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.5"><span class="ltx_text" id="S4.T4.14.14.14.5.1" style="font-size:80%;">0.38</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.6"><span class="ltx_text" id="S4.T4.14.14.14.6.1" style="font-size:80%;">0.83</span></td> </tr> <tr class="ltx_tr" id="S4.T4.15.15.15"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.15.15.15.2"><span class="ltx_text" id="S4.T4.15.15.15.2.1" style="font-size:80%;">GeoReasoner</span></th> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.15.15.15.1.m1.1"><semantics id="S4.T4.15.15.15.1.m1.1a"><mo id="S4.T4.15.15.15.1.m1.1.1" mathsize="80%" xref="S4.T4.15.15.15.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.15.15.15.1.m1.1b"><times id="S4.T4.15.15.15.1.m1.1.1.cmml" xref="S4.T4.15.15.15.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" 
id="S4.T4.15.15.15.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.15.15.15.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.3"><span class="ltx_text" id="S4.T4.15.15.15.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.4"><span class="ltx_text" id="S4.T4.15.15.15.4.1" style="font-size:80%;">0.09</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.5"><span class="ltx_text" id="S4.T4.15.15.15.5.1" style="font-size:80%;">0.35</span></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.6"><span class="ltx_text" id="S4.T4.15.15.15.6.1" style="font-size:80%;">0.74</span></td> </tr> <tr class="ltx_tr" id="S4.T4.15.15.18.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T4.15.15.18.3.1"><span class="ltx_text" id="S4.T4.15.15.18.3.1.1" style="font-size:80%;">GeoReasoner</span></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.2"><span class="ltx_text" id="S4.T4.15.15.18.3.2.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.3"><span class="ltx_text" id="S4.T4.15.15.18.3.3.1" style="font-size:80%;">✓</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.4"><span class="ltx_text" id="S4.T4.15.15.18.3.4.1" style="font-size:80%;">0.10</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.5"><span class="ltx_text" id="S4.T4.15.15.18.3.5.1" style="font-size:80%;">0.38</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.6"><span class="ltx_text" id="S4.T4.15.15.18.3.6.1" style="font-size:80%;">0.83</span></td> </tr> </tbody> </table> </figure> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS4"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">4.2.4 </span>Generalizability 
Evaluation</h4> <div class="ltx_para" id="S4.SS2.SSS4.p1"> <p class="ltx_p" id="S4.SS2.SSS4.p1.1">To further assess the generalizability of <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p1.1.1">GeoReasoner</em> in geo-localization, we conduct additional testing on open Flickr image datasets of Im2GPS <cite class="ltx_cite ltx_citemacro_citep">(Hays & Efros, <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib12" title="">2008</a>)</cite> and Im2GPS3k <cite class="ltx_cite ltx_citemacro_citep">(Vo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib30" title="">2017</a>)</cite>. Here we use only 10k Flickr images for fine-tuning <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p1.1.2">GeoReasoner</em>. Since <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p1.1.3">GeoReasoner</em> predicts city names rather than GPS coordinates, we first convert the predicted city names generated by <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p1.1.4">GeoReasoner</em> into the GPS coordinates of their respective city centers, then measure the distance between these predicted coordinates and ground-truth locations.</p> </div> <div class="ltx_para" id="S4.SS2.SSS4.p2"> <p class="ltx_p" id="S4.SS2.SSS4.p2.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T3" title="Table 3 ‣ 4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">3</span></a> and Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T4" title="Table 4 ‣ 4.2.3 Ablation Experiments ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4</span></a> present the performance comparison of <em 
class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p2.1.1">GeoReasoner</em> with PlaNet <cite class="ltx_cite ltx_citemacro_citep">(Weyand et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>, CPlaNet <cite class="ltx_cite ltx_citemacro_citep">(Seo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>)</cite>, ISNs <cite class="ltx_cite ltx_citemacro_citep">(Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>)</cite>, Translocator <cite class="ltx_cite ltx_citemacro_citep">(Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>)</cite>, GeoDecoder <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>)</cite>, and GeoCLIP <cite class="ltx_cite ltx_citemacro_citep">(Vivanco Cepeda et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib29" title="">2024</a>)</cite> on the Im2GPS and Im2GPS3k datasets, respectively. The results demonstrate that fine-tuning <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p2.1.2">GeoReasoner</em> using highly locatable images significantly improves prediction accuracy at the street, city, and country levels (row 8 vs. row 9 in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T3" title="Table 3 ‣ 4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">3</span></a>, and row 9 vs. 
row 10 in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T4" title="Table 4 ‣ 4.2.3 Ablation Experiments ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4</span></a>). Remarkably, despite being fine-tuned solely on a smaller number of Flickr images, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p2.1.3">GeoReasoner</em> achieves results comparable to ISNs and GeoCLIP trained on millions of Flickr images, particularly in terms of city- and country-level accuracy. In addition, <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p2.1.4">GeoReasoner</em> trained on the filtered, highly locatable Flickr images also shows improvements in city- and country-level geo-localization, demonstrating the generalizability of our proposed <em class="ltx_emph ltx_font_italic" id="S4.SS2.SSS4.p2.1.5">locatability</em> module.</p> </div> </section> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5 </span>Discussion</h2> <div class="ltx_para" id="S5.p1"> <p class="ltx_p" id="S5.p1.1"><span class="ltx_text ltx_font_bold" id="S5.p1.1.1">The significance of high-locatability street-view images.</span> We observe a significant performance improvement when <em class="ltx_emph ltx_font_italic" id="S5.p1.1.2">GeoReasoner</em> is trained on high-locatability street-view images. Such images often contain explicit visual clues such as stylized architecture, traffic signs, and landmarks, providing the model with richer contextual information. Therefore, increasing the quality of the training dataset enhances the model’s geo-localization performance. Additionally, the quantity of high-locatability images is vital, as the model trained with 70K images (as in Sect. 
<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS2" title="4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4.2.2</span></a>) achieves significantly higher accuracy than the one trained with 10K images (Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1.SSS2" title="4.1.2 Quantitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">4.1.2</span></a>). In balancing the quality and quantity of the training dataset, we empirically applied a threshold of 0.4 to differentiate between highly and less localizable street views. Setting the threshold too high (<em class="ltx_emph ltx_font_italic" id="S5.p1.1.3">e.g.</em>, 0.7) can lead to a notable decrease in the number of high-locatability images, whilst a lower threshold (<em class="ltx_emph ltx_font_italic" id="S5.p1.1.4">e.g.</em>, 0.1) may introduce low-quality images.</p> </div> <div class="ltx_para" id="S5.p2"> <p class="ltx_p" id="S5.p2.1"><span class="ltx_text ltx_font_bold" id="S5.p2.1.1">The necessity of the reasoning process.</span> The introduction of the reasoning component successfully elevated <em class="ltx_emph ltx_font_italic" id="S5.p2.1.2">GeoReasoner</em>’s performance in the geo-localization task. This signifies that the LVLM can adeptly capture intricate relationships among image features, location clues, and geo-locations in the training process. We implemented an innovative solution to empower the reasoning capability within <em class="ltx_emph ltx_font_italic" id="S5.p2.1.3">GeoReasoner</em> by leveraging human inference knowledge extracted from geo-localization games. 
Despite the relatively small dataset, a noticeable improvement in performance has been achieved. In the future, we plan to expand the reasoning dataset by diversifying the influencing clues. For instance, the current textual clues lack landscape information, which could provide invaluable insights for geo-localization. We will collaborate with domain experts such as urban planners and geographers to address these limitations.</p> </div> <div class="ltx_para" id="S5.p3"> <p class="ltx_p" id="S5.p3.1"><span class="ltx_text ltx_font_bold" id="S5.p3.1.1">Failure cases.</span> <em class="ltx_emph ltx_font_italic" id="S5.p3.1.2">GeoReasoner</em> comprehends architectural style as a pivotal factor in geo-localization. However, the model can be misled by the learned significance of architectural style. Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S5.F8" title="Figure 8 ‣ 5 Discussion ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">8</span></a> presents a street view of the Eiffel Tower in Paris, France (left), and replicas of the Eiffel Tower in New York, USA (middle) and in Hangzhou, China (right). <em class="ltx_emph ltx_font_italic" id="S5.p3.1.3">GeoReasoner</em> fails to distinguish between them, predicting all instances as located in Paris, France. This misclassification is not unique to <em class="ltx_emph ltx_font_italic" id="S5.p3.1.4">GeoReasoner</em> but also extends to other LVLMs like GPT-4V. Consequently, it underscores the necessity for LVLM-based methods to delve deeper into knowledge for more sophisticated geo-localization capabilities. 
Once again, it is imperative to collaborate with domain experts and enhance the visual clues and reasoning procedure comprehensively to tackle this issue.</p> </div> <figure class="ltx_figure" id="S5.F8"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="407" id="S5.F8.1.g1" src="x7.png" width="821"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 8: </span>GeoReasoner fails to distinguish the Eiffel Tower and its replicas in New York, USA, and Hangzhou, China.</figcaption> </figure> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6 </span>Conclusion</h2> <div class="ltx_para" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">In this paper, we present a new paradigm that integrates a large vision-language model (LVLM) with human inference knowledge for street view geo-localization with reasoning. We introduce the concept of <em class="ltx_emph ltx_font_italic" id="S6.p1.1.1">locatability</em> and devise a CLIP-based network to quantify the degree of locatability in street-view images, facilitating the selection of high-quality data. We design an LVLM-based model named <em class="ltx_emph ltx_font_italic" id="S6.p1.1.2">GeoReasoner</em>, which harnesses external knowledge of human inference from real geo-localization games and curated high-quality data to enhance the performance of geo-localization tasks with reasoning capabilities. The model undergoes two-stage fine-tuning, namely <em class="ltx_emph ltx_font_italic" id="S6.p1.1.3">reasoning tuning</em> and <em class="ltx_emph ltx_font_italic" id="S6.p1.1.4">location tuning</em>. The <em class="ltx_emph ltx_font_italic" id="S6.p1.1.5">reasoning tuning</em> stage aims to acquire potential linkage between coarse-grained geographical locations (<em class="ltx_emph ltx_font_italic" id="S6.p1.1.6">i.e.</em>, country) and the associated positioning reasons. 
In the <em class="ltx_emph ltx_font_italic" id="S6.p1.1.7">location tuning</em> stage, we employ the curated high-quality data to further refine the model in fine-grained geo-localization (<em class="ltx_emph ltx_font_italic" id="S6.p1.1.8">i.e.</em>, city) learning. Extensive experiments demonstrate that <em class="ltx_emph ltx_font_italic" id="S6.p1.1.9">GeoReasoner</em> outperforms previous models qualitatively and quantitatively.</p> </div> </section> <section class="ltx_section" id="Sx1"> <h2 class="ltx_title ltx_title_section">Acknowledgements</h2> <div class="ltx_para" id="Sx1.p1"> <p class="ltx_p" id="Sx1.p1.1">We would like to thank Yao Zhou and Wenqi Shao for their insightful discussions and Ziyao Gao for her assistance in drawing the figures in this paper. We also extend our gratitude to the anonymous reviewers for their valuable comments. This work is partially supported by the National Natural Science Foundation of China (62172398, 42171456, 52078343).</p> </div> </section> <section class="ltx_section" id="Sx2"> <h2 class="ltx_title ltx_title_section">Impact Statement</h2> <div class="ltx_para" id="Sx2.p1"> <p class="ltx_p" id="Sx2.p1.1"><em class="ltx_emph ltx_font_italic" id="Sx2.p1.1.1">GeoReasoner</em> advances image-based geo-localization technologies that are pivotal for many applications such as autonomous navigation. The pipeline of constructing the dataset featuring high-locatability street views proves highly beneficial across multiple scenarios, such as urban studies, culture studies, and digital humanities, all of which are increasingly reliant on the analysis of high-quality street-view data.</p> </div> <div class="ltx_para" id="Sx2.p2"> <p class="ltx_p" id="Sx2.p2.1">The proposed paradigm represents the fusion of LVLM with human inference knowledge, which has implications for the advancement of artificial intelligence (AI) that is more aligned with human cognition. 
The synergy can lead to the creation of AI that is not only more effective in complex inference tasks but also more understandable and relatable to human users. As AI becomes more pervasive in daily life, the importance of designing systems that are both transparent and capable of complex reasoning cannot be overstated.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Achiam et al. (2023)</span> <span class="ltx_bibblock"> Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. </span> <span class="ltx_bibblock">GPT-4 Technical Report. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib1.1.1">arXiv preprint arXiv:2303.08774</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bai et al. (2023a)</span> <span class="ltx_bibblock"> Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. </span> <span class="ltx_bibblock">Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib2.1.1">arXiv preprint arXiv:2308.12966</em>, 2023a. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bai et al. (2023b)</span> <span class="ltx_bibblock"> Bai, Y., Shang, C., Li, Y., Shen, L., Jin, S., and Shen, Q. </span> <span class="ltx_bibblock">Transport Object Detection in Street View Imagery Using Decomposed Convolutional Neural Networks. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib3.1.1">Mathematics</em>, 11(18):3839, 2023b. 
</span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Campbell et al. (2019)</span> <span class="ltx_bibblock"> Campbell, A., Both, A., and Sun, Q. C. </span> <span class="ltx_bibblock">Detecting and mapping traffic signs from Google Street View images using deep learning and GIS. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib4.1.1">Computers, Environment and Urban Systems</em>, 77:101350, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chalvatzaras et al. (2022)</span> <span class="ltx_bibblock"> Chalvatzaras, A., Pratikakis, I., and Amanatiadis, A. A. </span> <span class="ltx_bibblock">A Survey on Map-Based Localization Techniques for Autonomous Vehicles. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">IEEE Transactions on Intelligent Vehicles</em>, 8(2):1574–1596, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cheng et al. (2021)</span> <span class="ltx_bibblock"> Cheng, B., Schwing, A., and Kirillov, A. </span> <span class="ltx_bibblock">Per-Pixel Classification is Not All You Need for Semantic Segmentation. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">Advances in Neural Information Processing Systems</em>, 34:17864–17875, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cheng et al. (2022)</span> <span class="ltx_bibblock"> Cheng, W., Wen, R., Huang, H., Miao, W., and Wang, C. </span> <span class="ltx_bibblock">OPTDP: Towards optimal personalized trajectory differential privacy for trajectory data publishing. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">Neurocomputing</em>, 472:201–211, 2022. 
</span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Clark et al. (2023)</span> <span class="ltx_bibblock"> Clark, B., Kerrigan, A., Kulkarni, P. P., Cepeda, V. V., and Shah, M. </span> <span class="ltx_bibblock">Where We Are and What We’re Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib8.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp. 23182–23190, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dai et al. (2023)</span> <span class="ltx_bibblock"> Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., and Wei, F. </span> <span class="ltx_bibblock">Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">Findings of the Association for Computational Linguistics: ACL 2023</em>, pp. 4005–4019, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dosovitskiy et al. (2021)</span> <span class="ltx_bibblock"> Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. </span> <span class="ltx_bibblock">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">Proceedings of International Conference on Learning Representations</em>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Haas et al. (2023)</span> <span class="ltx_bibblock"> Haas, L., Alberti, S., and Skreta, M. 
</span> <span class="ltx_bibblock">Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">arXiv preprint arXiv:2302.00275</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hays & Efros (2008)</span> <span class="ltx_bibblock"> Hays, J. and Efros, A. A. </span> <span class="ltx_bibblock">IM2GPS: estimating geographic information from a single image. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp. 1–8. IEEE, 2008. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hu et al. (2022)</span> <span class="ltx_bibblock"> Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. </span> <span class="ltx_bibblock">LoRA: Low-Rank Adaptation of Large Language Models. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">Proceedings of International Conference on Learning Representations</em>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kenton & Toutanova (2019)</span> <span class="ltx_bibblock"> Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. </span> <span class="ltx_bibblock">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</em>, pp. 4171–4186, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al.
(2021)</span> <span class="ltx_bibblock"> Li, Y., Wu, C., Li, L., Liu, Y., and Zhu, J. </span> <span class="ltx_bibblock">Caption Generation From Road Images for Traffic Scene Modeling. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">IEEE Transactions on Intelligent Transportation Systems</em>, 23(7):7805–7816, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lin et al. (2022)</span> <span class="ltx_bibblock"> Lin, J., Zheng, Z., Zhong, Z., Luo, Z., Li, S., Yang, Y., and Sebe, N. </span> <span class="ltx_bibblock">Joint Representation Learning and Keypoint Detection for Cross-View Geo-Localization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">IEEE Transactions on Image Processing</em>, 31:3780–3792, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2024)</span> <span class="ltx_bibblock"> Liu, H., Li, C., Wu, Q., and Lee, Y. J. </span> <span class="ltx_bibblock">Visual Instruction Tuning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">Advances in Neural Information Processing Systems</em>, 36, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Luo et al. (2022)</span> <span class="ltx_bibblock"> Luo, G., Biamby, G., Darrell, T., Fried, D., and Rohrbach, A. </span> <span class="ltx_bibblock">G3: Geolocation via guidebook grounding. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">Findings of the Association for Computational Linguistics: EMNLP 2022</em>, pp. 5841–5853, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Müller-Budack et al. 
(2018)</span> <span class="ltx_bibblock"> Müller-Budack, E., Pustu-Iren, K., and Ewerth, R. </span> <span class="ltx_bibblock">Geolocation Estimation of Photos using a Hierarchical Model and Scene Classification. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib19.1.1">Proceedings of the European Conference on Computer Vision</em>, pp. 563–579, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Pramanick et al. (2022)</span> <span class="ltx_bibblock"> Pramanick, S., Nowara, E. M., Gleason, J., Castillo, C. D., and Chellappa, R. </span> <span class="ltx_bibblock">Where in the World is this Image? Transformer-based Geo-localization in the Wild. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib20.1.1">Proceedings of the European Conference on Computer Vision</em>, pp. 196–215, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Qiao et al. (2023)</span> <span class="ltx_bibblock"> Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F., and Chen, H. </span> <span class="ltx_bibblock">Reasoning with Language Model Prompting: A Survey. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib21.1.1">Proceedings of the Annual Meeting of the Association for Computational Linguistics</em>, pp. 5368–5393, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Radford et al. (2021)</span> <span class="ltx_bibblock"> Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. </span> <span class="ltx_bibblock">Learning Transferable Visual Models From Natural Language Supervision. 
</span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib22.1.1">Proceedings of International Conference on Machine Learning</em>, pp. 8748–8763. PMLR, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Rao et al. (2023)</span> <span class="ltx_bibblock"> Rao, J., Shan, Z., Liu, L., Zhou, Y., and Yang, Y. </span> <span class="ltx_bibblock">Retrieval-based Knowledge Augmented Vision Language Pre-training. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib23.1.1">Proceedings of the 31st ACM International Conference on Multimedia</em>, pp. 5399–5409, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Reimers & Gurevych (2019)</span> <span class="ltx_bibblock"> Reimers, N. and Gurevych, I. </span> <span class="ltx_bibblock">Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib24.1.1">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing</em>, pp. 3980–3990, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Seo et al. (2018)</span> <span class="ltx_bibblock"> Seo, P. H., Weyand, T., Sim, J., and Han, B. </span> <span class="ltx_bibblock">CPlaNet: Enhancing Image Geolocalization by Combinatorial Partitioning of Maps. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib25.1.1">Proceedings of the European Conference on Computer Vision</em>, pp. 536–551, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shao et al. (2023)</span> <span class="ltx_bibblock"> Shao, Z., Yu, Z., Wang, M., and Yu, J. 
</span> <span class="ltx_bibblock">Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib26.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp. 14974–14983, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Shen et al. (2018)</span> <span class="ltx_bibblock"> Shen, Q., Zeng, W., Ye, Y., Mueller Arisona, S., Schubiger, S., Burkhard, R., and Qu, H. </span> <span class="ltx_bibblock">StreetVizor: Visual Exploration of Human-Scale Urban Forms Based on Street Views. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">IEEE Transactions on Visualization and Computer Graphics</em>, 24(1):1004–1013, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Theiner et al. (2022)</span> <span class="ltx_bibblock"> Theiner, J., Müller-Budack, E., and Ewerth, R. </span> <span class="ltx_bibblock">Interpretable Semantic Photo Geolocation. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib28.1.1">Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</em>, pp. 750–760, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Vivanco Cepeda et al. (2024)</span> <span class="ltx_bibblock"> Vivanco Cepeda, V., Nayak, G. K., and Shah, M. </span> <span class="ltx_bibblock">GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib29.1.1">Advances in Neural Information Processing Systems</em>, 36, 2024. 
</span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Vo et al. (2017)</span> <span class="ltx_bibblock"> Vo, N., Jacobs, N., and Hays, J. </span> <span class="ltx_bibblock">Revisiting IM2GPS in the Deep Learning Era. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib30.1.1">Proceedings of the IEEE International Conference on Computer Vision</em>, pp. 2621–2630, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wei et al. (2022)</span> <span class="ltx_bibblock"> Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. </span> <span class="ltx_bibblock">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib31.1.1">Advances in Neural Information Processing Systems</em>, 35:24824–24837, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Weyand et al. (2016)</span> <span class="ltx_bibblock"> Weyand, T., Kostrikov, I., and Philbin, J. </span> <span class="ltx_bibblock">PlaNet - Photo Geolocation with Convolutional Neural Networks. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib32.1.1">Proceedings of the European Conference on Computer Vision</em>, pp. 37–55, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Xu et al. (2023)</span> <span class="ltx_bibblock"> Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., and Luo, P. </span> <span class="ltx_bibblock">LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib33.1.1">arXiv preprint arXiv:2306.09265</em>, 2023. 
</span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yao et al. (2024)</span> <span class="ltx_bibblock"> Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. </span> <span class="ltx_bibblock">Tree of Thoughts: Deliberate Problem Solving with Large Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.1.1">Advances in Neural Information Processing Systems</em>, 36, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al. (2019a)</span> <span class="ltx_bibblock"> Ye, Y., Richards, D., Lu, Y., Song, X., Zhuang, Y., Zeng, W., and Zhong, T. </span> <span class="ltx_bibblock">Measuring daily accessed street greenery: A human-scale approach for informing better urban planning practices. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib35.1.1">Landscape and Urban Planning</em>, 191:103434, 2019a. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al. (2019b)</span> <span class="ltx_bibblock"> Ye, Y., Zeng, W., Shen, Q., Zhang, X., and Lu, Y. </span> <span class="ltx_bibblock">The visual quality of streets: A human-centred continuous measurement based on machine learning algorithms and street view images. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib36.1.1">Environment and Planning B: Urban Analytics and City Science</em>, 46(8):1439–1457, 2019b. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ying et al. (2024)</span> <span class="ltx_bibblock"> Ying, K., Meng, F., Wang, J., Li, Z., Lin, H., Yang, Y., Zhang, H., Zhang, W., Lin, Y., Liu, S., et al. 
</span> <span class="ltx_bibblock">MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib37.1.1">arXiv preprint arXiv:2404.16006</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2018)</span> <span class="ltx_bibblock"> Zhang, F., Zhang, D., Liu, Y., and Lin, H. </span> <span class="ltx_bibblock">Representing place locales using scene elements. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">Computers, Environment and Urban Systems</em>, 71:153–164, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2023a)</span> <span class="ltx_bibblock"> Zhang, H., Song, H., Li, S., Zhou, M., and Song, D. </span> <span class="ltx_bibblock">A Survey of Controllable Text Generation Using Transformer-based Pre-trained Language Models. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib39.1.1">ACM Computing Surveys</em>, 56(3):1–37, 2023a. </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2023b)</span> <span class="ltx_bibblock"> Zhang, X., Li, X., Sultani, W., Zhou, Y., and Wshah, S. </span> <span class="ltx_bibblock">Cross-View Geo-Localization via Learning Disentangled Geometric Layout Correspondence. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib40.1.1">Proceedings of the AAAI Conference on Artificial Intelligence</em>, volume 37, pp. 3480–3488, 2023b. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhu et al. (2022)</span> <span class="ltx_bibblock"> Zhu, S., Shah, M., and Chen, C. 
</span> <span class="ltx_bibblock">TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib41.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</em>, pp. 1162–1171, 2022. </span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <section class="ltx_appendix" id="A1"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix A </span>Implementation Details</h2> <div class="ltx_para" id="A1.p1"> <p class="ltx_p" id="A1.p1.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A1.T5" title="Table 5 ‣ Appendix A Implementation Details ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">5</span></a> and Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A1.T6" title="Table 6 ‣ Appendix A Implementation Details ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">6</span></a> present the hyper-parameter settings and training details for the models. 
We conducted training and testing on the NVIDIA A800 (80 GB) GPU, with CUDA 12.1, PyTorch 2.0.0, and Transformers 4.33.0.</p> </div> <figure class="ltx_table" id="A1.T5"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_table">Table 5: </span>The hyper-parameter settings of the proposed <em class="ltx_emph ltx_font_italic" id="A1.T5.5.1">GeoReasoner</em>.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A1.T5.6"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A1.T5.6.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_tt" id="A1.T5.6.1.1.1"><span class="ltx_text" id="A1.T5.6.1.1.1.1" style="font-size:80%;">Hyper-parameter</span></th> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T5.6.1.1.2"><span class="ltx_text" id="A1.T5.6.1.1.2.1" style="font-size:80%;">Value</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.2.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" id="A1.T5.6.2.2.1"><span class="ltx_text" id="A1.T5.6.2.2.1.1" style="font-size:80%;">Learning Rate</span></th> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T5.6.2.2.2"><span class="ltx_text" id="A1.T5.6.2.2.2.1" style="font-size:80%;">1e-5</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.3.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.3.3.1"><span class="ltx_text" id="A1.T5.6.3.3.1.1" style="font-size:80%;">Total Batch Size</span></th> <td class="ltx_td ltx_align_center" id="A1.T5.6.3.3.2"><span class="ltx_text" id="A1.T5.6.3.3.2.1" style="font-size:80%;">64</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.4.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.4.4.1"><span class="ltx_text" id="A1.T5.6.4.4.1.1" style="font-size:80%;">Weight Decay</span></th> <td class="ltx_td ltx_align_center" id="A1.T5.6.4.4.2"><span class="ltx_text" id="A1.T5.6.4.4.2.1" style="font-size:80%;">0.1</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.5.5"> <th
class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.5.5.1"><span class="ltx_text" id="A1.T5.6.5.5.1.1" style="font-size:80%;">Warmup Ratio</span></th> <td class="ltx_td ltx_align_center" id="A1.T5.6.5.5.2"><span class="ltx_text" id="A1.T5.6.5.5.2.1" style="font-size:80%;">0.01</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.6.6"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.6.6.1"><span class="ltx_text" id="A1.T5.6.6.6.1.1" style="font-size:80%;">Optimizer</span></th> <td class="ltx_td ltx_align_center" id="A1.T5.6.6.6.2"><span class="ltx_text" id="A1.T5.6.6.6.2.1" style="font-size:80%;">AdamW</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.7.7"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.7.7.1"><span class="ltx_text" id="A1.T5.6.7.7.1.1" style="font-size:80%;">Adam Beta1</span></th> <td class="ltx_td ltx_align_center" id="A1.T5.6.7.7.2"><span class="ltx_text" id="A1.T5.6.7.7.2.1" style="font-size:80%;">0.9</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.8.8"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.8.8.1"><span class="ltx_text" id="A1.T5.6.8.8.1.1" style="font-size:80%;">Adam Beta2</span></th> <td class="ltx_td ltx_align_center" id="A1.T5.6.8.8.2"><span class="ltx_text" id="A1.T5.6.8.8.2.1" style="font-size:80%;">0.95</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.9.9"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.9.9.1"><span class="ltx_text" id="A1.T5.6.9.9.1.1" style="font-size:80%;">LR Scheduler</span></th> <td class="ltx_td ltx_align_center" id="A1.T5.6.9.9.2"><span class="ltx_text" id="A1.T5.6.9.9.2.1" style="font-size:80%;">cosine</span></td> </tr> <tr class="ltx_tr" id="A1.T5.6.10.10"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb" id="A1.T5.6.10.10.1"><span class="ltx_text" id="A1.T5.6.10.10.1.1" style="font-size:80%;">Model Max Length</span></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T5.6.10.10.2"><span 
class="ltx_text" id="A1.T5.6.10.10.2.1" style="font-size:80%;">2048</span></td> </tr> </tbody> </table> </figure> <figure class="ltx_table" id="A1.T6"> <figcaption class="ltx_caption" style="font-size:80%;"><span class="ltx_tag ltx_tag_table">Table 6: </span>The training details of the proposed <em class="ltx_emph ltx_font_italic" id="A1.T6.5.1">GeoReasoner</em>.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A1.T6.6"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A1.T6.6.1.1"> <th class="ltx_td ltx_th ltx_th_row ltx_border_tt" id="A1.T6.6.1.1.1"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A1.T6.6.1.1.2" rowspan="2"><span class="ltx_text" id="A1.T6.6.1.1.2.1" style="font-size:80%;">Training Speed</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A1.T6.6.1.1.3" rowspan="2"><span class="ltx_text" id="A1.T6.6.1.1.3.1" style="font-size:80%;">Inference Latency</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="2" id="A1.T6.6.1.1.4"><span class="ltx_text" id="A1.T6.6.1.1.4.1" style="font-size:80%;">Number of Parameters</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A1.T6.6.1.1.5" rowspan="2"><span class="ltx_text" id="A1.T6.6.1.1.5.1" style="font-size:80%;">FLOPs</span></th> </tr> <tr class="ltx_tr" id="A1.T6.6.2.2"> <th class="ltx_td ltx_th ltx_th_row" id="A1.T6.6.2.2.1"></th> <td class="ltx_td ltx_align_center" id="A1.T6.6.2.2.2"><span class="ltx_text" id="A1.T6.6.2.2.2.1" style="font-size:80%;">Base Model</span></td> <td class="ltx_td ltx_align_center" id="A1.T6.6.2.2.3"><span class="ltx_text" id="A1.T6.6.2.2.3.1" style="font-size:80%;">LoRA</span></td> </tr> <tr class="ltx_tr" id="A1.T6.6.3.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_t" id="A1.T6.6.3.3.1"><span class="ltx_text" id="A1.T6.6.3.3.1.1" style="font-size:80%;">LoRA1
(reason)</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.2"><span class="ltx_text" id="A1.T6.6.3.3.2.1" style="font-size:80%;">0.41 samples/s</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.3"><span class="ltx_text" id="A1.T6.6.3.3.3.1" style="font-size:80%;">1.560 s</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.4"><span class="ltx_text" id="A1.T6.6.3.3.4.1" style="font-size:80%;">9.6B</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.5"><span class="ltx_text" id="A1.T6.6.3.3.5.1" style="font-size:80%;">112.19M</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.6"><span class="ltx_text" id="A1.T6.6.3.3.6.1" style="font-size:80%;">71.9B</span></th> </tr> <tr class="ltx_tr" id="A1.T6.6.4.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="A1.T6.6.4.4.1"><span class="ltx_text" id="A1.T6.6.4.4.1.1" style="font-size:80%;">LoRA2 (location)</span></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.2"><span class="ltx_text" id="A1.T6.6.4.4.2.1" style="font-size:80%;">0.63 samples/s</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.3"><span class="ltx_text" id="A1.T6.6.4.4.3.1" style="font-size:80%;">0.894 s</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.4"><span class="ltx_text" id="A1.T6.6.4.4.4.1" style="font-size:80%;">9.6B</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.5"><span class="ltx_text" id="A1.T6.6.4.4.5.1" style="font-size:80%;">112.19M</span></td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.6"><span class="ltx_text" id="A1.T6.6.4.4.6.1" style="font-size:80%;">71.9B</span></td> </tr> </tbody> </table> </figure> </section> <section class="ltx_appendix" id="A2"> <h2
class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix B </span>Additional Qualitative Results</h2> <div class="ltx_para" id="A2.p1"> <p class="ltx_p" id="A2.p1.1">We present further results of <em class="ltx_emph ltx_font_italic" id="A2.p1.1.1">GeoReasoner</em> on additional street-view images in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A2.F9" title="Figure 9 ‣ Appendix B Additional Qualitative Results ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model"><span class="ltx_text ltx_ref_tag">9</span></a>. Each street-view image is annotated with the ground-truth geographic location and the inference results from <em class="ltx_emph ltx_font_italic" id="A2.p1.1.2">GeoReasoner</em>. The model provides geographic predictions accompanied by reasonable explanations.</p> </div> <figure class="ltx_figure" id="A2.F9"> <p class="ltx_p ltx_align_center ltx_align_center" id="A2.F9.1.1"><span class="ltx_text" id="A2.F9.1.1.1"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="399" id="A2.F9.1.1.1.g1" src="x8.png" width="822"/></span></p> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 9: </span>Additional qualitative results from the proposed <em class="ltx_emph ltx_font_italic" id="A2.F9.3.1">GeoReasoner</em>.</figcaption> </figure> <div class="ltx_pagination ltx_role_newpage"></div> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Thu Oct 17 03:24:45 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative;
bottom:-0.2ex;">XML</span></a> </div></footer> </div> </body> </html>