CINXE.COM

<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model</title>  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <meta content="Machine Learning, ICML" lang="en" name="keywords"/> <base href="/html/2406.18572v2/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S1" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">1 Introduction</a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">2 Related work</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2.SS1" title="In 2 Related work ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">2.1 Street Views</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2.SS2" title="In 2 Related work ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">2.2 Image-based Geo-localization</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S2.SS3" title="In 2 Related work ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">2.3 Vision-Language Models</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3 GeoReasoner</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.SS1" title="In 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3.1 Locatability-Enhanced Data Curation</a></li> <li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.SS2" title="In 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3.2 Geo-localization with Reasoning</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4 Experiments</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1" title="In 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.1 Experiments on Locatability-Enhanced Dataset</a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1.SSS1" title="In 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.1.1 Qualitative Comparison</a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1.SSS2" title="In 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.1.2 Quantitative Comparison</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2" title="In 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.2 Experiments on Geo-localization with Reasoning</a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS1" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.2.1 Qualitative Comparison with SOTA</a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS2" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.2.2 Quantitative Comparison with SOTA</a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS3" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.2.3 Ablation Experiments</a></li> <li class="ltx_tocentry ltx_tocentry_subsubsection"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS4" title="In 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.2.4 Generalizability Evaluation</a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S5" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">5 Discussion</a></li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S6" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">6 Conclusion</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A1" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">A Implementation Details</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A2" title="In GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">B Additional Qualitative Results</a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_pruned_first"> <h1 class="ltx_title ltx_title_document">GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model</h1> <div class="ltx_authors"> Ling Li Yu Ye Bingchuan Jiang Wei Zeng </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at <a class="ltx_ref ltx_href" href="https://github.com/lingli1996/GeoReasoner" title="">https://github.com/lingli1996/GeoReasoner</a>. </div> <div class="ltx_keywords">Machine Learning, ICML </div> <div class="ltx_para" id="p2"> </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> 1 Introduction</h2> <div class="ltx_para" id="S1.p1"> Street-view geo-localization seeks to predict geographical locations for the given street-view images. The significance of street-view geo-localization is evident in a variety of applications, spanning social studies <cite class="ltx_cite ltx_citemacro_citep">(Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib36" title="">2019b</a>)</cite>, urban planning <cite class="ltx_cite ltx_citemacro_citep">(Shen et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib27" title="">2018</a>)</cite>, and navigation <cite class="ltx_cite ltx_citemacro_citep">(Chalvatzaras et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib5" title="">2022</a>)</cite>. As shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S1.F1" title="Figure 1 ‣ 1 Introduction ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">1</a> (left), existing frameworks for street-view geo-localization can be mainly divided into two categories: retrieval-based and classification-based. Retrieval-based approaches entail identifying the most similar image within a geo-tagged image gallery and returning the corresponding geographical location <cite class="ltx_cite ltx_citemacro_citep">(Zhu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib41" title="">2022</a>; Lin et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib16" title="">2022</a>; Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib40" title="">2023b</a>)</cite>. However, the methods rely on the diversity and comprehensiveness of the geo-tagged image gallery, which can be challenging to curate. Alternatively, classification-based approaches partition the Earth’s surface into distinct regions and assign the input image to a specific region <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>; Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>; Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>; Seo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>; Weyand et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>. While these methods leverage shared visual features within a single region, they may neglect valuable semantic information (e.g., signboard texts) crucial for geo-localization. More importantly, these classification methods often operate as black-box models, lacking reasoning capabilities for users to interpret. </div> <figure class="ltx_figure" id="S1.F1"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="439" id="S1.F1.1.1.1.g1" src="x1.png" width="822"/> <figcaption class="ltx_caption ltx_centering">Figure 1: Different paradigms in existing and the proposed geo-localization approaches: retrieval-based (left-top), classification-based (left-bottom), and our LVLM-based (right).</figcaption> </figure> <div class="ltx_para" id="S1.p2"> Achieving street view-based geo-localization with reasoning capability poses a considerable challenge. This study introduces a new paradigm that facilitates geo-localization with reasoning capability for street-view images, as depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S1.F1" title="Figure 1 ‣ 1 Introduction ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">1</a>(right). The paradigm leverages an LVLM for its excellent capability in handling multi-modal visual and textual inputs and incorporates external knowledge learned from various online games for the reasoning procedure. Specifically, we introduce the concept of locatability as a metric to quantify the degree of locatability in street-view images. On this basis, we devise a CLIP-based visual-text pairing network to match large-scale Google Street View (GSV) images with 3K finely reasoned text-image pairs from online games, to tackle the challenge of the absence of a high-quality street-view dataset. The process filters through over 70K GSV images with geo-tags, all of which exhibit a high degree of locatability. </div> <div class="ltx_para" id="S1.p3"> Next, we construct an LVLM model, named GeoReasoner, to overcome the difficulty of integrating reasoning capability in geo-localization. The training procedures of GeoReasoner are divided into two folds: reasoning tuning and location tuning. In the first stage, we utilize the 3K reasoned text-image pairs encapsulating human inference knowledge, to fine-tune a well-trained LVLM model with LoRA <cite class="ltx_cite ltx_citemacro_citep">(Hu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib13" title="">2022</a>)</cite> for reasoning adaptation. In the second stage, we leverage the curated 70K high-locatability GSV images dataset, to further fine-tune the LVLM model with another LoRA stacked on the first one for location tuning. We assess GeoReasoner in terms of accuracy for both country-level (i.e., predicting the country in which a street view is located) and city-level (i.e., predicting the city in which a street view is located) geo-localization. The results demonstrate that GeoReasoner outperforms the other counterparts by more than 25% at the country-level geo-localization and 38% at the city-level geo-localization with reasoning on our test dataset. Notably, GeoReasoner performs slightly better than StreetCILP <cite class="ltx_cite ltx_citemacro_citep">(Haas et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib11" title="">2023</a>)</cite>, which was trained on a substantially larger dataset of 1.1 million geo-tagged street-view images. We also evaluate GeoReasoner against state-of-the-art models for geo-localization using open benchmark datasets. The results show that GeoReasoner achieves comparable performance with only 10k Flickr images used for training. The main contributions of our work are: <ul class="ltx_itemize" id="S1.I1"> <li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;"> • <div class="ltx_para" id="S1.I1.i1.p1"> We present a new paradigm that leverages an LVLM and external knowledge of human inference for geo-localization with reasoning from street-view images. </div> </li> <li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;"> • <div class="ltx_para" id="S1.I1.i2.p1"> We introduce the concept of locatability and devise a CLIP-based network to quantify the degree of locatability in street-view images. </div> </li> <li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;"> • <div class="ltx_para" id="S1.I1.i3.p1"> We propose GeoReasoner, an LVLM that outperforms existing geo-localization models and provides detailed reasoning for the inferred results. </div> </li> </ul> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> 2 Related work</h2> <section class="ltx_subsection" id="S2.SS1"> <h3 class="ltx_title ltx_title_subsection"> 2.1 Street Views</h3> <div class="ltx_para" id="S2.SS1.p1"> Street views, as the realm of physical environments routinely accessed and engaged with in daily life, bear significant relevance to human perception <cite class="ltx_cite ltx_citemacro_citep">(Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib36" title="">2019b</a>)</cite> and urban design <cite class="ltx_cite ltx_citemacro_citep">(Shen et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib27" title="">2018</a>)</cite>. Analyses of street views contribute to decision-making support <cite class="ltx_cite ltx_citemacro_citep">(Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib35" title="">2019a</a>)</cite>, improved understanding of urban social and economic structures <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib3" title="">2023b</a>)</cite>, and traffic asset monitoring and maintenance <cite class="ltx_cite ltx_citemacro_citep">(Campbell et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib4" title="">2019</a>; Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib15" title="">2021</a>)</cite>. This study places an emphasis on geo-localization based on street views. Specifically, drawing motivation from <cite class="ltx_cite ltx_citemacro_citet">Zhang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib38" title="">2018</a>)</cite>, we delineate the distribution of scene elements to quantify the degree of locatability in street views. Highly locatable street-view images are curated to train an LVLM that surpasses existing geo-localization models. </div> </section> <section class="ltx_subsection" id="S2.SS2"> <h3 class="ltx_title ltx_title_subsection"> 2.2 Image-based Geo-localization</h3> <div class="ltx_para" id="S2.SS2.p1"> Geo-localization entails determining spatial coordinates on the Earth’s surface, with broad applications in practical scenarios, including tracking individual trajectories <cite class="ltx_cite ltx_citemacro_citep">(Cheng et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib7" title="">2022</a>)</cite> and positioning autonomous vehicles <cite class="ltx_cite ltx_citemacro_citep">(Chalvatzaras et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib5" title="">2022</a>)</cite>. This study focuses on image-based geo-localization, utilizing image data as input. Research on image-based geo-localization can be primarily classified into two approaches: retrieval-based <cite class="ltx_cite ltx_citemacro_citep">(Zhu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib41" title="">2022</a>; Lin et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib16" title="">2022</a>; Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib40" title="">2023b</a>)</cite> and classification-based <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>; Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>; Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>; Seo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>; Weyand et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>. </div> <div class="ltx_para" id="S2.SS2.p2"> The retrieval-based approach involves the sequential matching of a single image with a gallery of overhead views, each labeled with geographical coordinates, and identifying the result with the highest matching as the location. However, the utilization of this method is limited due to its requirement for additional reference datasets. The classification-based approach, exemplified by <cite class="ltx_cite ltx_citemacro_citet">Weyand et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>, involves subdividing the Earth’s surface into thousands of geographical cells and predicting the geographical unit to which an image belongs. The prediction effectiveness can be boosted with a dataset comprising millions of street views, whilst the granularity is influenced by the number of subdivided geographical cells. As such, many studies have been devoted to learning to corresponding multi-level features at different granularity <cite class="ltx_cite ltx_citemacro_citep">(Vo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib30" title="">2017</a>)</cite>, or multi-pair features for different tasks <cite class="ltx_cite ltx_citemacro_citep">(Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>; Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>; Vivanco Cepeda et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib29" title="">2024</a>)</cite>. </div> <div class="ltx_para" id="S2.SS2.p3"> We approach image-based geo-localization with a novel paradigm. Specifically, we integrate semantic visual concepts that offer locatable features <cite class="ltx_cite ltx_citemacro_citep">(Luo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib18" title="">2022</a>; Theiner et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib28" title="">2022</a>)</cite>, and incorporate human reasoning knowledge learned from geo-localization games using an LVLM. </div> </section> <section class="ltx_subsection" id="S2.SS3"> <h3 class="ltx_title ltx_title_subsection"> 2.3 Vision-Language Models</h3> <div class="ltx_para" id="S2.SS3.p1"> The emergence of Large Language Models (LLMs) has significantly impacted various tasks related to natural language processing. These models exhibit remarkable performance in tasks such as text generation <cite class="ltx_cite ltx_citemacro_citep">(Zhang et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib39" title="">2023a</a>)</cite> and text-based question answering <cite class="ltx_cite ltx_citemacro_citep">(Shao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib26" title="">2023</a>)</cite>, owing to their robust and versatile capabilities. As a result, research attention has shifted towards exploring prompt engineering techniques to enhance the performance of LLMs in downstream tasks <cite class="ltx_cite ltx_citemacro_citep">(Wei et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib31" title="">2022</a>; Yao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib34" title="">2024</a>; Dai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib9" title="">2023</a>; Xu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib33" title="">2023</a>; Ying et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib37" title="">2024</a>)</cite>. </div> <div class="ltx_para" id="S2.SS3.p2"> Large vision-language models (LVLMs) integrate visual encoders with LLMs, exhibiting remarkable effectiveness in visual question-answering tasks <cite class="ltx_cite ltx_citemacro_citep">(Liu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib17" title="">2024</a>; Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>; Rao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib23" title="">2023</a>)</cite>. This study harnesses the capabilities of LVLMs to address geo-localization of street views. However, the optimal utilization of LVLMs remains a challenging issue, particularly due to the absence of high-quality training data and a lack of reasoning capabilities. We overcome these challenges through an innovative paradigm and the thoughtful design of model architecture, contributing to a more effective utilization of LVLMs in this domain. </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> 3 GeoReasoner</h2> <div class="ltx_para" id="S3.p1"> This section outlines our approach to addressing two challenges: 1) the absence of a high-quality street-view geo-localization dataset (discussed in Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.SS1" title="3.1 Locatability-Enhanced Data Curation ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3.1</a>), and 2) the difficulty of integrating reasoning in geo-localization (discussed in Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.SS2" title="3.2 Geo-localization with Reasoning ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3.2</a>), when constructing GeoReasoner. </div> <figure class="ltx_figure" id="S3.F2"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="394" id="S3.F2.1.1.g1" src="x2.png" width="822"/> <figcaption class="ltx_caption ltx_centering">Figure 2: The locatability quantization network devises a CLIP-based visual-text pairing approach to predict the locatability metric.</figcaption> </figure> <figure class="ltx_figure" id="S3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="260" id="S3.F3.g1" src="x3.png" width="822"/> <figcaption class="ltx_caption ltx_centering">Figure 3: The architecture of GeoReasoner consists of three modules: Vision Encoder, VL Adapter and Pre-trained LLM. The model undergoes a two-fold supervised fine-tuning process: reasoning tuning and location tuning, to enable geo-localization with reasoning. </figcaption> </figure> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> 3.1 Locatability-Enhanced Data Curation</h3> <div class="ltx_para" id="S3.SS1.p1"> Throughout the development of this work, we observed variations in the degree of locatability among different street views. For example, the images featuring textual signboards or prominent landmarks (e.g., Eiffel Tower) are easily locatable, whilst those captured in a tunnel or obscured by a wall tend to be less locatable. Refer to Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F4" title="Figure 4 ‣ 3.2 Geo-localization with Reasoning ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4</a> for further illustration. Simply merging all these street-view images to train an LVLM is not an optimal approach, as the inclusion of poor-quality data can adversely affect the training efficiency of updating an LVLM <cite class="ltx_cite ltx_citemacro_citep">(Radford et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib22" title="">2021</a>)</cite>. To this end, we introduce locatability, a metric that quantifies the level of locatability of street-view images. We then devise a CLIP-based visual-text pairing network to produce the desired locatability metric for an input street-view image, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F2" title="Figure 2 ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">2</a>. The network naturally incorporates data from two perspectives: </div> <div class="ltx_para" id="S3.SS1.p2"> <ul class="ltx_itemize" id="S3.I1"> <li class="ltx_item" id="S3.I1.i1" style="list-style-type:none;"> • <div class="ltx_para" id="S3.I1.i1.p1"> Street-View Images. We collected street-view images from the Google Street View111https://www.google.com/streetview (GSV). To enrich the diversity of the dataset, we first selected the top global cities according to the Globalization and World Cities Study Group and Network (GaWC) ranking. Next, we utilized the global OpenStreetMap222https://www.openstreetmap.org (OSM) geographic database to obtain the vector data of the road network in these urban areas. The road network was passed to ArcPy, a Python site package of ArcGIS, to automatically extract sampling points at 4000-meter intervals and generate a CSV table containing information about these sampling points. Subsequently, we employed the GSV API interface to compile a comprehensive dataset. This dataset encompassed street-view images captured from four distinct directions - front, back, left, and right - of each sampling point. Considering the impact of data sparsity and image similarity, we randomly selected two of the four views from each data point, denoted as [<math alttext="\textbf{I}_{x},\textbf{I}_{y}" class="ltx_Math" display="inline" id="S3.I1.i1.p1.1.m1.2"><semantics id="S3.I1.i1.p1.1.m1.2a"><mrow id="S3.I1.i1.p1.1.m1.2.2.2" xref="S3.I1.i1.p1.1.m1.2.2.3.cmml"><msub id="S3.I1.i1.p1.1.m1.1.1.1.1" xref="S3.I1.i1.p1.1.m1.1.1.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.1.1.1.1.2" xref="S3.I1.i1.p1.1.m1.1.1.1.1.2a.cmml">I</mtext><mi id="S3.I1.i1.p1.1.m1.1.1.1.1.3" xref="S3.I1.i1.p1.1.m1.1.1.1.1.3.cmml">x</mi></msub><mo id="S3.I1.i1.p1.1.m1.2.2.2.3" xref="S3.I1.i1.p1.1.m1.2.2.3.cmml">,</mo><msub id="S3.I1.i1.p1.1.m1.2.2.2.2" xref="S3.I1.i1.p1.1.m1.2.2.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.2.2.2.2.2" xref="S3.I1.i1.p1.1.m1.2.2.2.2.2a.cmml">I</mtext><mi id="S3.I1.i1.p1.1.m1.2.2.2.2.3" xref="S3.I1.i1.p1.1.m1.2.2.2.2.3.cmml">y</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.I1.i1.p1.1.m1.2b"><list id="S3.I1.i1.p1.1.m1.2.2.3.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2"><apply id="S3.I1.i1.p1.1.m1.1.1.1.1.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1"><csymbol cd="ambiguous" id="S3.I1.i1.p1.1.m1.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1">subscript</csymbol><ci id="S3.I1.i1.p1.1.m1.1.1.1.1.2a.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.1.1.1.1.2.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1.2">I</mtext></ci><ci id="S3.I1.i1.p1.1.m1.1.1.1.1.3.cmml" xref="S3.I1.i1.p1.1.m1.1.1.1.1.3">𝑥</ci></apply><apply id="S3.I1.i1.p1.1.m1.2.2.2.2.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2"><csymbol cd="ambiguous" id="S3.I1.i1.p1.1.m1.2.2.2.2.1.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2">subscript</csymbol><ci id="S3.I1.i1.p1.1.m1.2.2.2.2.2a.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.I1.i1.p1.1.m1.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2.2">I</mtext></ci><ci id="S3.I1.i1.p1.1.m1.2.2.2.2.3.cmml" xref="S3.I1.i1.p1.1.m1.2.2.2.2.3">𝑦</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S3.I1.i1.p1.1.m1.2c">\textbf{I}_{x},\textbf{I}_{y}</annotation><annotation encoding="application/x-llamapun" id="S3.I1.i1.p1.1.m1.2d">I start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , I start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT</annotation></semantics></math>], where <math alttext="x\in(left,right)" class="ltx_Math" display="inline" id="S3.I1.i1.p1.2.m2.2"><semantics id="S3.I1.i1.p1.2.m2.2a"><mrow id="S3.I1.i1.p1.2.m2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.cmml"><mi id="S3.I1.i1.p1.2.m2.2.2.4" xref="S3.I1.i1.p1.2.m2.2.2.4.cmml">x</mi><mo id="S3.I1.i1.p1.2.m2.2.2.3" xref="S3.I1.i1.p1.2.m2.2.2.3.cmml">∈</mo><mrow id="S3.I1.i1.p1.2.m2.2.2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml"><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.3" stretchy="false" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml">(</mo><mrow id="S3.I1.i1.p1.2.m2.1.1.1.1.1" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.cmml"><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.2" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.2.cmml">l</mi><mo id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.3" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.3.cmml">e</mi><mo id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1a" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.4" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.4.cmml">f</mi><mo id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1b" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.2.m2.1.1.1.1.1.5" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.5.cmml">t</mi></mrow><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.4" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml">,</mo><mrow id="S3.I1.i1.p1.2.m2.2.2.2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.cmml"><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.2" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.2.cmml">r</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.3" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.3.cmml">i</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1a" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.4" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.4.cmml">g</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1b" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.5" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.5.cmml">h</mi><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1c" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.2.m2.2.2.2.2.2.6" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.6.cmml">t</mi></mrow><mo id="S3.I1.i1.p1.2.m2.2.2.2.2.5" stretchy="false" xref="S3.I1.i1.p1.2.m2.2.2.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.I1.i1.p1.2.m2.2b"><apply id="S3.I1.i1.p1.2.m2.2.2.cmml" xref="S3.I1.i1.p1.2.m2.2.2"><in id="S3.I1.i1.p1.2.m2.2.2.3.cmml" xref="S3.I1.i1.p1.2.m2.2.2.3"></in><ci id="S3.I1.i1.p1.2.m2.2.2.4.cmml" xref="S3.I1.i1.p1.2.m2.2.2.4">𝑥</ci><interval closure="open" id="S3.I1.i1.p1.2.m2.2.2.2.3.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2"><apply id="S3.I1.i1.p1.2.m2.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1"><times id="S3.I1.i1.p1.2.m2.1.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.1"></times><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.2.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.2">𝑙</ci><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.3.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.3">𝑒</ci><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.4.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.4">𝑓</ci><ci id="S3.I1.i1.p1.2.m2.1.1.1.1.1.5.cmml" xref="S3.I1.i1.p1.2.m2.1.1.1.1.1.5">𝑡</ci></apply><apply id="S3.I1.i1.p1.2.m2.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2"><times id="S3.I1.i1.p1.2.m2.2.2.2.2.2.1.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.1"></times><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.2">𝑟</ci><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.3.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.3">𝑖</ci><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.4.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.4">𝑔</ci><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.5.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.5">ℎ</ci><ci id="S3.I1.i1.p1.2.m2.2.2.2.2.2.6.cmml" xref="S3.I1.i1.p1.2.m2.2.2.2.2.2.6">𝑡</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.I1.i1.p1.2.m2.2c">x\in(left,right)</annotation><annotation encoding="application/x-llamapun" id="S3.I1.i1.p1.2.m2.2d">italic_x ∈ ( italic_l italic_e italic_f italic_t , italic_r italic_i italic_g italic_h italic_t )</annotation></semantics></math>, <math alttext="y\in(front,back)" class="ltx_Math" display="inline" id="S3.I1.i1.p1.3.m3.2"><semantics id="S3.I1.i1.p1.3.m3.2a"><mrow id="S3.I1.i1.p1.3.m3.2.2" xref="S3.I1.i1.p1.3.m3.2.2.cmml"><mi id="S3.I1.i1.p1.3.m3.2.2.4" xref="S3.I1.i1.p1.3.m3.2.2.4.cmml">y</mi><mo id="S3.I1.i1.p1.3.m3.2.2.3" xref="S3.I1.i1.p1.3.m3.2.2.3.cmml">∈</mo><mrow id="S3.I1.i1.p1.3.m3.2.2.2.2" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml"><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.3" stretchy="false" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml">(</mo><mrow id="S3.I1.i1.p1.3.m3.1.1.1.1.1" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.cmml"><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.2" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.2.cmml">f</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.3" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.3.cmml">r</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1a" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.4" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.4.cmml">o</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1b" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.5" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.5.cmml">n</mi><mo id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1c" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.3.m3.1.1.1.1.1.6" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.6.cmml">t</mi></mrow><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.4" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml">,</mo><mrow id="S3.I1.i1.p1.3.m3.2.2.2.2.2" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.cmml"><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.2" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.2.cmml">b</mi><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.3" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.3.cmml">a</mi><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1a" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.4" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.4.cmml">c</mi><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1b" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml">⁢</mo><mi id="S3.I1.i1.p1.3.m3.2.2.2.2.2.5" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.5.cmml">k</mi></mrow><mo id="S3.I1.i1.p1.3.m3.2.2.2.2.5" stretchy="false" xref="S3.I1.i1.p1.3.m3.2.2.2.3.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.I1.i1.p1.3.m3.2b"><apply id="S3.I1.i1.p1.3.m3.2.2.cmml" xref="S3.I1.i1.p1.3.m3.2.2"><in id="S3.I1.i1.p1.3.m3.2.2.3.cmml" xref="S3.I1.i1.p1.3.m3.2.2.3"></in><ci id="S3.I1.i1.p1.3.m3.2.2.4.cmml" xref="S3.I1.i1.p1.3.m3.2.2.4">𝑦</ci><interval closure="open" id="S3.I1.i1.p1.3.m3.2.2.2.3.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2"><apply id="S3.I1.i1.p1.3.m3.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1"><times id="S3.I1.i1.p1.3.m3.1.1.1.1.1.1.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.1"></times><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.2.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.2">𝑓</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.3.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.3">𝑟</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.4.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.4">𝑜</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.5.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.5">𝑛</ci><ci id="S3.I1.i1.p1.3.m3.1.1.1.1.1.6.cmml" xref="S3.I1.i1.p1.3.m3.1.1.1.1.1.6">𝑡</ci></apply><apply id="S3.I1.i1.p1.3.m3.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2"><times id="S3.I1.i1.p1.3.m3.2.2.2.2.2.1.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.1"></times><ci id="S3.I1.i1.p1.3.m3.2.2.2.2.2.2.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.2">𝑏</ci><ci id="S3.I1.i1.p1.3.m3.2.2.2.2.2.3.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.3">𝑎</ci><ci id="S3.I1.i1.p1.3.m3.2.2.2.2.2.4.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.4">𝑐</ci><ci id="S3.I1.i1.p1.3.m3.2.2.2.2.2.5.cmml" xref="S3.I1.i1.p1.3.m3.2.2.2.2.2.5">𝑘</ci></apply></interval></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.I1.i1.p1.3.m3.2c">y\in(front,back)</annotation><annotation encoding="application/x-llamapun" id="S3.I1.i1.p1.3.m3.2d">italic_y ∈ ( italic_f italic_r italic_o italic_n italic_t , italic_b italic_a italic_c italic_k )</annotation></semantics></math>. The process has yielded a total of over 130k street-view images with geo-tags collected from 72 cities in 48 countries. </div> </li> <li class="ltx_item" id="S3.I1.i2" style="list-style-type:none;"> • <div class="ltx_para" id="S3.I1.i2.p1"> Textual Clues. Textual clues often serve a pivotal role in delineating the geographical locations of street-view images. Two prominent games, GeoGuessr333https://www.geoguessr.com and Tuxun444https://tuxun.fun, which focus on geo-localization through street views, offer a potential solution to this gap. Their communities have collaboratively curated a well-organized collection of textual clues, used for pinpointing geographical locations across various countries and cities. These clues, maintained by both players and administrators, provide valuable domain knowledge that aids in identifying and evaluating key geographical features in street views. While such datasets now exist <cite class="ltx_cite ltx_citemacro_citep">(Luo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib18" title="">2022</a>)</cite>, there are no readily available image-text data pairs specifically tailored for LVLM training. To bridge this gap, we gathered image-text pairs for geo-localization from these two open-source communities. Subsequently, we utilized the BERT-based Named Entity Recognition (NER) <cite class="ltx_cite ltx_citemacro_citep">(Kenton & Toutanova, <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib14" title="">2019</a>)</cite> model to clean and filter text that lacked specific geographical location information. In this way, we collected a total of over 3K textual clues that encapsulate rich geo-localization information. For instance, “houses in central Chile are more likely to have terracotta tiled roofs”. Each clue is paired with a corresponding street-view image. </div> </li> </ul> </div> <div class="ltx_para" id="S3.SS1.p3"> With the GSV images and textual clues, our subsequent task is to filter GSV images with a high degree of locatability, for the purpose of training an LVLM. To achieve this, we design a CLIP-based visual-text pairing network. As depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F2" title="Figure 2 ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">2</a>, the GSV images undergo processing by an image encoder that deduces the image attributes. </div> <div class="ltx_para" id="S3.SS1.p4"> Here, we first use MaskFormer <cite class="ltx_cite ltx_citemacro_citep">(Cheng et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib6" title="">2021</a>)</cite> to predict segmentation masks for various classes in GSV images, such as buildings, sky, and vehicles. We then compute an <math alttext="n" class="ltx_Math" display="inline" id="S3.SS1.p4.1.m1.1"><semantics id="S3.SS1.p4.1.m1.1a"><mi id="S3.SS1.p4.1.m1.1.1" xref="S3.SS1.p4.1.m1.1.1.cmml">n</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.1.m1.1b"><ci id="S3.SS1.p4.1.m1.1.1.cmml" xref="S3.SS1.p4.1.m1.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.1.m1.1c">n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.1.m1.1d">italic_n</annotation></semantics></math>-length vector <math alttext="\textbf{I}_{seg}" class="ltx_Math" display="inline" id="S3.SS1.p4.2.m2.1"><semantics id="S3.SS1.p4.2.m2.1a"><msub id="S3.SS1.p4.2.m2.1.1" xref="S3.SS1.p4.2.m2.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.2.m2.1.1.2" xref="S3.SS1.p4.2.m2.1.1.2a.cmml">I</mtext><mrow id="S3.SS1.p4.2.m2.1.1.3" xref="S3.SS1.p4.2.m2.1.1.3.cmml"><mi id="S3.SS1.p4.2.m2.1.1.3.2" xref="S3.SS1.p4.2.m2.1.1.3.2.cmml">s</mi><mo id="S3.SS1.p4.2.m2.1.1.3.1" xref="S3.SS1.p4.2.m2.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.2.m2.1.1.3.3" xref="S3.SS1.p4.2.m2.1.1.3.3.cmml">e</mi><mo id="S3.SS1.p4.2.m2.1.1.3.1a" xref="S3.SS1.p4.2.m2.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.2.m2.1.1.3.4" xref="S3.SS1.p4.2.m2.1.1.3.4.cmml">g</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.2.m2.1b"><apply id="S3.SS1.p4.2.m2.1.1.cmml" xref="S3.SS1.p4.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.2.m2.1.1.1.cmml" xref="S3.SS1.p4.2.m2.1.1">subscript</csymbol><ci id="S3.SS1.p4.2.m2.1.1.2a.cmml" xref="S3.SS1.p4.2.m2.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.2.m2.1.1.2.cmml" xref="S3.SS1.p4.2.m2.1.1.2">I</mtext></ci><apply id="S3.SS1.p4.2.m2.1.1.3.cmml" xref="S3.SS1.p4.2.m2.1.1.3"><times id="S3.SS1.p4.2.m2.1.1.3.1.cmml" xref="S3.SS1.p4.2.m2.1.1.3.1"></times><ci id="S3.SS1.p4.2.m2.1.1.3.2.cmml" xref="S3.SS1.p4.2.m2.1.1.3.2">𝑠</ci><ci id="S3.SS1.p4.2.m2.1.1.3.3.cmml" xref="S3.SS1.p4.2.m2.1.1.3.3">𝑒</ci><ci id="S3.SS1.p4.2.m2.1.1.3.4.cmml" xref="S3.SS1.p4.2.m2.1.1.3.4">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.2.m2.1c">\textbf{I}_{seg}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.2.m2.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT</annotation></semantics></math>, which quantifies the area ratio of each mask class, where <math alttext="n" class="ltx_Math" display="inline" id="S3.SS1.p4.3.m3.1"><semantics id="S3.SS1.p4.3.m3.1a"><mi id="S3.SS1.p4.3.m3.1.1" xref="S3.SS1.p4.3.m3.1.1.cmml">n</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.3.m3.1b"><ci id="S3.SS1.p4.3.m3.1.1.cmml" xref="S3.SS1.p4.3.m3.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.3.m3.1c">n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.3.m3.1d">italic_n</annotation></semantics></math> represents the number of classes. Subsequently, we utilize Sentence-BERT <cite class="ltx_cite ltx_citemacro_citep">(Reimers & Gurevych, <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib24" title="">2019</a>)</cite> to measure the similarity between textual clues and semantic segmentation labels, yielding an <math alttext="m\times n" class="ltx_Math" display="inline" id="S3.SS1.p4.4.m4.1"><semantics id="S3.SS1.p4.4.m4.1a"><mrow id="S3.SS1.p4.4.m4.1.1" xref="S3.SS1.p4.4.m4.1.1.cmml"><mi id="S3.SS1.p4.4.m4.1.1.2" xref="S3.SS1.p4.4.m4.1.1.2.cmml">m</mi><mo id="S3.SS1.p4.4.m4.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.SS1.p4.4.m4.1.1.1.cmml">×</mo><mi id="S3.SS1.p4.4.m4.1.1.3" xref="S3.SS1.p4.4.m4.1.1.3.cmml">n</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.4.m4.1b"><apply id="S3.SS1.p4.4.m4.1.1.cmml" xref="S3.SS1.p4.4.m4.1.1"><times id="S3.SS1.p4.4.m4.1.1.1.cmml" xref="S3.SS1.p4.4.m4.1.1.1"></times><ci id="S3.SS1.p4.4.m4.1.1.2.cmml" xref="S3.SS1.p4.4.m4.1.1.2">𝑚</ci><ci id="S3.SS1.p4.4.m4.1.1.3.cmml" xref="S3.SS1.p4.4.m4.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.4.m4.1c">m\times n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.4.m4.1d">italic_m × italic_n</annotation></semantics></math> matrix <math alttext="M" class="ltx_Math" display="inline" id="S3.SS1.p4.5.m5.1"><semantics id="S3.SS1.p4.5.m5.1a"><mi id="S3.SS1.p4.5.m5.1.1" xref="S3.SS1.p4.5.m5.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.5.m5.1b"><ci id="S3.SS1.p4.5.m5.1.1.cmml" xref="S3.SS1.p4.5.m5.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.5.m5.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.5.m5.1d">italic_M</annotation></semantics></math>, where <math alttext="m" class="ltx_Math" display="inline" id="S3.SS1.p4.6.m6.1"><semantics id="S3.SS1.p4.6.m6.1a"><mi id="S3.SS1.p4.6.m6.1.1" xref="S3.SS1.p4.6.m6.1.1.cmml">m</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.6.m6.1b"><ci id="S3.SS1.p4.6.m6.1.1.cmml" xref="S3.SS1.p4.6.m6.1.1">𝑚</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.6.m6.1c">m</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.6.m6.1d">italic_m</annotation></semantics></math> is the number of textual clues. After that, we normalize <math alttext="M" class="ltx_Math" display="inline" id="S3.SS1.p4.7.m7.1"><semantics id="S3.SS1.p4.7.m7.1a"><mi id="S3.SS1.p4.7.m7.1.1" xref="S3.SS1.p4.7.m7.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.7.m7.1b"><ci id="S3.SS1.p4.7.m7.1.1.cmml" xref="S3.SS1.p4.7.m7.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.7.m7.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.7.m7.1d">italic_M</annotation></semantics></math> using min-max normalization, and set values lower than the threshold to zero, resulting in another <math alttext="m\times n" class="ltx_Math" display="inline" id="S3.SS1.p4.8.m8.1"><semantics id="S3.SS1.p4.8.m8.1a"><mrow id="S3.SS1.p4.8.m8.1.1" xref="S3.SS1.p4.8.m8.1.1.cmml"><mi id="S3.SS1.p4.8.m8.1.1.2" xref="S3.SS1.p4.8.m8.1.1.2.cmml">m</mi><mo id="S3.SS1.p4.8.m8.1.1.1" lspace="0.222em" rspace="0.222em" xref="S3.SS1.p4.8.m8.1.1.1.cmml">×</mo><mi id="S3.SS1.p4.8.m8.1.1.3" xref="S3.SS1.p4.8.m8.1.1.3.cmml">n</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.8.m8.1b"><apply id="S3.SS1.p4.8.m8.1.1.cmml" xref="S3.SS1.p4.8.m8.1.1"><times id="S3.SS1.p4.8.m8.1.1.1.cmml" xref="S3.SS1.p4.8.m8.1.1.1"></times><ci id="S3.SS1.p4.8.m8.1.1.2.cmml" xref="S3.SS1.p4.8.m8.1.1.2">𝑚</ci><ci id="S3.SS1.p4.8.m8.1.1.3.cmml" xref="S3.SS1.p4.8.m8.1.1.3">𝑛</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.8.m8.1c">m\times n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.8.m8.1d">italic_m × italic_n</annotation></semantics></math> matrix <math alttext="\hat{M}" class="ltx_Math" display="inline" id="S3.SS1.p4.9.m9.1"><semantics id="S3.SS1.p4.9.m9.1a"><mover accent="true" id="S3.SS1.p4.9.m9.1.1" xref="S3.SS1.p4.9.m9.1.1.cmml"><mi id="S3.SS1.p4.9.m9.1.1.2" xref="S3.SS1.p4.9.m9.1.1.2.cmml">M</mi><mo id="S3.SS1.p4.9.m9.1.1.1" xref="S3.SS1.p4.9.m9.1.1.1.cmml">^</mo></mover><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.9.m9.1b"><apply id="S3.SS1.p4.9.m9.1.1.cmml" xref="S3.SS1.p4.9.m9.1.1"><ci id="S3.SS1.p4.9.m9.1.1.1.cmml" xref="S3.SS1.p4.9.m9.1.1.1">^</ci><ci id="S3.SS1.p4.9.m9.1.1.2.cmml" xref="S3.SS1.p4.9.m9.1.1.2">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.9.m9.1c">\hat{M}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.9.m9.1d">over^ start_ARG italic_M end_ARG</annotation></semantics></math>. We reduce <math alttext="\hat{M}" class="ltx_Math" display="inline" id="S3.SS1.p4.10.m10.1"><semantics id="S3.SS1.p4.10.m10.1a"><mover accent="true" id="S3.SS1.p4.10.m10.1.1" xref="S3.SS1.p4.10.m10.1.1.cmml"><mi id="S3.SS1.p4.10.m10.1.1.2" xref="S3.SS1.p4.10.m10.1.1.2.cmml">M</mi><mo id="S3.SS1.p4.10.m10.1.1.1" xref="S3.SS1.p4.10.m10.1.1.1.cmml">^</mo></mover><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.10.m10.1b"><apply id="S3.SS1.p4.10.m10.1.1.cmml" xref="S3.SS1.p4.10.m10.1.1"><ci id="S3.SS1.p4.10.m10.1.1.1.cmml" xref="S3.SS1.p4.10.m10.1.1.1">^</ci><ci id="S3.SS1.p4.10.m10.1.1.2.cmml" xref="S3.SS1.p4.10.m10.1.1.2">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.10.m10.1c">\hat{M}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.10.m10.1d">over^ start_ARG italic_M end_ARG</annotation></semantics></math> to an <math alttext="n" class="ltx_Math" display="inline" id="S3.SS1.p4.11.m11.1"><semantics id="S3.SS1.p4.11.m11.1a"><mi id="S3.SS1.p4.11.m11.1.1" xref="S3.SS1.p4.11.m11.1.1.cmml">n</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.11.m11.1b"><ci id="S3.SS1.p4.11.m11.1.1.cmml" xref="S3.SS1.p4.11.m11.1.1">𝑛</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.11.m11.1c">n</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.11.m11.1d">italic_n</annotation></semantics></math>-length vector by calculating the mean across its rows, and then normalize it to obtain <math alttext="\textbf{w}_{loc}" class="ltx_Math" display="inline" id="S3.SS1.p4.12.m12.1"><semantics id="S3.SS1.p4.12.m12.1a"><msub id="S3.SS1.p4.12.m12.1.1" xref="S3.SS1.p4.12.m12.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.12.m12.1.1.2" xref="S3.SS1.p4.12.m12.1.1.2a.cmml">w</mtext><mrow id="S3.SS1.p4.12.m12.1.1.3" xref="S3.SS1.p4.12.m12.1.1.3.cmml"><mi id="S3.SS1.p4.12.m12.1.1.3.2" xref="S3.SS1.p4.12.m12.1.1.3.2.cmml">l</mi><mo id="S3.SS1.p4.12.m12.1.1.3.1" xref="S3.SS1.p4.12.m12.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.12.m12.1.1.3.3" xref="S3.SS1.p4.12.m12.1.1.3.3.cmml">o</mi><mo id="S3.SS1.p4.12.m12.1.1.3.1a" xref="S3.SS1.p4.12.m12.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.12.m12.1.1.3.4" xref="S3.SS1.p4.12.m12.1.1.3.4.cmml">c</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.12.m12.1b"><apply id="S3.SS1.p4.12.m12.1.1.cmml" xref="S3.SS1.p4.12.m12.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.12.m12.1.1.1.cmml" xref="S3.SS1.p4.12.m12.1.1">subscript</csymbol><ci id="S3.SS1.p4.12.m12.1.1.2a.cmml" xref="S3.SS1.p4.12.m12.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.12.m12.1.1.2.cmml" xref="S3.SS1.p4.12.m12.1.1.2">w</mtext></ci><apply id="S3.SS1.p4.12.m12.1.1.3.cmml" xref="S3.SS1.p4.12.m12.1.1.3"><times id="S3.SS1.p4.12.m12.1.1.3.1.cmml" xref="S3.SS1.p4.12.m12.1.1.3.1"></times><ci id="S3.SS1.p4.12.m12.1.1.3.2.cmml" xref="S3.SS1.p4.12.m12.1.1.3.2">𝑙</ci><ci id="S3.SS1.p4.12.m12.1.1.3.3.cmml" xref="S3.SS1.p4.12.m12.1.1.3.3">𝑜</ci><ci id="S3.SS1.p4.12.m12.1.1.3.4.cmml" xref="S3.SS1.p4.12.m12.1.1.3.4">𝑐</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.12.m12.1c">\textbf{w}_{loc}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.12.m12.1d">w start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT</annotation></semantics></math>. This vector represents the importance of each semantic segmentation label for geo-localization. With the segmentation mask ratio <math alttext="\textbf{I}_{seg}" class="ltx_Math" display="inline" id="S3.SS1.p4.13.m13.1"><semantics id="S3.SS1.p4.13.m13.1a"><msub id="S3.SS1.p4.13.m13.1.1" xref="S3.SS1.p4.13.m13.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.13.m13.1.1.2" xref="S3.SS1.p4.13.m13.1.1.2a.cmml">I</mtext><mrow id="S3.SS1.p4.13.m13.1.1.3" xref="S3.SS1.p4.13.m13.1.1.3.cmml"><mi id="S3.SS1.p4.13.m13.1.1.3.2" xref="S3.SS1.p4.13.m13.1.1.3.2.cmml">s</mi><mo id="S3.SS1.p4.13.m13.1.1.3.1" xref="S3.SS1.p4.13.m13.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.13.m13.1.1.3.3" xref="S3.SS1.p4.13.m13.1.1.3.3.cmml">e</mi><mo id="S3.SS1.p4.13.m13.1.1.3.1a" xref="S3.SS1.p4.13.m13.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.13.m13.1.1.3.4" xref="S3.SS1.p4.13.m13.1.1.3.4.cmml">g</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.13.m13.1b"><apply id="S3.SS1.p4.13.m13.1.1.cmml" xref="S3.SS1.p4.13.m13.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.13.m13.1.1.1.cmml" xref="S3.SS1.p4.13.m13.1.1">subscript</csymbol><ci id="S3.SS1.p4.13.m13.1.1.2a.cmml" xref="S3.SS1.p4.13.m13.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.13.m13.1.1.2.cmml" xref="S3.SS1.p4.13.m13.1.1.2">I</mtext></ci><apply id="S3.SS1.p4.13.m13.1.1.3.cmml" xref="S3.SS1.p4.13.m13.1.1.3"><times id="S3.SS1.p4.13.m13.1.1.3.1.cmml" xref="S3.SS1.p4.13.m13.1.1.3.1"></times><ci id="S3.SS1.p4.13.m13.1.1.3.2.cmml" xref="S3.SS1.p4.13.m13.1.1.3.2">𝑠</ci><ci id="S3.SS1.p4.13.m13.1.1.3.3.cmml" xref="S3.SS1.p4.13.m13.1.1.3.3">𝑒</ci><ci id="S3.SS1.p4.13.m13.1.1.3.4.cmml" xref="S3.SS1.p4.13.m13.1.1.3.4">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.13.m13.1c">\textbf{I}_{seg}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.13.m13.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT</annotation></semantics></math> and the corresponding weight <math alttext="\textbf{w}_{loc}" class="ltx_Math" display="inline" id="S3.SS1.p4.14.m14.1"><semantics id="S3.SS1.p4.14.m14.1a"><msub id="S3.SS1.p4.14.m14.1.1" xref="S3.SS1.p4.14.m14.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.14.m14.1.1.2" xref="S3.SS1.p4.14.m14.1.1.2a.cmml">w</mtext><mrow id="S3.SS1.p4.14.m14.1.1.3" xref="S3.SS1.p4.14.m14.1.1.3.cmml"><mi id="S3.SS1.p4.14.m14.1.1.3.2" xref="S3.SS1.p4.14.m14.1.1.3.2.cmml">l</mi><mo id="S3.SS1.p4.14.m14.1.1.3.1" xref="S3.SS1.p4.14.m14.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.14.m14.1.1.3.3" xref="S3.SS1.p4.14.m14.1.1.3.3.cmml">o</mi><mo id="S3.SS1.p4.14.m14.1.1.3.1a" xref="S3.SS1.p4.14.m14.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p4.14.m14.1.1.3.4" xref="S3.SS1.p4.14.m14.1.1.3.4.cmml">c</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p4.14.m14.1b"><apply id="S3.SS1.p4.14.m14.1.1.cmml" xref="S3.SS1.p4.14.m14.1.1"><csymbol cd="ambiguous" id="S3.SS1.p4.14.m14.1.1.1.cmml" xref="S3.SS1.p4.14.m14.1.1">subscript</csymbol><ci id="S3.SS1.p4.14.m14.1.1.2a.cmml" xref="S3.SS1.p4.14.m14.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p4.14.m14.1.1.2.cmml" xref="S3.SS1.p4.14.m14.1.1.2">w</mtext></ci><apply id="S3.SS1.p4.14.m14.1.1.3.cmml" xref="S3.SS1.p4.14.m14.1.1.3"><times id="S3.SS1.p4.14.m14.1.1.3.1.cmml" xref="S3.SS1.p4.14.m14.1.1.3.1"></times><ci id="S3.SS1.p4.14.m14.1.1.3.2.cmml" xref="S3.SS1.p4.14.m14.1.1.3.2">𝑙</ci><ci id="S3.SS1.p4.14.m14.1.1.3.3.cmml" xref="S3.SS1.p4.14.m14.1.1.3.3">𝑜</ci><ci id="S3.SS1.p4.14.m14.1.1.3.4.cmml" xref="S3.SS1.p4.14.m14.1.1.3.4">𝑐</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p4.14.m14.1c">\textbf{w}_{loc}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p4.14.m14.1d">w start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT</annotation></semantics></math>, the locatability metric of a GSV image is computed through the multiplication and accumulation of the respective values, as follows: <table class="ltx_equation ltx_eqn_table" id="S3.E1"> <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline"> <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td> <td class="ltx_eqn_cell ltx_align_center"><math alttext="locatability(\textbf{I}_{seg},\textbf{w}_{loc})=\sum_{k=1}^{n}\textbf{I}_{seg}% (k)\cdot\textbf{w}^{k}_{loc},\vspace{-3mm}" class="ltx_Math" display="block" id="S3.E1.m1.2"><semantics id="S3.E1.m1.2a"><mrow id="S3.E1.m1.2.2.1" xref="S3.E1.m1.2.2.1.1.cmml"><mrow id="S3.E1.m1.2.2.1.1" xref="S3.E1.m1.2.2.1.1.cmml"><mrow id="S3.E1.m1.2.2.1.1.2" xref="S3.E1.m1.2.2.1.1.2.cmml"><mi id="S3.E1.m1.2.2.1.1.2.4" xref="S3.E1.m1.2.2.1.1.2.4.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.2.3" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.5" xref="S3.E1.m1.2.2.1.1.2.5.cmml">o</mi><mo id="S3.E1.m1.2.2.1.1.2.3a" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.6" xref="S3.E1.m1.2.2.1.1.2.6.cmml">c</mi><mo id="S3.E1.m1.2.2.1.1.2.3b" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.7" xref="S3.E1.m1.2.2.1.1.2.7.cmml">a</mi><mo id="S3.E1.m1.2.2.1.1.2.3c" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.8" xref="S3.E1.m1.2.2.1.1.2.8.cmml">t</mi><mo id="S3.E1.m1.2.2.1.1.2.3d" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.9" xref="S3.E1.m1.2.2.1.1.2.9.cmml">a</mi><mo id="S3.E1.m1.2.2.1.1.2.3e" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.10" xref="S3.E1.m1.2.2.1.1.2.10.cmml">b</mi><mo id="S3.E1.m1.2.2.1.1.2.3f" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.11" xref="S3.E1.m1.2.2.1.1.2.11.cmml">i</mi><mo id="S3.E1.m1.2.2.1.1.2.3g" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.12" xref="S3.E1.m1.2.2.1.1.2.12.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.2.3h" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.13" xref="S3.E1.m1.2.2.1.1.2.13.cmml">i</mi><mo id="S3.E1.m1.2.2.1.1.2.3i" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.14" xref="S3.E1.m1.2.2.1.1.2.14.cmml">t</mi><mo id="S3.E1.m1.2.2.1.1.2.3j" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.15" xref="S3.E1.m1.2.2.1.1.2.15.cmml">y</mi><mo id="S3.E1.m1.2.2.1.1.2.3k" xref="S3.E1.m1.2.2.1.1.2.3.cmml">⁢</mo><mrow id="S3.E1.m1.2.2.1.1.2.2.2" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml"><mo id="S3.E1.m1.2.2.1.1.2.2.2.3" stretchy="false" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml">(</mo><msub id="S3.E1.m1.2.2.1.1.1.1.1.1" xref="S3.E1.m1.2.2.1.1.1.1.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.1.1.1.1.2" xref="S3.E1.m1.2.2.1.1.1.1.1.1.2a.cmml">I</mtext><mrow id="S3.E1.m1.2.2.1.1.1.1.1.1.3" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.cmml"><mi id="S3.E1.m1.2.2.1.1.1.1.1.1.3.2" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.2.cmml">s</mi><mo id="S3.E1.m1.2.2.1.1.1.1.1.1.3.1" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.1.1.1.1.3.3" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.3.cmml">e</mi><mo id="S3.E1.m1.2.2.1.1.1.1.1.1.3.1a" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.1.1.1.1.3.4" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.4.cmml">g</mi></mrow></msub><mo id="S3.E1.m1.2.2.1.1.2.2.2.4" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml">,</mo><msub id="S3.E1.m1.2.2.1.1.2.2.2.2" xref="S3.E1.m1.2.2.1.1.2.2.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.2.2.2.2.2" xref="S3.E1.m1.2.2.1.1.2.2.2.2.2a.cmml">w</mtext><mrow id="S3.E1.m1.2.2.1.1.2.2.2.2.3" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.cmml"><mi id="S3.E1.m1.2.2.1.1.2.2.2.2.3.2" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.2.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.2.2.2.2.3.1" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.2.2.2.3.3" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.3.cmml">o</mi><mo id="S3.E1.m1.2.2.1.1.2.2.2.2.3.1a" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.2.2.2.2.3.4" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.4.cmml">c</mi></mrow></msub><mo id="S3.E1.m1.2.2.1.1.2.2.2.5" stretchy="false" xref="S3.E1.m1.2.2.1.1.2.2.3.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.2.2.1.1.3" rspace="0.111em" xref="S3.E1.m1.2.2.1.1.3.cmml">=</mo><mrow id="S3.E1.m1.2.2.1.1.4" xref="S3.E1.m1.2.2.1.1.4.cmml"><munderover id="S3.E1.m1.2.2.1.1.4.1" xref="S3.E1.m1.2.2.1.1.4.1.cmml"><mo id="S3.E1.m1.2.2.1.1.4.1.2.2" movablelimits="false" xref="S3.E1.m1.2.2.1.1.4.1.2.2.cmml">∑</mo><mrow id="S3.E1.m1.2.2.1.1.4.1.2.3" xref="S3.E1.m1.2.2.1.1.4.1.2.3.cmml"><mi id="S3.E1.m1.2.2.1.1.4.1.2.3.2" xref="S3.E1.m1.2.2.1.1.4.1.2.3.2.cmml">k</mi><mo id="S3.E1.m1.2.2.1.1.4.1.2.3.1" xref="S3.E1.m1.2.2.1.1.4.1.2.3.1.cmml">=</mo><mn id="S3.E1.m1.2.2.1.1.4.1.2.3.3" xref="S3.E1.m1.2.2.1.1.4.1.2.3.3.cmml">1</mn></mrow><mi id="S3.E1.m1.2.2.1.1.4.1.3" xref="S3.E1.m1.2.2.1.1.4.1.3.cmml">n</mi></munderover><mrow id="S3.E1.m1.2.2.1.1.4.2" xref="S3.E1.m1.2.2.1.1.4.2.cmml"><mrow id="S3.E1.m1.2.2.1.1.4.2.2" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml"><msub id="S3.E1.m1.2.2.1.1.4.2.2.2" xref="S3.E1.m1.2.2.1.1.4.2.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.2.2.2" xref="S3.E1.m1.2.2.1.1.4.2.2.2.2a.cmml">I</mtext><mrow id="S3.E1.m1.2.2.1.1.4.2.2.2.3" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.cmml"><mi id="S3.E1.m1.2.2.1.1.4.2.2.2.3.2" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.2.cmml">s</mi><mo id="S3.E1.m1.2.2.1.1.4.2.2.2.3.1" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.4.2.2.2.3.3" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.3.cmml">e</mi><mo id="S3.E1.m1.2.2.1.1.4.2.2.2.3.1a" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.4.2.2.2.3.4" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.4.cmml">g</mi></mrow></msub><mo id="S3.E1.m1.2.2.1.1.4.2.2.1" xref="S3.E1.m1.2.2.1.1.4.2.2.1.cmml">⁢</mo><mrow id="S3.E1.m1.2.2.1.1.4.2.2.3.2" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml"><mo id="S3.E1.m1.2.2.1.1.4.2.2.3.2.1" stretchy="false" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml">(</mo><mi id="S3.E1.m1.1.1" xref="S3.E1.m1.1.1.cmml">k</mi><mo id="S3.E1.m1.2.2.1.1.4.2.2.3.2.2" rspace="0.055em" stretchy="false" xref="S3.E1.m1.2.2.1.1.4.2.2.cmml">)</mo></mrow></mrow><mo id="S3.E1.m1.2.2.1.1.4.2.1" rspace="0.222em" xref="S3.E1.m1.2.2.1.1.4.2.1.cmml">⋅</mo><msubsup id="S3.E1.m1.2.2.1.1.4.2.3" xref="S3.E1.m1.2.2.1.1.4.2.3.cmml"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.3.2.2" xref="S3.E1.m1.2.2.1.1.4.2.3.2.2a.cmml">w</mtext><mrow id="S3.E1.m1.2.2.1.1.4.2.3.3" xref="S3.E1.m1.2.2.1.1.4.2.3.3.cmml"><mi id="S3.E1.m1.2.2.1.1.4.2.3.3.2" xref="S3.E1.m1.2.2.1.1.4.2.3.3.2.cmml">l</mi><mo id="S3.E1.m1.2.2.1.1.4.2.3.3.1" xref="S3.E1.m1.2.2.1.1.4.2.3.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.4.2.3.3.3" xref="S3.E1.m1.2.2.1.1.4.2.3.3.3.cmml">o</mi><mo id="S3.E1.m1.2.2.1.1.4.2.3.3.1a" xref="S3.E1.m1.2.2.1.1.4.2.3.3.1.cmml">⁢</mo><mi id="S3.E1.m1.2.2.1.1.4.2.3.3.4" xref="S3.E1.m1.2.2.1.1.4.2.3.3.4.cmml">c</mi></mrow><mi id="S3.E1.m1.2.2.1.1.4.2.3.2.3" xref="S3.E1.m1.2.2.1.1.4.2.3.2.3.cmml">k</mi></msubsup></mrow></mrow></mrow><mo id="S3.E1.m1.2.2.1.2" xref="S3.E1.m1.2.2.1.1.cmml">,</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.E1.m1.2b"><apply id="S3.E1.m1.2.2.1.1.cmml" xref="S3.E1.m1.2.2.1"><eq id="S3.E1.m1.2.2.1.1.3.cmml" xref="S3.E1.m1.2.2.1.1.3"></eq><apply id="S3.E1.m1.2.2.1.1.2.cmml" xref="S3.E1.m1.2.2.1.1.2"><times id="S3.E1.m1.2.2.1.1.2.3.cmml" xref="S3.E1.m1.2.2.1.1.2.3"></times><ci id="S3.E1.m1.2.2.1.1.2.4.cmml" xref="S3.E1.m1.2.2.1.1.2.4">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.2.5.cmml" xref="S3.E1.m1.2.2.1.1.2.5">𝑜</ci><ci id="S3.E1.m1.2.2.1.1.2.6.cmml" xref="S3.E1.m1.2.2.1.1.2.6">𝑐</ci><ci id="S3.E1.m1.2.2.1.1.2.7.cmml" xref="S3.E1.m1.2.2.1.1.2.7">𝑎</ci><ci id="S3.E1.m1.2.2.1.1.2.8.cmml" xref="S3.E1.m1.2.2.1.1.2.8">𝑡</ci><ci id="S3.E1.m1.2.2.1.1.2.9.cmml" xref="S3.E1.m1.2.2.1.1.2.9">𝑎</ci><ci id="S3.E1.m1.2.2.1.1.2.10.cmml" xref="S3.E1.m1.2.2.1.1.2.10">𝑏</ci><ci id="S3.E1.m1.2.2.1.1.2.11.cmml" xref="S3.E1.m1.2.2.1.1.2.11">𝑖</ci><ci id="S3.E1.m1.2.2.1.1.2.12.cmml" xref="S3.E1.m1.2.2.1.1.2.12">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.2.13.cmml" xref="S3.E1.m1.2.2.1.1.2.13">𝑖</ci><ci id="S3.E1.m1.2.2.1.1.2.14.cmml" xref="S3.E1.m1.2.2.1.1.2.14">𝑡</ci><ci id="S3.E1.m1.2.2.1.1.2.15.cmml" xref="S3.E1.m1.2.2.1.1.2.15">𝑦</ci><interval closure="open" id="S3.E1.m1.2.2.1.1.2.2.3.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2"><apply id="S3.E1.m1.2.2.1.1.1.1.1.1.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.1.1.1.1.1.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1">subscript</csymbol><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.2a.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.1.1.1.1.2.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.2">I</mtext></ci><apply id="S3.E1.m1.2.2.1.1.1.1.1.1.3.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3"><times id="S3.E1.m1.2.2.1.1.1.1.1.1.3.1.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.1"></times><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.3.2.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.2">𝑠</ci><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.3.3.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.3">𝑒</ci><ci id="S3.E1.m1.2.2.1.1.1.1.1.1.3.4.cmml" xref="S3.E1.m1.2.2.1.1.1.1.1.1.3.4">𝑔</ci></apply></apply><apply id="S3.E1.m1.2.2.1.1.2.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.2.2.2.2.1.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2">subscript</csymbol><ci id="S3.E1.m1.2.2.1.1.2.2.2.2.2a.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.2.2.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.2">w</mtext></ci><apply id="S3.E1.m1.2.2.1.1.2.2.2.2.3.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3"><times id="S3.E1.m1.2.2.1.1.2.2.2.2.3.1.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.1"></times><ci id="S3.E1.m1.2.2.1.1.2.2.2.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.2">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.2.2.2.2.3.3.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.3">𝑜</ci><ci id="S3.E1.m1.2.2.1.1.2.2.2.2.3.4.cmml" xref="S3.E1.m1.2.2.1.1.2.2.2.2.3.4">𝑐</ci></apply></apply></interval></apply><apply id="S3.E1.m1.2.2.1.1.4.cmml" xref="S3.E1.m1.2.2.1.1.4"><apply id="S3.E1.m1.2.2.1.1.4.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.1.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1">superscript</csymbol><apply id="S3.E1.m1.2.2.1.1.4.1.2.cmml" xref="S3.E1.m1.2.2.1.1.4.1"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.1.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1">subscript</csymbol><sum id="S3.E1.m1.2.2.1.1.4.1.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.2"></sum><apply id="S3.E1.m1.2.2.1.1.4.1.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.3"><eq id="S3.E1.m1.2.2.1.1.4.1.2.3.1.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.3.1"></eq><ci id="S3.E1.m1.2.2.1.1.4.1.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.1.2.3.2">𝑘</ci><cn id="S3.E1.m1.2.2.1.1.4.1.2.3.3.cmml" type="integer" xref="S3.E1.m1.2.2.1.1.4.1.2.3.3">1</cn></apply></apply><ci id="S3.E1.m1.2.2.1.1.4.1.3.cmml" xref="S3.E1.m1.2.2.1.1.4.1.3">𝑛</ci></apply><apply id="S3.E1.m1.2.2.1.1.4.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2"><ci id="S3.E1.m1.2.2.1.1.4.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.1">⋅</ci><apply id="S3.E1.m1.2.2.1.1.4.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2"><times id="S3.E1.m1.2.2.1.1.4.2.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.1"></times><apply id="S3.E1.m1.2.2.1.1.4.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.2.2.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2">subscript</csymbol><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.2a.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.2.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.2">I</mtext></ci><apply id="S3.E1.m1.2.2.1.1.4.2.2.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3"><times id="S3.E1.m1.2.2.1.1.4.2.2.2.3.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.1"></times><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.2">𝑠</ci><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.3.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.3">𝑒</ci><ci id="S3.E1.m1.2.2.1.1.4.2.2.2.3.4.cmml" xref="S3.E1.m1.2.2.1.1.4.2.2.2.3.4">𝑔</ci></apply></apply><ci id="S3.E1.m1.1.1.cmml" xref="S3.E1.m1.1.1">𝑘</ci></apply><apply id="S3.E1.m1.2.2.1.1.4.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.2.3.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3">subscript</csymbol><apply id="S3.E1.m1.2.2.1.1.4.2.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3"><csymbol cd="ambiguous" id="S3.E1.m1.2.2.1.1.4.2.3.2.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3">superscript</csymbol><ci id="S3.E1.m1.2.2.1.1.4.2.3.2.2a.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.2.2"><mtext class="ltx_mathvariant_bold" id="S3.E1.m1.2.2.1.1.4.2.3.2.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.2.2">w</mtext></ci><ci id="S3.E1.m1.2.2.1.1.4.2.3.2.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.2.3">𝑘</ci></apply><apply id="S3.E1.m1.2.2.1.1.4.2.3.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3"><times id="S3.E1.m1.2.2.1.1.4.2.3.3.1.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.1"></times><ci id="S3.E1.m1.2.2.1.1.4.2.3.3.2.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.2">𝑙</ci><ci id="S3.E1.m1.2.2.1.1.4.2.3.3.3.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.3">𝑜</ci><ci id="S3.E1.m1.2.2.1.1.4.2.3.3.4.cmml" xref="S3.E1.m1.2.2.1.1.4.2.3.3.4">𝑐</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.E1.m1.2c">locatability(\textbf{I}_{seg},\textbf{w}_{loc})=\sum_{k=1}^{n}\textbf{I}_{seg}% (k)\cdot\textbf{w}^{k}_{loc},\vspace{-3mm}</annotation><annotation encoding="application/x-llamapun" id="S3.E1.m1.2d">italic_l italic_o italic_c italic_a italic_t italic_a italic_b italic_i italic_l italic_i italic_t italic_y ( I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT , w start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT ( italic_k ) ⋅ w start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l italic_o italic_c end_POSTSUBSCRIPT ,</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1">(1)</td> </tr></tbody> </table> </div> <div class="ltx_para" id="S3.SS1.p5"> where <math alttext="\textbf{I}_{seg}(k)" class="ltx_Math" display="inline" id="S3.SS1.p5.1.m1.1"><semantics id="S3.SS1.p5.1.m1.1a"><mrow id="S3.SS1.p5.1.m1.1.2" xref="S3.SS1.p5.1.m1.1.2.cmml"><msub id="S3.SS1.p5.1.m1.1.2.2" xref="S3.SS1.p5.1.m1.1.2.2.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p5.1.m1.1.2.2.2" xref="S3.SS1.p5.1.m1.1.2.2.2a.cmml">I</mtext><mrow id="S3.SS1.p5.1.m1.1.2.2.3" xref="S3.SS1.p5.1.m1.1.2.2.3.cmml"><mi id="S3.SS1.p5.1.m1.1.2.2.3.2" xref="S3.SS1.p5.1.m1.1.2.2.3.2.cmml">s</mi><mo id="S3.SS1.p5.1.m1.1.2.2.3.1" xref="S3.SS1.p5.1.m1.1.2.2.3.1.cmml">⁢</mo><mi id="S3.SS1.p5.1.m1.1.2.2.3.3" xref="S3.SS1.p5.1.m1.1.2.2.3.3.cmml">e</mi><mo id="S3.SS1.p5.1.m1.1.2.2.3.1a" xref="S3.SS1.p5.1.m1.1.2.2.3.1.cmml">⁢</mo><mi id="S3.SS1.p5.1.m1.1.2.2.3.4" xref="S3.SS1.p5.1.m1.1.2.2.3.4.cmml">g</mi></mrow></msub><mo id="S3.SS1.p5.1.m1.1.2.1" xref="S3.SS1.p5.1.m1.1.2.1.cmml">⁢</mo><mrow id="S3.SS1.p5.1.m1.1.2.3.2" xref="S3.SS1.p5.1.m1.1.2.cmml"><mo id="S3.SS1.p5.1.m1.1.2.3.2.1" stretchy="false" xref="S3.SS1.p5.1.m1.1.2.cmml">(</mo><mi id="S3.SS1.p5.1.m1.1.1" xref="S3.SS1.p5.1.m1.1.1.cmml">k</mi><mo id="S3.SS1.p5.1.m1.1.2.3.2.2" stretchy="false" xref="S3.SS1.p5.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p5.1.m1.1b"><apply id="S3.SS1.p5.1.m1.1.2.cmml" xref="S3.SS1.p5.1.m1.1.2"><times id="S3.SS1.p5.1.m1.1.2.1.cmml" xref="S3.SS1.p5.1.m1.1.2.1"></times><apply id="S3.SS1.p5.1.m1.1.2.2.cmml" xref="S3.SS1.p5.1.m1.1.2.2"><csymbol cd="ambiguous" id="S3.SS1.p5.1.m1.1.2.2.1.cmml" xref="S3.SS1.p5.1.m1.1.2.2">subscript</csymbol><ci id="S3.SS1.p5.1.m1.1.2.2.2a.cmml" xref="S3.SS1.p5.1.m1.1.2.2.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p5.1.m1.1.2.2.2.cmml" xref="S3.SS1.p5.1.m1.1.2.2.2">I</mtext></ci><apply id="S3.SS1.p5.1.m1.1.2.2.3.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3"><times id="S3.SS1.p5.1.m1.1.2.2.3.1.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.1"></times><ci id="S3.SS1.p5.1.m1.1.2.2.3.2.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.2">𝑠</ci><ci id="S3.SS1.p5.1.m1.1.2.2.3.3.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.3">𝑒</ci><ci id="S3.SS1.p5.1.m1.1.2.2.3.4.cmml" xref="S3.SS1.p5.1.m1.1.2.2.3.4">𝑔</ci></apply></apply><ci id="S3.SS1.p5.1.m1.1.1.cmml" xref="S3.SS1.p5.1.m1.1.1">𝑘</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p5.1.m1.1c">\textbf{I}_{seg}(k)</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p5.1.m1.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT ( italic_k )</annotation></semantics></math> denotes pixel ratio of the <math alttext="k" class="ltx_Math" display="inline" id="S3.SS1.p5.2.m2.1"><semantics id="S3.SS1.p5.2.m2.1a"><mi id="S3.SS1.p5.2.m2.1.1" xref="S3.SS1.p5.2.m2.1.1.cmml">k</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.p5.2.m2.1b"><ci id="S3.SS1.p5.2.m2.1.1.cmml" xref="S3.SS1.p5.2.m2.1.1">𝑘</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p5.2.m2.1c">k</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p5.2.m2.1d">italic_k</annotation></semantics></math>-th class in the segmentation mask <math alttext="\textbf{I}_{seg}" class="ltx_Math" display="inline" id="S3.SS1.p5.3.m3.1"><semantics id="S3.SS1.p5.3.m3.1a"><msub id="S3.SS1.p5.3.m3.1.1" xref="S3.SS1.p5.3.m3.1.1.cmml"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p5.3.m3.1.1.2" xref="S3.SS1.p5.3.m3.1.1.2a.cmml">I</mtext><mrow id="S3.SS1.p5.3.m3.1.1.3" xref="S3.SS1.p5.3.m3.1.1.3.cmml"><mi id="S3.SS1.p5.3.m3.1.1.3.2" xref="S3.SS1.p5.3.m3.1.1.3.2.cmml">s</mi><mo id="S3.SS1.p5.3.m3.1.1.3.1" xref="S3.SS1.p5.3.m3.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p5.3.m3.1.1.3.3" xref="S3.SS1.p5.3.m3.1.1.3.3.cmml">e</mi><mo id="S3.SS1.p5.3.m3.1.1.3.1a" xref="S3.SS1.p5.3.m3.1.1.3.1.cmml">⁢</mo><mi id="S3.SS1.p5.3.m3.1.1.3.4" xref="S3.SS1.p5.3.m3.1.1.3.4.cmml">g</mi></mrow></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.p5.3.m3.1b"><apply id="S3.SS1.p5.3.m3.1.1.cmml" xref="S3.SS1.p5.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS1.p5.3.m3.1.1.1.cmml" xref="S3.SS1.p5.3.m3.1.1">subscript</csymbol><ci id="S3.SS1.p5.3.m3.1.1.2a.cmml" xref="S3.SS1.p5.3.m3.1.1.2"><mtext class="ltx_mathvariant_bold" id="S3.SS1.p5.3.m3.1.1.2.cmml" xref="S3.SS1.p5.3.m3.1.1.2">I</mtext></ci><apply id="S3.SS1.p5.3.m3.1.1.3.cmml" xref="S3.SS1.p5.3.m3.1.1.3"><times id="S3.SS1.p5.3.m3.1.1.3.1.cmml" xref="S3.SS1.p5.3.m3.1.1.3.1"></times><ci id="S3.SS1.p5.3.m3.1.1.3.2.cmml" xref="S3.SS1.p5.3.m3.1.1.3.2">𝑠</ci><ci id="S3.SS1.p5.3.m3.1.1.3.3.cmml" xref="S3.SS1.p5.3.m3.1.1.3.3">𝑒</ci><ci id="S3.SS1.p5.3.m3.1.1.3.4.cmml" xref="S3.SS1.p5.3.m3.1.1.3.4">𝑔</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p5.3.m3.1c">\textbf{I}_{seg}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p5.3.m3.1d">I start_POSTSUBSCRIPT italic_s italic_e italic_g end_POSTSUBSCRIPT</annotation></semantics></math>. </div> <div class="ltx_para" id="S3.SS1.p6"> A higher <math alttext="locatability" class="ltx_Math" display="inline" id="S3.SS1.p6.1.m1.1"><semantics id="S3.SS1.p6.1.m1.1a"><mrow id="S3.SS1.p6.1.m1.1.1" xref="S3.SS1.p6.1.m1.1.1.cmml"><mi id="S3.SS1.p6.1.m1.1.1.2" xref="S3.SS1.p6.1.m1.1.1.2.cmml">l</mi><mo id="S3.SS1.p6.1.m1.1.1.1" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.3" xref="S3.SS1.p6.1.m1.1.1.3.cmml">o</mi><mo id="S3.SS1.p6.1.m1.1.1.1a" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.4" xref="S3.SS1.p6.1.m1.1.1.4.cmml">c</mi><mo id="S3.SS1.p6.1.m1.1.1.1b" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.5" xref="S3.SS1.p6.1.m1.1.1.5.cmml">a</mi><mo id="S3.SS1.p6.1.m1.1.1.1c" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.6" xref="S3.SS1.p6.1.m1.1.1.6.cmml">t</mi><mo id="S3.SS1.p6.1.m1.1.1.1d" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.7" xref="S3.SS1.p6.1.m1.1.1.7.cmml">a</mi><mo id="S3.SS1.p6.1.m1.1.1.1e" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.8" xref="S3.SS1.p6.1.m1.1.1.8.cmml">b</mi><mo id="S3.SS1.p6.1.m1.1.1.1f" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.9" xref="S3.SS1.p6.1.m1.1.1.9.cmml">i</mi><mo id="S3.SS1.p6.1.m1.1.1.1g" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.10" xref="S3.SS1.p6.1.m1.1.1.10.cmml">l</mi><mo id="S3.SS1.p6.1.m1.1.1.1h" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.11" xref="S3.SS1.p6.1.m1.1.1.11.cmml">i</mi><mo id="S3.SS1.p6.1.m1.1.1.1i" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.12" xref="S3.SS1.p6.1.m1.1.1.12.cmml">t</mi><mo id="S3.SS1.p6.1.m1.1.1.1j" xref="S3.SS1.p6.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.p6.1.m1.1.1.13" xref="S3.SS1.p6.1.m1.1.1.13.cmml">y</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p6.1.m1.1b"><apply id="S3.SS1.p6.1.m1.1.1.cmml" xref="S3.SS1.p6.1.m1.1.1"><times id="S3.SS1.p6.1.m1.1.1.1.cmml" xref="S3.SS1.p6.1.m1.1.1.1"></times><ci id="S3.SS1.p6.1.m1.1.1.2.cmml" xref="S3.SS1.p6.1.m1.1.1.2">𝑙</ci><ci id="S3.SS1.p6.1.m1.1.1.3.cmml" xref="S3.SS1.p6.1.m1.1.1.3">𝑜</ci><ci id="S3.SS1.p6.1.m1.1.1.4.cmml" xref="S3.SS1.p6.1.m1.1.1.4">𝑐</ci><ci id="S3.SS1.p6.1.m1.1.1.5.cmml" xref="S3.SS1.p6.1.m1.1.1.5">𝑎</ci><ci id="S3.SS1.p6.1.m1.1.1.6.cmml" xref="S3.SS1.p6.1.m1.1.1.6">𝑡</ci><ci id="S3.SS1.p6.1.m1.1.1.7.cmml" xref="S3.SS1.p6.1.m1.1.1.7">𝑎</ci><ci id="S3.SS1.p6.1.m1.1.1.8.cmml" xref="S3.SS1.p6.1.m1.1.1.8">𝑏</ci><ci id="S3.SS1.p6.1.m1.1.1.9.cmml" xref="S3.SS1.p6.1.m1.1.1.9">𝑖</ci><ci id="S3.SS1.p6.1.m1.1.1.10.cmml" xref="S3.SS1.p6.1.m1.1.1.10">𝑙</ci><ci id="S3.SS1.p6.1.m1.1.1.11.cmml" xref="S3.SS1.p6.1.m1.1.1.11">𝑖</ci><ci id="S3.SS1.p6.1.m1.1.1.12.cmml" xref="S3.SS1.p6.1.m1.1.1.12">𝑡</ci><ci id="S3.SS1.p6.1.m1.1.1.13.cmml" xref="S3.SS1.p6.1.m1.1.1.13">𝑦</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p6.1.m1.1c">locatability</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p6.1.m1.1d">italic_l italic_o italic_c italic_a italic_t italic_a italic_b italic_i italic_l italic_i italic_t italic_y</annotation></semantics></math> value indicates a higher degree of visual clues exhibited in a GSV image for geo-localization, while a lower value suggests the opposite. Empirically, we selected a threshold value of 0.4 for filtering locatable GSV images. This resulted in over 70k highly locatable images with geo-tags passing to the next stage for training an LVLM. </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> 3.2 Geo-localization with Reasoning</h3> <div class="ltx_para" id="S3.SS2.p1"> While many models (e.g., <cite class="ltx_cite ltx_citemacro_citet">Clark et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>); Pramanick et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>); Müller-Budack et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>); Seo et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>); Weyand et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>) exist for image-based geo-localization, these models typically predict locations without providing the inference process. This introduces several limitations: First, the models operate as black boxes without providing insights, making it challenging for users to interpret. This obstacle impedes further refinement of the geo-localization model. More importantly, studies have demonstrated that integrating the reasoning process can enhance the capabilities of LLMs <cite class="ltx_cite ltx_citemacro_citep">(Qiao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib21" title="">2023</a>)</cite>. Therefore, our objective is to construct an LVLM for image-based geo-localization with reasoning capability. </div> <div class="ltx_para" id="S3.SS2.p2"> Model Architecture. Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F3" title="Figure 3 ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3</a> illustrates the architecture of the proposed model GeoReasoner, which is based on Qwen-VL <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>)</cite>. GeoReasoner consists of three modules: Vision Encoder, Vision-Language (VL) Adapter and Pre-trained LLM. Specifically, the Vision Encoder module employs the Vision Transformer (ViT) <cite class="ltx_cite ltx_citemacro_citep">(Dosovitskiy et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib10" title="">2021</a>)</cite> architecture. The input street-view images are resized to a specific resolution and then divided into a set of image patches. To refine image patches into sequential representations compatible with an LLM, the VL Adapter is introduced. In the VL Adapter, the sequence of visual features is initially condensed to a fixed length to address efficiency challenges posed by the substantial number of visual feature sequences. Subsequently, the processed visual features are integrated with the LLM using cross-attention mechanisms. Following this, the compressed visual feature sequence and text sequence are passed to the Pre-trained LLM module, which functions as a decoder for generating the answer. </div> <div class="ltx_para" id="S3.SS2.p3"> Supervised Fine-tuning. The overall model undergoes a staged pre-training process that is divided into two folds: reasoning tuning and location tuning. In the first stage, our objective is to enhance the model’s reasoning capability by utilizing textual clues paired with street-view images collected from geo-localization games. The input street-view image & question, and the output answer are formatted as prompts in the following manner: </div> <div class="ltx_para" id="S3.SS2.p4"> <img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="172" id="S3.SS2.p4.1.1.g1" src="x4.png" width="747"/> </div> <div class="ltx_para" id="S3.SS2.p5"> Here, we can only provide reasoning at the country level due to the granularity exhibited in the image-text pairs. Nevertheless, this reasoning procedure is sufficient to facilitate the second stage of location tuning. Next, we integrate the prior knowledge of country information with highly locatable GSV images with geo-tags to infer the fine-grained city-level location information. We utilize a similar prompt format as in the first stage but without a reasoning requirement. Both stages are fine-tuned from the pre-trained Qwen-VL with LoRA, which contributes to the overall performance improvement of Qwen-VL in both the reasoning and location tuning stages, allowing the model to better capture complex relationships within the image-text pairs. </div> <figure class="ltx_figure" id="S3.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="607" id="S3.F4.1.g1" src="x5.png" width="789"/> <figcaption class="ltx_caption ltx_centering">Figure 4: Locatability examples. Top row: the street views are highly locatable by signboards, architectural styles, and landmarks. Bottom row: no visual clues for locating the street views.</figcaption> </figure> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> 4 Experiments</h2> <div class="ltx_para" id="S4.p1"> We conduct a series of experiments to evaluate the effectiveness of the locatability-enhanced geo-localization dataset (Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1" title="4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.1</a>) and the model GeoReasoner for geo-localization with reasoning (Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2" title="4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.2</a>). </div> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> 4.1 Experiments on Locatability-Enhanced Dataset</h3> <figure class="ltx_figure" id="S4.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="278" id="S4.F5.1.g1" src="extracted/5933322/imgs/fig_boxplot_v2.png" width="527"/> <figcaption class="ltx_caption ltx_centering">Figure 5: The relationship between building proportion and the degree of locatability in street views. The locatability metric peaks when the building proportion is approximately 0.2.</figcaption> </figure> <section class="ltx_subsubsection" id="S4.SS1.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> 4.1.1 Qualitative Comparison</h4> <div class="ltx_para" id="S4.SS1.SSS1.p1"> Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.F4" title="Figure 4 ‣ 3.2 Geo-localization with Reasoning ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4</a> presents examples of the predicted locatability degrees of different street-view images by our locatability quantization network. The top row showcases street views distinguished by prominent localizable attributes. The left image features the Korean language on a signboard, the middle image captures the distinctive Art Nouveau architectural style commonly found in Switzerland, and the right image shows an art & design museum in India. In contrast, street views in the bottom row display lower locatability degrees. The left image resembles a tunnel, lacking additional discernible information for accurate localization. Similarly, the middle image is occluded by a wall, and the right image faces common vegetation that is available worldwide. </div> <figure class="ltx_figure" id="S4.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="282" id="S4.F6.1.g1" src="extracted/5933322/imgs/fig5_acc_improve.png" width="568"/> <figcaption class="ltx_caption ltx_centering">Figure 6: Quantitative comparison of country- and city-level geo-localization accuracy by different models trained on mixed datasets with varying proportions of high locatable GSV images.</figcaption> </figure> <figure class="ltx_figure" id="S4.F7"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="549" id="S4.F7.1.g1" src="x6.png" width="813"/> <figcaption class="ltx_caption ltx_centering">Figure 7: Examples of LVLM-based approaches in geo-localization with reasoning. Prediction results matching the ground truth are highlighted in green, while reasons offering valid information are marked in blue.</figcaption> </figure> <div class="ltx_para" id="S4.SS1.SSS1.p2"> For the proposed locatability metric in Equation <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S3.E1" title="Equation 1 ‣ 3.1 Locatability-Enhanced Data Curation ‣ 3 GeoReasoner ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">1</a>, we also evaluated the relationship between building proportion and the degree of locatability of street views. The results are shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F5" title="Figure 5 ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">5</a>. The locatability metric slightly increases as the building proportion ranges from 0 to 0.2, but decreases as the building proportion continues to increase. The results indicate that buildings are not the sole determinant of locatability. As the proportion of buildings increases, the street-view images transition from panoramic to close-up views, leading to reduced information availability and consequently diminishing the degree of locatability. </div> <div class="ltx_para" id="S4.SS1.SSS1.p3"> The qualitative analysis indicates the effectiveness of the locatability quantization network in predicting locatability degrees of street-view images. Furthermore, the prediction aligns with human inference knowledge harvested from real geo-localization games, providing the ground truths for fine-tuning the reasoning component in GeoReasoner. </div> </section> <section class="ltx_subsubsection" id="S4.SS1.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> 4.1.2 Quantitative Comparison</h4> <div class="ltx_para" id="S4.SS1.SSS2.p1"> We conducted quantitative experiments to investigate the importance of using high-locatability GSV images in training the location component in GeoReasoner. Various datasets were prepared, featuring different proportions of high-locatability GSV images, ranging from 0% (only low-locatability GSV images) to 100% (only high-locatability GSV images). To ensure fairness, each experimental group retained consistent 10K GSV images, with only the proportion of high-locatability images varying. Subsequently, models were trained for each dataset, and their accuracy in country- and city-level geo-localization was evaluated on a randomly sampled set of 1K GSV images. </div> <div class="ltx_para" id="S4.SS1.SSS2.p2"> The experimental results are presented in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F6" title="Figure 6 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">6</a>. Overall, the results reveal that as the proportion of high-locatability GSV images in the training dataset increases, the performance of the fine-tuned location component improves in both country- and city-level geo-localization. Specifically, the country- and city-level geo-localization accuracy increases from 0.63 & 0.47 for 0% high-locatability GSV images, to 0.72 & 0.51 for 100% high-locatability GSV images. Notably, the experiments only utilize 10K GSV images instead of all the curated 70K high-locatability GSV images due to training complexity. Nevertheless, the results demonstrate that high-locatability GSV images offer more meaningful insights and less extraneous noise, making them highly valuable in the geo-localization task. </div> <figure class="ltx_table" id="S4.T1"> <figcaption class="ltx_caption" style="font-size:80%;">Table 1: Comparison of Precision, Recall and F1 scores in country-level and city-level geo-localization. * represents the model trained on high-locatability GSV images.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T1.6.6"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T1.6.6.7.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S4.T1.6.6.7.1.1" rowspan="2">Model</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="3" id="S4.T1.6.6.7.1.2">Country</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="3" id="S4.T1.6.6.7.1.3">City</th> </tr> <tr class="ltx_tr" id="S4.T1.6.6.6"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.1.1.1.1"> Accuracy<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.1.1.1.1.m1.1"><semantics id="S4.T1.1.1.1.1.m1.1a"><mo id="S4.T1.1.1.1.1.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.1.1.1.1.m1.1b"><ci id="S4.T1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.2.2.2.2"> Recall<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.2.2.2.2.m1.1"><semantics id="S4.T1.2.2.2.2.m1.1a"><mo id="S4.T1.2.2.2.2.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.2.2.2.2.m1.1b"><ci id="S4.T1.2.2.2.2.m1.1.1.cmml" xref="S4.T1.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.3.3.3.3"> F1<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.3.3.3.3.m1.1"><semantics id="S4.T1.3.3.3.3.m1.1a"><mo id="S4.T1.3.3.3.3.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.3.3.3.3.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.3.3.3.3.m1.1b"><ci id="S4.T1.3.3.3.3.m1.1.1.cmml" xref="S4.T1.3.3.3.3.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.3.3.3.3.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.3.3.3.3.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.4.4.4.4"> Accuracy<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.4.4.4.4.m1.1"><semantics id="S4.T1.4.4.4.4.m1.1a"><mo id="S4.T1.4.4.4.4.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.4.4.4.4.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.4.4.4.4.m1.1b"><ci id="S4.T1.4.4.4.4.m1.1.1.cmml" xref="S4.T1.4.4.4.4.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.4.4.4.4.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.4.4.4.4.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.5.5.5.5"> Recall<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.5.5.5.5.m1.1"><semantics id="S4.T1.5.5.5.5.m1.1a"><mo id="S4.T1.5.5.5.5.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.5.5.5.5.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.5.5.5.5.m1.1b"><ci id="S4.T1.5.5.5.5.m1.1.1.cmml" xref="S4.T1.5.5.5.5.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.5.5.5.5.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.5.5.5.5.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T1.6.6.6.6"> F1<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T1.6.6.6.6.m1.1"><semantics id="S4.T1.6.6.6.6.m1.1a"><mo id="S4.T1.6.6.6.6.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T1.6.6.6.6.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T1.6.6.6.6.m1.1b"><ci id="S4.T1.6.6.6.6.m1.1.1.cmml" xref="S4.T1.6.6.6.6.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.6.6.6.6.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T1.6.6.6.6.m1.1d">↑</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T1.6.6.8.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T1.6.6.8.1.1"> StreetCLIP <cite class="ltx_cite ltx_citemacro_citep">(Haas et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib11" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.2">0.7943</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.3">1.00</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.4">0.8854</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.5">0.7457</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.6">1.00</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.8.1.7">0.8543</td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.9.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.9.2.1"> LLaVA <cite class="ltx_cite ltx_citemacro_citep">(Liu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib17" title="">2024</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.2">0.4029</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.3">1.00</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.4">0.5744</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.5">0.2400</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.6">1.00</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.9.2.7">0.3871</td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.10.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.10.3.1"> Qwen-VL (Qwen-7B) <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.2">0.5829</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.3">0.95</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.4">0.7225</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.5">0.3743</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.6">0.89</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.10.3.7">0.5270</td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.11.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.11.4.1"> GPT-4V <cite class="ltx_cite ltx_citemacro_citep">(Achiam et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib1" title="">2023</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.2">0.8917</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.3">0.34</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.4">0.4923</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.5">0.5083</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.6">0.31</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.11.4.7">0.3851</td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.12.5"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T1.6.6.12.5.1"> ViT* <cite class="ltx_cite ltx_citemacro_citep">(Dosovitskiy et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib10" title="">2021</a>)</cite> </th> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.2">0.7100</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.3">1.00</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.4">0.8304</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.5">0.6762</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.6">1.00</td> <td class="ltx_td ltx_align_center" id="S4.T1.6.6.12.5.7">0.8068</td> </tr> <tr class="ltx_tr" id="S4.T1.6.6.13.6"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T1.6.6.13.6.1">GeoReasoner*</th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.2">0.8237</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.3">1.00</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.4">0.9033</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.5">0.7521</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.6">1.00</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.13.6.7">0.8585</td> </tr> </tbody> </table> </figure> </section> </section> <section class="ltx_subsection" id="S4.SS2"> <h3 class="ltx_title ltx_title_subsection"> 4.2 Experiments on Geo-localization with Reasoning</h3> <section class="ltx_subsubsection" id="S4.SS2.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> 4.2.1 Qualitative Comparison with SOTA</h4> <div class="ltx_para" id="S4.SS2.SSS1.p1"> To assess the efficacy of GeoReasoner in terms of geo-localization with reasoning, we conduct a qualitative comparison with state-of-the-art LVLM-based approaches, including LLaVA <cite class="ltx_cite ltx_citemacro_citep">(Liu et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib17" title="">2024</a>)</cite>, Qwen-VL (Qwen-7B) <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>)</cite>, and GPT-4V <cite class="ltx_cite ltx_citemacro_citep">(Achiam et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib1" title="">2023</a>)</cite>. In the experimental phase, we presented the same input street-view images, reasoning process, and result formats to these models. Specifically, a consistent prompt is used, as below: </div> <div class="ltx_para" id="S4.SS2.SSS1.p2"> According to the content of the image, please think step by step and deduce in which country and city the image is most likely located and offer possible explanations. Output in JSON format, e.g., {‘country’: ‘’, ‘city’: ‘’, ‘reasons’:‘’}. </div> <div class="ltx_para" id="S4.SS2.SSS1.p3"> Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">7</a> illustrates the inference results of counterpart models and GeoReasoner on three diverse street views from different countries and cities—namely, Singapore-Singapore (top), United States-Las Vegas (middle) and China-Lhasa (bottom). Overall, GeoReasoner not only outperforms existing models in the accuracy of country or city-level predictions but also provides coherent explanations with insightful reasoning for the inference results. </div> <div class="ltx_para" id="S4.SS2.SSS1.p4"> In Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">7</a> (top), GeoReasoner identifies the word ‘COMFORT’ on the taxi in the image. Drawing from prior knowledge, ‘the ComfortDelGro taxi is a distinctive symbol of Singapore’s public transportation system’ in the text-image pairs, the model deduces that the area is likely to be in Singapore. GPT-4V predicts the same geo-location with accurate reasoning, yet the other two models fail, either due to not recognizing the taxi by LLaVA or making an incorrect inference about the city by Qwen-VL. </div> <div class="ltx_para" id="S4.SS2.SSS1.p5"> Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">7</a> (middle) presents a scene of the Las Vegas Strip. A conspicuous ‘NEW YORK’ sign is prominently visible in the upper-left corner of the image. This sign causes the reasoning error in the task performed by LLaVA. Although Qwen-VL generates accurate predictions of Las Vegas-United States, the most essential factor, i.e., ‘Las Vegas Strip’, is not considered in the reasoning process. In contrast, both GeoReasoner and GPT-4V provide the correct geo-location along with accurate inference. </div> <div class="ltx_para" id="S4.SS2.SSS1.p6"> Based on the depiction of Chinese characters and traditional clothing in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.F7" title="Figure 7 ‣ 4.1.1 Qualitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">7</a> (right), all models make accurate predictions regarding the country, identifying it as China. However, LLaVA makes an incorrect prediction of the city, specifying Beijing. In contrast, the other models successfully predict the city as Lhasa, providing sensible and justifiable reasons for their inferences. </div> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> 4.2.2 Quantitative Comparison with SOTA</h4> <div class="ltx_para" id="S4.SS2.SSS2.p1"> We further conduct quantitative experiments to compare with counterparts LVLMs. In addition, we choose StreetCLIP <cite class="ltx_cite ltx_citemacro_citep">(Haas et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib11" title="">2023</a>)</cite> as the state-of-the-art classification-based approach and omit retrieval-based approaches relying on a geo-tagged image gallery that is not available. It is important to clarify that, for the LVLM-based approaches, obtaining corresponding and relevant answers is not guaranteed at all times. Therefore, we included Recall rate to measure the proportion of effective answers within the large language models. When calculating the Accuracy rate, only the accuracy of these effective answers is taken into account. We additionally compute F1 values, taking into consideration both Accuracy and Recall metrics. </div> <div class="ltx_para" id="S4.SS2.SSS2.p2"> Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T1" title="Table 1 ‣ 4.1.2 Quantitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">1</a> presents the prediction results by the counterparts and GeoReasoner. Overall, GeoReasoner outperforms all the counterparts, particularly those LVLM-based approaches. Taking the best performed Qwen-VL for example, GeoReasoner outperforms it 25.02% on country-level geo-localization and 38.61% on city-level geo-localization, in terms of F1 value. Surprisingly, the recall performance of GPT-4V for the geo-localization task was notably low. Most of the responses were mainly: ‘I’m sorry, I can’t provide assistance with that request.’ or ‘I’m sorry, but I am unable to provide the exact location, such as the country and city, for the image you have provided. My capabilities do not include analyzing specific details to determine the geographical location of the image content.’ </div> <figure class="ltx_table" id="S4.T2"> <figcaption class="ltx_caption" style="font-size:70%;">Table 2: Results of the ablation experiments using baseline Qwen-VL (Qwen-7B), GeoReasoner w/o location tuning, GeoReasoner w/o reasoning tuning, and the full GeoReasoner models.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T2.8.8"> <thead class="ltx_thead"> <tr class="ltx_tr" id="S4.T2.8.8.9.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S4.T2.8.8.9.1.1" rowspan="3">Model</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="2" id="S4.T2.8.8.9.1.2">Training</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="6" id="S4.T2.8.8.9.1.3">Performance</th> </tr> <tr class="ltx_tr" id="S4.T2.8.8.10.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.8.8.10.2.1" rowspan="2">Reasoning</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.8.8.10.2.2" rowspan="2">Location</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" colspan="3" id="S4.T2.8.8.10.2.3">Country</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" colspan="3" id="S4.T2.8.8.10.2.4">City</th> </tr> <tr class="ltx_tr" id="S4.T2.6.6.6"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.1.1.1.1"> Accuracy<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.1.1.1.1.m1.1"><semantics id="S4.T2.1.1.1.1.m1.1a"><mo id="S4.T2.1.1.1.1.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.1.1.1.1.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.1.1.1.1.m1.1b"><ci id="S4.T2.1.1.1.1.m1.1.1.cmml" xref="S4.T2.1.1.1.1.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.1.1.1.1.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.1.1.1.1.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.2.2.2.2"> Recall<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.2.2.2.2.m1.1"><semantics id="S4.T2.2.2.2.2.m1.1a"><mo id="S4.T2.2.2.2.2.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.2.2.2.2.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.2.2.2.2.m1.1b"><ci id="S4.T2.2.2.2.2.m1.1.1.cmml" xref="S4.T2.2.2.2.2.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.2.2.2.2.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.2.2.2.2.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.3.3.3.3"> F1<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.3.3.3.3.m1.1"><semantics id="S4.T2.3.3.3.3.m1.1a"><mo id="S4.T2.3.3.3.3.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.3.3.3.3.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.3.3.3.3.m1.1b"><ci id="S4.T2.3.3.3.3.m1.1.1.cmml" xref="S4.T2.3.3.3.3.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.3.3.3.3.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.3.3.3.3.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.4.4.4.4"> Accuracy<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.4.4.4.4.m1.1"><semantics id="S4.T2.4.4.4.4.m1.1a"><mo id="S4.T2.4.4.4.4.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.4.4.4.4.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.4.4.4.4.m1.1b"><ci id="S4.T2.4.4.4.4.m1.1.1.cmml" xref="S4.T2.4.4.4.4.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.4.4.4.4.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.4.4.4.4.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.5.5.5.5"> Recall<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.5.5.5.5.m1.1"><semantics id="S4.T2.5.5.5.5.m1.1a"><mo id="S4.T2.5.5.5.5.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.5.5.5.5.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.5.5.5.5.m1.1b"><ci id="S4.T2.5.5.5.5.m1.1.1.cmml" xref="S4.T2.5.5.5.5.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.5.5.5.5.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.5.5.5.5.m1.1d">↑</annotation></semantics></math> </th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="S4.T2.6.6.6.6"> F1<math alttext="\uparrow" class="ltx_Math" display="inline" id="S4.T2.6.6.6.6.m1.1"><semantics id="S4.T2.6.6.6.6.m1.1a"><mo id="S4.T2.6.6.6.6.m1.1.1" mathsize="80%" stretchy="false" xref="S4.T2.6.6.6.6.m1.1.1.cmml">↑</mo><annotation-xml encoding="MathML-Content" id="S4.T2.6.6.6.6.m1.1b"><ci id="S4.T2.6.6.6.6.m1.1.1.cmml" xref="S4.T2.6.6.6.6.m1.1.1">↑</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.6.6.6.6.m1.1c">\uparrow</annotation><annotation encoding="application/x-llamapun" id="S4.T2.6.6.6.6.m1.1d">↑</annotation></semantics></math> </th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T2.8.8.11.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T2.8.8.11.1.1">Qwen-VL (Qwen-7B)</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.2">-</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.3">-</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.4">0.5829</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.5">0.95</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.6">0.7225</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.7">0.3743</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.8">0.89</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.8.8.11.1.9">0.5270</td> </tr> <tr class="ltx_tr" id="S4.T2.7.7.7"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T2.7.7.7.2">GeoReasoner w/o location tuning</th> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.3">✓</td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T2.7.7.7.1.m1.1"><semantics id="S4.T2.7.7.7.1.m1.1a"><mo id="S4.T2.7.7.7.1.m1.1.1" mathsize="80%" xref="S4.T2.7.7.7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T2.7.7.7.1.m1.1b"><times id="S4.T2.7.7.7.1.m1.1.1.cmml" xref="S4.T2.7.7.7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.7.7.7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T2.7.7.7.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.4">0.6971</td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.5">1.00</td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.6">0.8215</td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.7">0.4114</td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.8">0.99</td> <td class="ltx_td ltx_align_center" id="S4.T2.7.7.7.9">0.5813</td> </tr> <tr class="ltx_tr" id="S4.T2.8.8.8"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T2.8.8.8.2">GeoReasoner w/o reasoning tuning</th> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T2.8.8.8.1.m1.1"><semantics id="S4.T2.8.8.8.1.m1.1a"><mo id="S4.T2.8.8.8.1.m1.1.1" mathsize="80%" xref="S4.T2.8.8.8.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T2.8.8.8.1.m1.1b"><times id="S4.T2.8.8.8.1.m1.1.1.cmml" xref="S4.T2.8.8.8.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.8.8.8.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T2.8.8.8.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.3">✓</td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.4">0.7803</td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.5">1.00</td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.6">0.8766</td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.7">0.7029</td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.8">1.00</td> <td class="ltx_td ltx_align_center" id="S4.T2.8.8.8.9">0.8255</td> </tr> <tr class="ltx_tr" id="S4.T2.8.8.12.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T2.8.8.12.2.1">GeoReasoner</th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.2">✓</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.3">✓</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.4">0.8237</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.5">1.00</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.6">0.9033</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.7">0.7521</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.8">1.00</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.8.8.12.2.9">0.8584</td> </tr> </tbody> </table> </figure> <div class="ltx_para" id="S4.SS2.SSS2.p3"> We speculate that GPT-4V has undergone extensive measures to ensure the model’s security and privacy, which may contribute to its reluctance or denial of recognition in the task of geo-localization. </div> <div class="ltx_para" id="S4.SS2.SSS2.p4"> In comparison to StreetCLIP that is specialized in geo-localization, GeoReasoner demonstrates only a slight superiority. Nevertheless, it’s important to note that StreetCLIP was trained on a significantly larger dataset of over 1.1 million street-view images, while our GeoReasoner was trained with only 70K street views. For ViT trained on the same data, GeoReasoner still exhibits superior geolocation capabilities. Moreover, GeoReasoner offers reasoning capability, providing added value for various downstream tasks. </div> <figure class="ltx_table" id="S4.T3"> <figcaption class="ltx_caption" style="font-size:80%;">Table 3: Comparison results on Im2GPS dataset. The top five rows are derived from the results reported in the paper, while the last four rows are from retesting on the filtered Im2GPS dataset, which includes only highly locatable data.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T3.13.13"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T3.13.13.14.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_tt" id="S4.T3.13.13.14.1.1" rowspan="2">Model</th> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="S4.T3.13.13.14.1.2">Dataset w/ Filter</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T3.13.13.14.1.3">Street</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T3.13.13.14.1.4">City</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T3.13.13.14.1.5">Country</td> </tr> <tr class="ltx_tr" id="S4.T3.13.13.15.2"> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.1">Train</td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.2">Test</td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.3">1km</td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.4">25km</td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.15.2.5">750km</td> </tr> <tr class="ltx_tr" id="S4.T3.2.2.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T3.2.2.2.3">PlaNet</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.1.1.1.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.1.1.1.1.m1.1"><semantics id="S4.T3.1.1.1.1.m1.1a"><mo id="S4.T3.1.1.1.1.m1.1.1" mathsize="80%" xref="S4.T3.1.1.1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.1.1.1.1.m1.1b"><times id="S4.T3.1.1.1.1.m1.1.1.cmml" xref="S4.T3.1.1.1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.1.1.1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.1.1.1.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.2.2.2.2.m1.1"><semantics id="S4.T3.2.2.2.2.m1.1a"><mo id="S4.T3.2.2.2.2.m1.1.1" mathsize="80%" xref="S4.T3.2.2.2.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.2.2.2.2.m1.1b"><times id="S4.T3.2.2.2.2.m1.1.1.cmml" xref="S4.T3.2.2.2.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.2.2.2.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.2.2.2.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.4">0.08</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.5">0.25</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.2.2.2.6">0.54</td> </tr> <tr class="ltx_tr" id="S4.T3.4.4.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.4.4.4.3">CPlaNet</th> <td class="ltx_td ltx_align_center" id="S4.T3.3.3.3.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.3.3.3.1.m1.1"><semantics id="S4.T3.3.3.3.1.m1.1a"><mo id="S4.T3.3.3.3.1.m1.1.1" mathsize="80%" xref="S4.T3.3.3.3.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.3.3.3.1.m1.1b"><times id="S4.T3.3.3.3.1.m1.1.1.cmml" xref="S4.T3.3.3.3.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.3.3.3.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.3.3.3.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.4.4.4.2.m1.1"><semantics id="S4.T3.4.4.4.2.m1.1a"><mo id="S4.T3.4.4.4.2.m1.1.1" mathsize="80%" xref="S4.T3.4.4.4.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.4.4.4.2.m1.1b"><times id="S4.T3.4.4.4.2.m1.1.1.cmml" xref="S4.T3.4.4.4.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.4.4.4.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.4.4.4.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.4">0.17</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.5">0.37</td> <td class="ltx_td ltx_align_center" id="S4.T3.4.4.4.6">0.62</td> </tr> <tr class="ltx_tr" id="S4.T3.6.6.6"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.6.6.6.3">ISNs</th> <td class="ltx_td ltx_align_center" id="S4.T3.5.5.5.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.5.5.5.1.m1.1"><semantics id="S4.T3.5.5.5.1.m1.1a"><mo id="S4.T3.5.5.5.1.m1.1.1" mathsize="80%" xref="S4.T3.5.5.5.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.5.5.5.1.m1.1b"><times id="S4.T3.5.5.5.1.m1.1.1.cmml" xref="S4.T3.5.5.5.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.5.5.5.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.5.5.5.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.6.6.6.2.m1.1"><semantics id="S4.T3.6.6.6.2.m1.1a"><mo id="S4.T3.6.6.6.2.m1.1.1" mathsize="80%" xref="S4.T3.6.6.6.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.6.6.6.2.m1.1b"><times id="S4.T3.6.6.6.2.m1.1.1.cmml" xref="S4.T3.6.6.6.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.6.6.6.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.6.6.6.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.4">0.17</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.5">0.43</td> <td class="ltx_td ltx_align_center" id="S4.T3.6.6.6.6">0.67</td> </tr> <tr class="ltx_tr" id="S4.T3.8.8.8"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.8.8.8.3">Translocator</th> <td class="ltx_td ltx_align_center" id="S4.T3.7.7.7.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.7.7.7.1.m1.1"><semantics id="S4.T3.7.7.7.1.m1.1a"><mo id="S4.T3.7.7.7.1.m1.1.1" mathsize="80%" xref="S4.T3.7.7.7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.7.7.7.1.m1.1b"><times id="S4.T3.7.7.7.1.m1.1.1.cmml" xref="S4.T3.7.7.7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.7.7.7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.7.7.7.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.8.8.8.2.m1.1"><semantics id="S4.T3.8.8.8.2.m1.1a"><mo id="S4.T3.8.8.8.2.m1.1.1" mathsize="80%" xref="S4.T3.8.8.8.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.8.8.8.2.m1.1b"><times id="S4.T3.8.8.8.2.m1.1.1.cmml" xref="S4.T3.8.8.8.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.8.8.8.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.8.8.8.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.4">0.20</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.5">0.48</td> <td class="ltx_td ltx_align_center" id="S4.T3.8.8.8.6">0.76</td> </tr> <tr class="ltx_tr" id="S4.T3.10.10.10"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.10.10.10.3">GeoDecoder</th> <td class="ltx_td ltx_align_center" id="S4.T3.9.9.9.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.9.9.9.1.m1.1"><semantics id="S4.T3.9.9.9.1.m1.1a"><mo id="S4.T3.9.9.9.1.m1.1.1" mathsize="80%" xref="S4.T3.9.9.9.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.9.9.9.1.m1.1b"><times id="S4.T3.9.9.9.1.m1.1.1.cmml" xref="S4.T3.9.9.9.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.9.9.9.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.9.9.9.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.10.10.10.2.m1.1"><semantics id="S4.T3.10.10.10.2.m1.1a"><mo id="S4.T3.10.10.10.2.m1.1.1" mathsize="80%" xref="S4.T3.10.10.10.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.10.10.10.2.m1.1b"><times id="S4.T3.10.10.10.2.m1.1.1.cmml" xref="S4.T3.10.10.10.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.10.10.10.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.10.10.10.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.4">0.22</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.5">0.50</td> <td class="ltx_td ltx_align_center" id="S4.T3.10.10.10.6">0.80</td> </tr> <tr class="ltx_tr" id="S4.T3.11.11.11"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T3.11.11.11.2">ISNs</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.11.11.11.1.m1.1"><semantics id="S4.T3.11.11.11.1.m1.1a"><mo id="S4.T3.11.11.11.1.m1.1.1" mathsize="80%" xref="S4.T3.11.11.11.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.11.11.11.1.m1.1b"><times id="S4.T3.11.11.11.1.m1.1.1.cmml" xref="S4.T3.11.11.11.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.11.11.11.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.11.11.11.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.3">✓</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.4">0.25</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.5">0.43</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T3.11.11.11.6">0.78</td> </tr> <tr class="ltx_tr" id="S4.T3.12.12.12"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.12.12.12.2">GeoCLIP</th> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.12.12.12.1.m1.1"><semantics id="S4.T3.12.12.12.1.m1.1a"><mo id="S4.T3.12.12.12.1.m1.1.1" mathsize="80%" xref="S4.T3.12.12.12.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.12.12.12.1.m1.1b"><times id="S4.T3.12.12.12.1.m1.1.1.cmml" xref="S4.T3.12.12.12.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.12.12.12.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.12.12.12.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.3">✓</td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.4">0.25</td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.5">0.49</td> <td class="ltx_td ltx_align_center" id="S4.T3.12.12.12.6">0.87</td> </tr> <tr class="ltx_tr" id="S4.T3.13.13.13"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T3.13.13.13.2">GeoReasoner</th> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T3.13.13.13.1.m1.1"><semantics id="S4.T3.13.13.13.1.m1.1a"><mo id="S4.T3.13.13.13.1.m1.1.1" mathsize="80%" xref="S4.T3.13.13.13.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T3.13.13.13.1.m1.1b"><times id="S4.T3.13.13.13.1.m1.1.1.cmml" xref="S4.T3.13.13.13.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T3.13.13.13.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T3.13.13.13.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.3">✓</td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.4">0.10</td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.5">0.41</td> <td class="ltx_td ltx_align_center" id="S4.T3.13.13.13.6">0.82</td> </tr> <tr class="ltx_tr" id="S4.T3.13.13.16.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T3.13.13.16.3.1">GeoReasoner</th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.2">✓</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.3">✓</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.4">0.13</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.5">0.44</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T3.13.13.16.3.6">0.86</td> </tr> </tbody> </table> </figure> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS3"> <h4 class="ltx_title ltx_title_subsubsection"> 4.2.3 Ablation Experiments</h4> <div class="ltx_para" id="S4.SS2.SSS3.p1"> To assess the contributions of the location tuning and reasoning tuning components in GeoReasoner, we design several ablation experiments using the Qwen-VL <cite class="ltx_cite ltx_citemacro_citep">(Bai et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib2" title="">2023a</a>)</cite> pre-trained model as the baseline. Next, we integrated the Qwen-VL pre-trained model with LoRA1 (GeoReasoner without location tuning) and LoRA2 (GeoReasoner without reasoning tuning). The last experiment involved the full GeoReasoner model, including both the location tuning and reasoning tuning components. The same prompts were utilized for all these models, as in the previous experiments. </div> <div class="ltx_para" id="S4.SS2.SSS3.p2"> Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T2" title="Table 2 ‣ 4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">2</a> presents the quantitative results in terms of accuracy, recall, and F1. Overall, the results indicate that both the location tuning and reasoning tuning components improve the model performance. Specifically, the location tuning component is essential for geo-localization, as GeoReasoner w/o reasoning tuning (row 3) achieves much higher accuracy than GeoReasoner w/o location tuning (row 2), especially for fine-grained city-level prediction. This result further strengthens the evidence that high-locatability GSV images are essential for geo-localization. The reasoning tuning component also plays a significant role in the performance improvement, as evidenced by the superior performance of the full GeoReasoner (row 4). </div> <figure class="ltx_table" id="S4.T4"> <figcaption class="ltx_caption" style="font-size:80%;">Table 4: Comparison results on Im2GPS3k dataset. The top six rows are derived from the results reported in the paper, while the last four rows are from retesting on the filtered Im2GPS3k dataset, which includes only highly locatable data.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="S4.T4.15.15"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="S4.T4.15.15.16.1"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_tt" id="S4.T4.15.15.16.1.1" rowspan="2">Model</th> <td class="ltx_td ltx_align_center ltx_border_tt" colspan="2" id="S4.T4.15.15.16.1.2">Dataset w/ Filter</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T4.15.15.16.1.3">Street</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T4.15.15.16.1.4">City</td> <td class="ltx_td ltx_align_center ltx_border_tt" id="S4.T4.15.15.16.1.5">Country</td> </tr> <tr class="ltx_tr" id="S4.T4.15.15.17.2"> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.1">Train</td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.2">Test</td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.3">1km</td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.4">25km</td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.17.2.5">750km</td> </tr> <tr class="ltx_tr" id="S4.T4.2.2.2"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T4.2.2.2.3">PlaNet</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.1.1.1.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.1.1.1.1.m1.1"><semantics id="S4.T4.1.1.1.1.m1.1a"><mo id="S4.T4.1.1.1.1.m1.1.1" mathsize="80%" xref="S4.T4.1.1.1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.1.1.1.1.m1.1b"><times id="S4.T4.1.1.1.1.m1.1.1.cmml" xref="S4.T4.1.1.1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.1.1.1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.1.1.1.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.2.2.2.2.m1.1"><semantics id="S4.T4.2.2.2.2.m1.1a"><mo id="S4.T4.2.2.2.2.m1.1.1" mathsize="80%" xref="S4.T4.2.2.2.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.2.2.2.2.m1.1b"><times id="S4.T4.2.2.2.2.m1.1.1.cmml" xref="S4.T4.2.2.2.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.2.2.2.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.2.2.2.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.4">0.09</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.5">0.25</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.2.2.2.6">0.48</td> </tr> <tr class="ltx_tr" id="S4.T4.4.4.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.4.4.4.3">CPlaNet</th> <td class="ltx_td ltx_align_center" id="S4.T4.3.3.3.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.3.3.3.1.m1.1"><semantics id="S4.T4.3.3.3.1.m1.1a"><mo id="S4.T4.3.3.3.1.m1.1.1" mathsize="80%" xref="S4.T4.3.3.3.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.3.3.3.1.m1.1b"><times id="S4.T4.3.3.3.1.m1.1.1.cmml" xref="S4.T4.3.3.3.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.3.3.3.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.3.3.3.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.4.4.4.2.m1.1"><semantics id="S4.T4.4.4.4.2.m1.1a"><mo id="S4.T4.4.4.4.2.m1.1.1" mathsize="80%" xref="S4.T4.4.4.4.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.4.4.4.2.m1.1b"><times id="S4.T4.4.4.4.2.m1.1.1.cmml" xref="S4.T4.4.4.4.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.4.4.4.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.4.4.4.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.4">0.10</td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.5">0.27</td> <td class="ltx_td ltx_align_center" id="S4.T4.4.4.4.6">0.49</td> </tr> <tr class="ltx_tr" id="S4.T4.6.6.6"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.6.6.6.3">ISNs</th> <td class="ltx_td ltx_align_center" id="S4.T4.5.5.5.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.5.5.5.1.m1.1"><semantics id="S4.T4.5.5.5.1.m1.1a"><mo id="S4.T4.5.5.5.1.m1.1.1" mathsize="80%" xref="S4.T4.5.5.5.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.5.5.5.1.m1.1b"><times id="S4.T4.5.5.5.1.m1.1.1.cmml" xref="S4.T4.5.5.5.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.5.5.5.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.5.5.5.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.6.6.6.2.m1.1"><semantics id="S4.T4.6.6.6.2.m1.1a"><mo id="S4.T4.6.6.6.2.m1.1.1" mathsize="80%" xref="S4.T4.6.6.6.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.6.6.6.2.m1.1b"><times id="S4.T4.6.6.6.2.m1.1.1.cmml" xref="S4.T4.6.6.6.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.6.6.6.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.6.6.6.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.4">0.11</td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.5">0.28</td> <td class="ltx_td ltx_align_center" id="S4.T4.6.6.6.6">0.50</td> </tr> <tr class="ltx_tr" id="S4.T4.8.8.8"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.8.8.8.3">Translocator</th> <td class="ltx_td ltx_align_center" id="S4.T4.7.7.7.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.7.7.7.1.m1.1"><semantics id="S4.T4.7.7.7.1.m1.1a"><mo id="S4.T4.7.7.7.1.m1.1.1" mathsize="80%" xref="S4.T4.7.7.7.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.7.7.7.1.m1.1b"><times id="S4.T4.7.7.7.1.m1.1.1.cmml" xref="S4.T4.7.7.7.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.7.7.7.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.7.7.7.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.8.8.8.2.m1.1"><semantics id="S4.T4.8.8.8.2.m1.1a"><mo id="S4.T4.8.8.8.2.m1.1.1" mathsize="80%" xref="S4.T4.8.8.8.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.8.8.8.2.m1.1b"><times id="S4.T4.8.8.8.2.m1.1.1.cmml" xref="S4.T4.8.8.8.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.8.8.8.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.8.8.8.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.4">0.12</td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.5">0.31</td> <td class="ltx_td ltx_align_center" id="S4.T4.8.8.8.6">0.59</td> </tr> <tr class="ltx_tr" id="S4.T4.10.10.10"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.10.10.10.3">GeoDecoder</th> <td class="ltx_td ltx_align_center" id="S4.T4.9.9.9.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.9.9.9.1.m1.1"><semantics id="S4.T4.9.9.9.1.m1.1a"><mo id="S4.T4.9.9.9.1.m1.1.1" mathsize="80%" xref="S4.T4.9.9.9.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.9.9.9.1.m1.1b"><times id="S4.T4.9.9.9.1.m1.1.1.cmml" xref="S4.T4.9.9.9.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.9.9.9.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.9.9.9.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.10.10.10.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.10.10.10.2.m1.1"><semantics id="S4.T4.10.10.10.2.m1.1a"><mo id="S4.T4.10.10.10.2.m1.1.1" mathsize="80%" xref="S4.T4.10.10.10.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.10.10.10.2.m1.1b"><times id="S4.T4.10.10.10.2.m1.1.1.cmml" xref="S4.T4.10.10.10.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.10.10.10.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.10.10.10.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.10.10.10.4">0.13</td> <td class="ltx_td ltx_align_center" id="S4.T4.10.10.10.5">0.34</td> <td class="ltx_td ltx_align_center" id="S4.T4.10.10.10.6">0.61</td> </tr> <tr class="ltx_tr" id="S4.T4.12.12.12"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.12.12.12.3">GeoCLIP</th> <td class="ltx_td ltx_align_center" id="S4.T4.11.11.11.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.11.11.11.1.m1.1"><semantics id="S4.T4.11.11.11.1.m1.1a"><mo id="S4.T4.11.11.11.1.m1.1.1" mathsize="80%" xref="S4.T4.11.11.11.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.11.11.11.1.m1.1b"><times id="S4.T4.11.11.11.1.m1.1.1.cmml" xref="S4.T4.11.11.11.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.11.11.11.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.11.11.11.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.2"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.12.12.12.2.m1.1"><semantics id="S4.T4.12.12.12.2.m1.1a"><mo id="S4.T4.12.12.12.2.m1.1.1" mathsize="80%" xref="S4.T4.12.12.12.2.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.12.12.12.2.m1.1b"><times id="S4.T4.12.12.12.2.m1.1.1.cmml" xref="S4.T4.12.12.12.2.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.12.12.12.2.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.12.12.12.2.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.4">0.14</td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.5">0.34</td> <td class="ltx_td ltx_align_center" id="S4.T4.12.12.12.6">0.70</td> </tr> <tr class="ltx_tr" id="S4.T4.13.13.13"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t" id="S4.T4.13.13.13.2">ISNs</th> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.13.13.13.1.m1.1"><semantics id="S4.T4.13.13.13.1.m1.1a"><mo id="S4.T4.13.13.13.1.m1.1.1" mathsize="80%" xref="S4.T4.13.13.13.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.13.13.13.1.m1.1b"><times id="S4.T4.13.13.13.1.m1.1.1.cmml" xref="S4.T4.13.13.13.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.13.13.13.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.13.13.13.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.3">✓</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.4">0.10</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.5">0.29</td> <td class="ltx_td ltx_align_center ltx_border_t" id="S4.T4.13.13.13.6">0.59</td> </tr> <tr class="ltx_tr" id="S4.T4.14.14.14"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.14.14.14.2">GeoCLIP</th> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.14.14.14.1.m1.1"><semantics id="S4.T4.14.14.14.1.m1.1a"><mo id="S4.T4.14.14.14.1.m1.1.1" mathsize="80%" xref="S4.T4.14.14.14.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.14.14.14.1.m1.1b"><times id="S4.T4.14.14.14.1.m1.1.1.cmml" xref="S4.T4.14.14.14.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.14.14.14.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.14.14.14.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.3">✓</td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.4">0.12</td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.5">0.38</td> <td class="ltx_td ltx_align_center" id="S4.T4.14.14.14.6">0.83</td> </tr> <tr class="ltx_tr" id="S4.T4.15.15.15"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row" id="S4.T4.15.15.15.2">GeoReasoner</th> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.1"><math alttext="\times" class="ltx_Math" display="inline" id="S4.T4.15.15.15.1.m1.1"><semantics id="S4.T4.15.15.15.1.m1.1a"><mo id="S4.T4.15.15.15.1.m1.1.1" mathsize="80%" xref="S4.T4.15.15.15.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S4.T4.15.15.15.1.m1.1b"><times id="S4.T4.15.15.15.1.m1.1.1.cmml" xref="S4.T4.15.15.15.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S4.T4.15.15.15.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S4.T4.15.15.15.1.m1.1d">×</annotation></semantics></math></td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.3">✓</td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.4">0.09</td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.5">0.35</td> <td class="ltx_td ltx_align_center" id="S4.T4.15.15.15.6">0.74</td> </tr> <tr class="ltx_tr" id="S4.T4.15.15.18.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="S4.T4.15.15.18.3.1">GeoReasoner</th> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.2">✓</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.3">✓</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.4">0.10</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.5">0.38</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="S4.T4.15.15.18.3.6">0.83</td> </tr> </tbody> </table> </figure> </section> <section class="ltx_subsubsection" id="S4.SS2.SSS4"> <h4 class="ltx_title ltx_title_subsubsection"> 4.2.4 Generalizability Evaluation</h4> <div class="ltx_para" id="S4.SS2.SSS4.p1"> To further assess the generalizability of Georeasoner in geo-localization, we conduct additional testing on open Flickr image datasets of Im2GPS <cite class="ltx_cite ltx_citemacro_citep">(Hays & Efros, <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib12" title="">2008</a>)</cite> and Im2GPS3k <cite class="ltx_cite ltx_citemacro_citep">(Vo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib30" title="">2017</a>)</cite>. Here we use only 10k Flickr images for fine-tuning Georeasoner. Since Georeasoner predicts city names rather than GPS coordinates, we first convert the predicted city names generated by Georeasoner into the GPS coordinates of their respective city centers, then measure the distance between these predicted coordinates and ground-truth locations. </div> <div class="ltx_para" id="S4.SS2.SSS4.p2"> Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T3" title="Table 3 ‣ 4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3</a> and Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T4" title="Table 4 ‣ 4.2.3 Ablation Experiments ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4</a> present the performance comparison of Georeasoner with PlaNet <cite class="ltx_cite ltx_citemacro_citep">(Weyand et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib32" title="">2016</a>)</cite>, CPlaNet <cite class="ltx_cite ltx_citemacro_citep">(Seo et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib25" title="">2018</a>)</cite>, ISNs <cite class="ltx_cite ltx_citemacro_citep">(Müller-Budack et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib19" title="">2018</a>)</cite>, Translocator <cite class="ltx_cite ltx_citemacro_citep">(Pramanick et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib20" title="">2022</a>)</cite>, GeoDecoder <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib8" title="">2023</a>)</cite>, and GeoCLIP <cite class="ltx_cite ltx_citemacro_citep">(Vivanco Cepeda et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#bib.bib29" title="">2024</a>)</cite> on Im2GPS and Im2GPS3k datasets, respectively. The results demonstrate that fine-tuning GeoReasoner using highly locatable images significantly improves prediction accuracy for street, city, and country levels (row 8 vs. row 9 in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T3" title="Table 3 ‣ 4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">3</a>, and row 9 vs. row 10 in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.T4" title="Table 4 ‣ 4.2.3 Ablation Experiments ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4</a>). Remarkably, despite being fine-tuned solely on a smaller number of Flickr images, GeoReasoner achieves results comparable to ISNs and GeoCLIP trained on millions of Flickr images, particularly in terms of city- and country-level accuracy. Besides, GeoReasoner trained on the filtered, highly locatable Flickr images also show improvements in the city- and country-level geo-localization, demonstrating the generalizability of our proposed locatability module. </div> </section> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> 5 Discussion</h2> <div class="ltx_para" id="S5.p1"> The significance of high-locatability street-view images. We observe a significant performance improvement when GeoReasoner is trained upon high-locatability street-view images. Such images often contain explicit visual clues such as stylized architecture, traffic signs, and landmarks, providing the model with richer contextual information. Therefore, increasing the quality of the training dataset enhances the model’s geo-localization performance. Additionally, the quantity of high-locatability images is vital, as the model trained with 70K images (as in Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS2.SSS2" title="4.2.2 Quantitative Comparison with SOTA ‣ 4.2 Experiments on Geo-localization with Reasoning ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.2.2</a>) achieves significantly higher accuracy than the one trained with 10K images (Sect. <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S4.SS1.SSS2" title="4.1.2 Quantitative Comparison ‣ 4.1 Experiments on Locatability-Enhanced Dataset ‣ 4 Experiments ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">4.1.2</a>). In balancing the quality and quantity of the training dataset, we empirically applied a threshold of 0.4 to differentiate between highly and less localizable street views. Setting the threshold too high (e.g., 0.7) can lead to a notable decrease in the amount of high-locatability images, whilst a lower threshold (e.g., 0.1) may bring in introduce low-quality images. </div> <div class="ltx_para" id="S5.p2"> The necessity of reasoning process. The introduction of the reasoning component successfully elevated GeoReasoner’s performance in the geo-localization task. This signifies that LVLM can adeptly capture intricate relationships among image features, location clues, and geo-locations in the training process. Implemented an innovative solution to empower the reasoning capability within GeoReasoner by leveraging human inference knowledge extracted from geo-localization games. Despite the relatively small dataset, a noticeable improvement in performance has been achieved. In the future, we plan to expand the reasoning dataset by diversifying the influencing clues. For instance, the current textual clues are absent of landscape information, which could provide invaluable insights for geo-localization. We will collaborate with domain experts such as urban planners and geographers to address these limitations. </div> <div class="ltx_para" id="S5.p3"> Failure cases. GeoReasoner comprehends architectural style as a pivotal factor in geo-localization. However, the model can be misled by the learned significance of architectural style. Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#S5.F8" title="Figure 8 ‣ 5 Discussion ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">8</a> presents a street view of the Eiffel Tower in Paris, France (left), and replicas of the Eiffel Tower in New York, USA (middle) and in Hangzhou, China (right). GeoReasoner fails to distinguish between them, predicting all instances as located in Paris, France. This misclassification is not unique to GeoReasoner but also extends to other LVLMs like GPT-4V. Consequently, it underscores the necessity for LVLM-based methods to delve deeper into knowledge for more sophisticated geo-localization capabilities. Once again, it is imperative to collaborate with domain experts and enhance the visual clues and reasoning procedure comprehensively to tackle this issue. </div> <figure class="ltx_figure" id="S5.F8"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="407" id="S5.F8.1.g1" src="x7.png" width="821"/> <figcaption class="ltx_caption ltx_centering">Figure 8: GeoReasoner fails to distinguish the Eiffel Tower and its replicas in New York, USA, and Hangzhou, China.</figcaption> </figure> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> 6 Conclusion</h2> <div class="ltx_para" id="S6.p1"> In this paper, we present a new paradigm that integrates a large vision-language (LVLM) model with human inference knowledge for street view geo-localization with reasoning. We introduce the concept of locatability and devise a CLIP-based network to quantify the degree of locatability in street-view images, facilitating the selection of high-quality data. We design an LVLM-based model named GeoReasoner, which harnesses external knowledge of human inference from real geo-localization games and curated high-quality data to enhance the performance of geo-localization tasks with reasoning capabilities. The model undergoes two-stage fine-tuning, namely reasoning tuning and location tuning. The reasoning tuning stage aims to acquire potential linkage between coarse-grained geographical locations (i.e., country) and the associated positioning reasons. In location tuning stage, we employ the curated high-quality data to further refine the model in fine-grained geo-localization (i.e., city) learning. Extensive experiments prove that GeoReasoner outperforms previous models qualitatively and quantitatively. </div> </section> <section class="ltx_section" id="Sx1"> <h2 class="ltx_title ltx_title_section">Acknowledgements</h2> <div class="ltx_para" id="Sx1.p1"> We would like to thank Yao Zhou and Wenqi Shao for their insightful discussions and Ziyao Gao for her assistance in drawing the figures in this paper. We also extend our gratitude to the anonymous reviewers for their valuable comments. This work is partially supported by the National Natural Science Foundation of China (62172398, 42171456, 52078343). </div> </section> <section class="ltx_section" id="Sx2"> <h2 class="ltx_title ltx_title_section">Impact Statement</h2> <div class="ltx_para" id="Sx2.p1"> GeoReasoner advances image-based geo-localization technologies that are pivotal for many applications such as autonomous navigation. The pipeline of constructing the dataset featuring high-locatability street views proves highly beneficial across multiple scenarios, such as urban studies, culture studies, and digital humanities, all of which are increasingly reliant on the analysis of high-quality street-view data. </div> <div class="ltx_para" id="Sx2.p2"> The proposed paradigm represents the fusion of LVLM with human inference knowledge, which has implications for the advancement of artificial intelligence (AI) that is more aligned with human cognition. The synergy can lead to the creation of AI that is not only more effective in complex inference tasks but also more understandable and relatable to human users. As AI becomes more pervasive in daily life, the importance of designing systems that are both transparent and capable of complex reasoning cannot be overstated. </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> Achiam et al. (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023. </li> <li class="ltx_bibitem" id="bib.bib2"> Bai et al. (2023a) Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966, 2023a. </li> <li class="ltx_bibitem" id="bib.bib3"> Bai et al. (2023b) Bai, Y., Shang, C., Li, Y., Shen, L., Jin, S., and Shen, Q. Transport Object Detection in Street View Imagery Using Decomposed Convolutional Neural Networks. Mathematics, 11(18):3839, 2023b. </li> <li class="ltx_bibitem" id="bib.bib4"> Campbell et al. (2019) Campbell, A., Both, A., and Sun, Q. C. Detecting and mapping traffic signs from Google Street View images using deep learning and GIS. Computers, Environment and Urban Systems, 77:101350, 2019. </li> <li class="ltx_bibitem" id="bib.bib5"> Chalvatzaras et al. (2022) Chalvatzaras, A., Pratikakis, I., and Amanatiadis, A. A. A Survey on Map-Based Localization Techniques for Autonomous Vehicles. IEEE Transactions on Intelligent Vehicles, 8(2):1574–1596, 2022. </li> <li class="ltx_bibitem" id="bib.bib6"> Cheng et al. (2021) Cheng, B., Schwing, A., and Kirillov, A. Per-Pixel Classification is Not All You Need for Semantic Segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021. </li> <li class="ltx_bibitem" id="bib.bib7"> Cheng et al. (2022) Cheng, W., Wen, R., Huang, H., Miao, W., and Wang, C. OPTDP: Towards optimal personalized trajectory differential privacy for trajectory data publishing. Neurocomputing, 472:201–211, 2022. </li> <li class="ltx_bibitem" id="bib.bib8"> Clark et al. (2023) Clark, B., Kerrigan, A., Kulkarni, P. P., Cepeda, V. V., and Shah, M. Where We Are and What We’re Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23182–23190, 2023. </li> <li class="ltx_bibitem" id="bib.bib9"> Dai et al. (2023) Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., and Wei, F. Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 4005–4019, 2023. </li> <li class="ltx_bibitem" id="bib.bib10"> Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of International Conference on Learning Representations, 2021. </li> <li class="ltx_bibitem" id="bib.bib11"> Haas et al. (2023) Haas, L., Alberti, S., and Skreta, M. Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization. arXiv preprint arXiv:2302.00275, 2023. </li> <li class="ltx_bibitem" id="bib.bib12"> Hays & Efros (2008) Hays, J. and Efros, A. A. IM2GPS: estimating geographic information from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, 2008. </li> <li class="ltx_bibitem" id="bib.bib13"> Hu et al. (2022) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of International Conference on Learning Representations, 2022. </li> <li class="ltx_bibitem" id="bib.bib14"> Kenton & Toutanova (2019) Kenton, J. D. M.-W. C. and Toutanova, L. K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019. </li> <li class="ltx_bibitem" id="bib.bib15"> Li et al. (2021) Li, Y., Wu, C., Li, L., Liu, Y., and Zhu, J. Caption Generation From Road Images for Traffic Scene Modeling. IEEE Transactions on Intelligent Transportation Systems, 23(7):7805–7816, 2021. </li> <li class="ltx_bibitem" id="bib.bib16"> Lin et al. (2022) Lin, J., Zheng, Z., Zhong, Z., Luo, Z., Li, S., Yang, Y., and Sebe, N. Joint Representation Learning and Keypoint Detection for Cross-View Geo-Localization. IEEE Transactions on Image Processing, 31:3780–3792, 2022. </li> <li class="ltx_bibitem" id="bib.bib17"> Liu et al. (2024) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual Instruction Tuning. Advances in Neural Information Processing Systems, 36, 2024. </li> <li class="ltx_bibitem" id="bib.bib18"> Luo et al. (2022) Luo, G., Biamby, G., Darrell, T., Fried, D., and Rohrbach, A. G3: Geolocation via guidebook grounding. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 5841–5853, 2022. </li> <li class="ltx_bibitem" id="bib.bib19"> Müller-Budack et al. (2018) Müller-Budack, E., Pustu-Iren, K., and Ewerth, R. Geolocation Estimation of Photos using a Hierarchical Model and Scene Classification. In Proceedings of the European Conference on Computer Vision, pp. 563–579, 2018. </li> <li class="ltx_bibitem" id="bib.bib20"> Pramanick et al. (2022) Pramanick, S., Nowara, E. M., Gleason, J., Castillo, C. D., and Chellappa, R. Where in the World is this Image? Transformer-based Geo-localization in the Wild. In Proceedings of the European Conference on Computer Vision, pp. 196–215, 2022. </li> <li class="ltx_bibitem" id="bib.bib21"> Qiao et al. (2023) Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F., and Chen, H. Reasoning with Language Model Prompting: A Survey. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 5368–5393, 2023. </li> <li class="ltx_bibitem" id="bib.bib22"> Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021. </li> <li class="ltx_bibitem" id="bib.bib23"> Rao et al. (2023) Rao, J., Shan, Z., Liu, L., Zhou, Y., and Yang, Y. Retrieval-based Knowledge Augmented Vision Language Pre-training. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 5399–5409, 2023. </li> <li class="ltx_bibitem" id="bib.bib24"> Reimers & Gurevych (2019) Reimers, N. and Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 3980–3990, 2019. </li> <li class="ltx_bibitem" id="bib.bib25"> Seo et al. (2018) Seo, P. H., Weyand, T., Sim, J., and Han, B. CPlaNet: Enhancing Image Geolocalization by Combinatorial Partitioning of Maps. In Proceedings of the European Conference on Computer Vision, pp. 536–551, 2018. </li> <li class="ltx_bibitem" id="bib.bib26"> Shao et al. (2023) Shao, Z., Yu, Z., Wang, M., and Yu, J. Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14974–14983, 2023. </li> <li class="ltx_bibitem" id="bib.bib27"> Shen et al. (2018) Shen, Q., Zeng, W., Ye, Y., Mueller Arisona, S., Schubiger, S., Burkhard, R., and Qu, H. StreetVizor: Visual Exploration of Human-Scale Urban Forms Based on Street Views. IEEE Transactions on Visualization and Computer Graphics, 24(1):1004–1013, 2018. </li> <li class="ltx_bibitem" id="bib.bib28"> Theiner et al. (2022) Theiner, J., Müller-Budack, E., and Ewerth, R. Interpretable Semantic Photo Geolocation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 750–760, 2022. </li> <li class="ltx_bibitem" id="bib.bib29"> Vivanco Cepeda et al. (2024) Vivanco Cepeda, V., Nayak, G. K., and Shah, M. GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. Advances in Neural Information Processing Systems, 36, 2024. </li> <li class="ltx_bibitem" id="bib.bib30"> Vo et al. (2017) Vo, N., Jacobs, N., and Hays, J. Revisiting IM2GPS in the Deep Learning Era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2621–2630, 2017. </li> <li class="ltx_bibitem" id="bib.bib31"> Wei et al. (2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022. </li> <li class="ltx_bibitem" id="bib.bib32"> Weyand et al. (2016) Weyand, T., Kostrikov, I., and Philbin, J. PlaNet - Photo Geolocation with Convolutional Neural Networks. In Proceedings of the European Conference on Computer Vision, pp. 37–55, 2016. </li> <li class="ltx_bibitem" id="bib.bib33"> Xu et al. (2023) Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., and Luo, P. LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models. arXiv preprint arXiv:2306.09265, 2023. </li> <li class="ltx_bibitem" id="bib.bib34"> Yao et al. (2024) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems, 36, 2024. </li> <li class="ltx_bibitem" id="bib.bib35"> Ye et al. (2019a) Ye, Y., Richards, D., Lu, Y., Song, X., Zhuang, Y., Zeng, W., and Zhong, T. Measuring daily accessed street greenery: A human-scale approach for informing better urban planning practices. Landscape and Urban Planning, 191:103434, 2019a. </li> <li class="ltx_bibitem" id="bib.bib36"> Ye et al. (2019b) Ye, Y., Zeng, W., Shen, Q., Zhang, X., and Lu, Y. The visual quality of streets: A human-centred continuous measurement based on machine learning algorithms and street view images. Environment and Planning B: Urban Analytics and City Science, 46(8):1439–1457, 2019b. </li> <li class="ltx_bibitem" id="bib.bib37"> Ying et al. (2024) Ying, K., Meng, F., Wang, J., Li, Z., Lin, H., Yang, Y., Zhang, H., Zhang, W., Lin, Y., Liu, S., et al. MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. arXiv preprint arXiv:2404.16006, 2024. </li> <li class="ltx_bibitem" id="bib.bib38"> Zhang et al. (2018) Zhang, F., Zhang, D., Liu, Y., and Lin, H. Representing place locales using scene elements. Computers, Environment and Urban Systems, 71:153–164, 2018. </li> <li class="ltx_bibitem" id="bib.bib39"> Zhang et al. (2023a) Zhang, H., Song, H., Li, S., Zhou, M., and Song, D. A Survey of Controllable Text Generation Using Transformer-based Pre-trained Language Models. ACM Computing Surveys, 56(3):1–37, 2023a. </li> <li class="ltx_bibitem" id="bib.bib40"> Zhang et al. (2023b) Zhang, X., Li, X., Sultani, W., Zhou, Y., and Wshah, S. Cross-View Geo-Localization via Learning Disentangled Geometric Layout Correspondence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 3480–3488, 2023b. </li> <li class="ltx_bibitem" id="bib.bib41"> Zhu et al. (2022) Zhu, S., Shah, M., and Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1162–1171, 2022. </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <section class="ltx_appendix" id="A1"> <h2 class="ltx_title ltx_title_appendix"> Appendix A Implementation Details</h2> <div class="ltx_para" id="A1.p1"> Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A1.T5" title="Table 5 ‣ Appendix A Implementation Details ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">5</a> and Table <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A1.T6" title="Table 6 ‣ Appendix A Implementation Details ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">6</a> present the hyper-parameter settings and training details for the models. We conducted training and testing on Nvidia A800 (80G), with CUDA 12.1, PyTorch 2.0.0, and Transformers 4.33.0. </div> <figure class="ltx_table" id="A1.T5"> <figcaption class="ltx_caption" style="font-size:80%;">Table 5: The hyper-parameter settings of the proposed GeoReasoner.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A1.T5.6"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A1.T5.6.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_tt" id="A1.T5.6.1.1.1">Hyper Params</th> <td class="ltx_td ltx_align_center ltx_border_tt" id="A1.T5.6.1.1.2">Value</td> </tr> <tr class="ltx_tr" id="A1.T5.6.2.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t" id="A1.T5.6.2.2.1">Learning Rate</th> <td class="ltx_td ltx_align_center ltx_border_t" id="A1.T5.6.2.2.2">1e-5</td> </tr> <tr class="ltx_tr" id="A1.T5.6.3.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.3.3.1">Total Batch Size</th> <td class="ltx_td ltx_align_center" id="A1.T5.6.3.3.2">64</td> </tr> <tr class="ltx_tr" id="A1.T5.6.4.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.4.4.1">Weight Decay</th> <td class="ltx_td ltx_align_center" id="A1.T5.6.4.4.2">0.1</td> </tr> <tr class="ltx_tr" id="A1.T5.6.5.5"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.5.5.1">Warmup Ratio</th> <td class="ltx_td ltx_align_center" id="A1.T5.6.5.5.2">0.01</td> </tr> <tr class="ltx_tr" id="A1.T5.6.6.6"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.6.6.1">Optimizer</th> <td class="ltx_td ltx_align_center" id="A1.T5.6.6.6.2">AdamW</td> </tr> <tr class="ltx_tr" id="A1.T5.6.7.7"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.7.7.1">Adam Beta1</th> <td class="ltx_td ltx_align_center" id="A1.T5.6.7.7.2">0.9</td> </tr> <tr class="ltx_tr" id="A1.T5.6.8.8"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.8.8.1">Adam Beta2</th> <td class="ltx_td ltx_align_center" id="A1.T5.6.8.8.2">0.95</td> </tr> <tr class="ltx_tr" id="A1.T5.6.9.9"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row" id="A1.T5.6.9.9.1">LR Scheduler</th> <td class="ltx_td ltx_align_center" id="A1.T5.6.9.9.2">cosine</td> </tr> <tr class="ltx_tr" id="A1.T5.6.10.10"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb" id="A1.T5.6.10.10.1">Model Max Length</th> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T5.6.10.10.2">2048</td> </tr> </tbody> </table> </figure> <figure class="ltx_table" id="A1.T6"> <figcaption class="ltx_caption" style="font-size:80%;">Table 6: The training details of the proposed GeoReasoner.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A1.T6.6"> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A1.T6.6.1.1"> <th class="ltx_td ltx_th ltx_th_row ltx_border_tt" id="A1.T6.6.1.1.1"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A1.T6.6.1.1.2" rowspan="2">Training Speed</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A1.T6.6.1.1.3" rowspan="2">Inference Latency</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="2" id="A1.T6.6.1.1.4">Num of Params</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A1.T6.6.1.1.5" rowspan="2">Flops</th> </tr> <tr class="ltx_tr" id="A1.T6.6.2.2"> <th class="ltx_td ltx_th ltx_th_row" id="A1.T6.6.2.2.1"></th> <td class="ltx_td ltx_align_center" id="A1.T6.6.2.2.2">Base Model</td> <td class="ltx_td ltx_align_center" id="A1.T6.6.2.2.3">LoRA</td> </tr> <tr class="ltx_tr" id="A1.T6.6.3.3"> <th class="ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_t" id="A1.T6.6.3.3.1">LoRA1 (reason)</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.2">0.41 sample/s</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.3">1.560s</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.4">9.6B</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.5">112.19M</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A1.T6.6.3.3.6">71.9B</th> </tr> <tr class="ltx_tr" id="A1.T6.6.4.4"> <th class="ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb" id="A1.T6.6.4.4.1">LoRA2 (location)</th> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.2">0.63 sample/s</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.3">0.894s</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.4">9.6B</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.5">112.19M</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A1.T6.6.4.4.6">71.9B</td> </tr> </tbody> </table> </figure> </section> <section class="ltx_appendix" id="A2"> <h2 class="ltx_title ltx_title_appendix"> Appendix B Additional Qualitative Results</h2> <div class="ltx_para" id="A2.p1"> Additionally, we present the results of the GeoReasoner on alternative street-view images, depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.18572v2#A2.F9" title="Figure 9 ‣ Appendix B Additional Qualitative Results ‣ GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model">9</a>. Each street view image is annotated with the ground truth geographic location, along with the inference results from GeoReasoner. It can provide geographical predictions accompanied by reasonable explanations. </div> <figure class="ltx_figure" id="A2.F9"> <img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="399" id="A2.F9.1.1.1.g1" src="x8.png" width="822"/> <figcaption class="ltx_caption ltx_centering">Figure 9: Additional qualitative results from the proposed GeoReasoner.</figcaption> </figure> <div class="ltx_pagination ltx_role_newpage"></div> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Thu Oct 17 03:24:45 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/">LaTeXML<img alt="Mascot Sammy" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>