arXiv:2405.10255
cs.CV
cs.RO
When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
Authors: Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu
Abstract: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed:
Submitted 16 May, 2024; originally announced May 2024.
arXiv:1709.03917
Image Matching: An Application-oriented Benchmark
Authors: JiaWang Bian, Le Zhang, Yun Liu, Wen-Yan Lin, Ming-Ming Cheng, Ian D. Reid
Abstract: Image matching approaches have been widely used in computer vision applications in which the image-level matching performance of matchers is critical. However, it has not been well investigated by previous works which place more emphases on evaluating local features. To this end, we present a uniform benchmark with novel evaluation metrics and a large-scale dataset for evaluating the overall performance of image matching methods. The proposed metrics are application-oriented as they emphasize application requirements for matchers. The dataset contains two portions for benchmarking video frame matching and unordered image matching separately, where each portion consists of real-world image sequences and each sequence has a specific attribute. Subsequently, we carry out a comprehensive performance evaluation of different state-of-the-art methods and conduct in-depth analyses regarding various aspects such as application requirements, matching types, and data diversity. Moreover, we shed light on how to choose appropriate approaches for different applications based on empirical results and analyses. Conclusions in this benchmark can be used as general guidelines to design practical matching systems and also advocate potential future research directions in this field.
Submitted 7 August, 2018; v1 submitted 12 September, 2017; originally announced September 2017.
Comments: We made a significant change and re-submitted as "MatchBench: An Evaluation of Feature Matchers"
arXiv:1509.04232
Abstract: We introduce a parallel GPU implementation of the Simple Linear Iterative Clustering (SLIC) superpixel segmentation. Using a single graphic card, our implementation achieves speedups of up to $83\times$ from the standard sequential implementation. Our implementation is fully compatible with the standard sequential implementation and the software is now available online and is open source.
Submitted 14 September, 2015; originally announced September 2015.
arXiv:1410.0925
Authors: … S. Torr, Ian D. Reid, David W. Murray
Abstract: Volumetric models have become a popular representation for 3D scenes in recent years. One of the breakthroughs leading to their popularity was KinectFusion, where the focus is on 3D reconstruction using RGB-D sensors. However, monocular SLAM has since also been tackled with very similar approaches. Representing the reconstruction volumetrically as a truncated signed distance function leads to most of the simplicity and efficiency that can be achieved with GPU implementations of these systems. However, this representation is also memory-intensive and limits the applicability to small scale reconstructions. Several avenues have been explored for overcoming this limitation. With the aim of summarizing them and providing for a fast and flexible 3D reconstruction pipeline, we propose a new, unifying framework called InfiniTAM. The core idea is that individual steps like camera tracking, scene representation and integration of new data can easily be replaced and adapted to the needs of the user. Along with the framework we also provide a set of components for scalable reconstruction: two implementations of camera trackers, based on RGB data and on depth data, two representations of the 3D volumetric data, a dense volume and one based on hashes of subblocks, and an optional module for swapping subblocks in and out of the typically limited GPU memory.
Submitted 23 October, 2014; v1 submitted 3 October, 2014; originally announced October 2014.
Comments: 17 pages, 8 figures 