class="breathe-horizontal" start="1"> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.12785</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Wang%2C+S">Siyuan Wang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+D">Dianyi Wang</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+C">Chengxing Zhou</a>, <a href="/search/cs?searchtype=author&query=Li%2C+Z">Zejun Li</a>, <a href="/search/cs?searchtype=author&query=Fan%2C+Z">Zhihao Fan</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xuanjing Huang</a>, <a href="/search/cs?searchtype=author&query=Wei%2C+Z">Zhongyu Wei</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.12785v1-abstract-short" style="display: inline;"> Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. 
Drawing inspiration from the concept of a visual region in the human brain, we investigate the existence of an analogous <em>visual region</em> within LLMs that functions as a cognitive core, and explore the possibility of efficient tr… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12785v1-abstract-full').style.display = 'inline'; document.getElementById('2412.12785v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.12785v1-abstract-full" style="display: none;"> Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. Drawing inspiration from the concept of a visual region in the human brain, we investigate the existence of an analogous <em>visual region</em> within LLMs that functions as a cognitive core, and explore the possibility of efficient training of LVLMs via selective layer tuning. We use Bunny-Llama-3-8B-V for detailed experiments and LLaVA-1.5-7B and LLaVA-1.5-13B for validation across a range of visual and textual tasks. Our findings reveal that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, can preserve nearly 99% of visual performance while maintaining or enhancing textual task results and reducing training time. Based on this targeted training approach, we further propose a novel visual region-based pruning paradigm that removes non-critical layers outside the visual region with minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which is consistently effective across different models and parameter scales. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12785v1-abstract-full').style.display = 'none'; document.getElementById('2412.12785v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.12468</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Transferable and Forecastable User Targeting Foundation Model </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Dou%2C+B">Bin Dou</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+B">Baokun Wang</a>, <a href="/search/cs?searchtype=author&query=Zhu%2C+Y">Yun Zhu</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+X">Xiaotong Lin</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+Y">Yike Xu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaorui Huang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Y">Yang Chen</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Y">Yun Liu</a>, <a href="/search/cs?searchtype=author&query=Han%2C+S">Shaoshuai Han</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Y">Yongchao Liu</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+T">Tianyi Zhang</a>, <a href="/search/cs?searchtype=author&query=Cheng%2C+Y">Yu Cheng</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+W">Weiqiang Wang</a>, <a 
href="/search/cs?searchtype=author&query=Hong%2C+C">Chuntao Hong</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.12468v1-abstract-short" style="display: inline;"> User targeting, the process of selecting targeted users from a pool of candidates for non-expert marketers, has garnered substantial attention with the advancements in digital marketing. However, existing user targeting methods encounter two significant challenges: (i) Poor cross-domain and cross-scenario transferability and generalization, and (ii) Insufficient forecastability in real-world appli… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12468v1-abstract-full').style.display = 'inline'; document.getElementById('2412.12468v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.12468v1-abstract-full" style="display: none;"> User targeting, the process of selecting targeted users from a pool of candidates for non-expert marketers, has garnered substantial attention with the advancements in digital marketing. However, existing user targeting methods encounter two significant challenges: (i) Poor cross-domain and cross-scenario transferability and generalization, and (ii) Insufficient forecastability in real-world applications. These limitations hinder their applicability across diverse industrial scenarios. In this work, we propose FIND, an industrial-grade, transferable, and forecastable user targeting foundation model. To enhance cross-domain transferability, our framework integrates heterogeneous multi-scenario user data, aligning them with one-sentence targeting demand inputs through contrastive pre-training. 
For improved forecastability, the text description of each user is derived based on anticipated future behaviors, while user representations are constructed from historical information. Experimental results demonstrate that our approach significantly outperforms existing baselines in cross-domain, real-world user targeting scenarios, showcasing the superior capabilities of FIND. Moreover, our method has been successfully deployed on the Alipay platform and is widely utilized across various scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12468v1-abstract-full').style.display = 'none'; document.getElementById('2412.12468v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">9 pages, 4 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.12157</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> What Makes In-context Learning Effective for Mathematical Reasoning: A Theoretical Analysis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jiayu Liu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+Z">Zhenya Huang</a>, <a 
href="/search/cs?searchtype=author&query=Wang%2C+C">Chaokun Wang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xunpeng Huang</a>, <a href="/search/cs?searchtype=author&query=Zhai%2C+C">Chengxiang Zhai</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+E">Enhong Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.12157v1-abstract-short" style="display: inline;"> Owing to the capability of in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes bring negative performance and their effectiveness on LLMs' reasoning abilities remains unreliable. To this end, in this paper, we aim to theoretically analyze the impact of in-co… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12157v1-abstract-full').style.display = 'inline'; document.getElementById('2412.12157v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.12157v1-abstract-full" style="display: none;"> Owing to the capability of in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes bring negative performance and their effectiveness on LLMs' reasoning abilities remains unreliable. To this end, in this paper, we aim to theoretically analyze the impact of in-context demonstrations on LLMs' reasoning performance. We prove that the reasoning efficacy (measured by empirical prediction loss) can be bounded by a LLM-oriented semantic similarity and an inference stability of demonstrations, which is general for both one-shot and few-shot scenarios. 
Based on this finding, we propose a straightforward, generalizable, and low-complexity demonstration selection method named LMS3. It adaptively selects the most pertinent samples for different LLMs and includes a novel demonstration rejection mechanism to automatically filter out samples that are unsuitable for few-shot learning. Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that LMS3 is superior and achieves consistent improvements on all datasets, which existing methods have been unable to accomplish. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12157v1-abstract-full').style.display = 'none'; document.getElementById('2412.12157v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.12129</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> SceneDiffuser: Efficient and Controllable Driving Simulation Initialization and Rollout </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Jiang%2C+C+M">Chiyu Max Jiang</a>, <a href="/search/cs?searchtype=author&query=Bai%2C+Y">Yijing Bai</a>, <a href="/search/cs?searchtype=author&query=Cornman%2C+A">Andre Cornman</a>, <a href="/search/cs?searchtype=author&query=Davis%2C+C">Christopher Davis</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiukun Huang</a>, <a href="/search/cs?searchtype=author&query=Jeon%2C+H">Hong Jeon</a>, <a href="/search/cs?searchtype=author&query=Kulshrestha%2C+S">Sakshum Kulshrestha</a>, <a href="/search/cs?searchtype=author&query=Lambert%2C+J">John Lambert</a>, <a href="/search/cs?searchtype=author&query=Li%2C+S">Shuangyu Li</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+X">Xuanyu Zhou</a>, <a href="/search/cs?searchtype=author&query=Fuertes%2C+C">Carlos Fuertes</a>, <a href="/search/cs?searchtype=author&query=Yuan%2C+C">Chang Yuan</a>, <a href="/search/cs?searchtype=author&query=Tan%2C+M">Mingxing Tan</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+Y">Yin Zhou</a>, <a href="/search/cs?searchtype=author&query=Anguelov%2C+D">Dragomir Anguelov</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.12129v1-abstract-short" style="display: inline;"> Realistic and interactive scene simulation is a key prerequisite for autonomous vehicle (AV) development. In this work, we present SceneDiffuser, a scene-level diffusion prior designed for traffic simulation. It offers a unified framework that addresses two key stages of simulation: scene initialization, which involves generating initial traffic layouts, and scene rollout, which encompasses the cl… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12129v1-abstract-full').style.display = 'inline'; document.getElementById('2412.12129v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.12129v1-abstract-full" style="display: none;"> Realistic and interactive scene simulation is a key prerequisite for autonomous vehicle (AV) development. In this work, we present SceneDiffuser, a scene-level diffusion prior designed for traffic simulation. It offers a unified framework that addresses two key stages of simulation: scene initialization, which involves generating initial traffic layouts, and scene rollout, which encompasses the closed-loop simulation of agent behaviors. While diffusion models have proven effective in learning realistic and multimodal agent distributions, several challenges remain, including controllability, maintaining realism in closed-loop simulations, and ensuring inference efficiency. To address these issues, we introduce amortized diffusion for simulation. This novel diffusion denoising paradigm amortizes the computational cost of denoising over future simulation steps, significantly reducing the cost per rollout step (16x fewer inference steps) while also mitigating closed-loop errors. 
We further enhance controllability through the introduction of generalized hard constraints, a simple yet effective inference-time constraint mechanism, as well as language-based constrained scene generation via few-shot prompting of a large language model (LLM). Our investigations into model scaling reveal that increased computational resources significantly improve overall simulation realism. We demonstrate the effectiveness of our approach on the Waymo Open Sim Agents Challenge, achieving top open-loop performance and the best closed-loop performance among diffusion models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.12129v1-abstract-full').style.display = 'none'; document.getElementById('2412.12129v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to NeurIPS 2024</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">MSC Class:</span> 68T07 <span class="has-text-black-bis has-text-weight-semibold">ACM Class:</span> I.2.6 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.11812</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> CLDA-YOLO: Visual Contrastive Learning Based Domain Adaptive YOLO Detector </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Qiu%2C+T">Tianheng Qiu</a>, <a href="/search/cs?searchtype=author&query=Law%2C+K+L">Ka Lung Law</a>, <a href="/search/cs?searchtype=author&query=Pan%2C+G">Guanghua Pan</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+J">Jufei Wang</a>, <a href="/search/cs?searchtype=author&query=Gao%2C+X">Xin Gao</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xuan Huang</a>, <a href="/search/cs?searchtype=author&query=Wei%2C+H">Hu Wei</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.11812v1-abstract-short" style="display: inline;"> Unsupervised domain adaptive (UDA) algorithms can markedly enhance the performance of object detectors under conditions of domain shifts, thereby reducing the necessity for extensive labeling and retraining. 
Current domain adaptive object detection algorithms primarily cater to two-stage detectors, which tend to offer minimal improvements when directly applied to single-stage detectors such as YOL… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.11812v1-abstract-full').style.display = 'inline'; document.getElementById('2412.11812v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.11812v1-abstract-full" style="display: none;"> Unsupervised domain adaptive (UDA) algorithms can markedly enhance the performance of object detectors under conditions of domain shifts, thereby reducing the necessity for extensive labeling and retraining. Current domain adaptive object detection algorithms primarily cater to two-stage detectors, which tend to offer minimal improvements when directly applied to single-stage detectors such as YOLO. To bring the benefits of UDA to the YOLO detector, we build a comprehensive domain adaptive architecture using a teacher-student cooperative system for the YOLO detector. In this process, we propose uncertainty learning to cope with highly uncertain pseudo-labels generated by the teacher model and leverage dynamic data augmentation to asymptotically adapt the teacher-student system to the environment. To address the inability of single-stage object detectors to align at multiple stages, we utilize a unified visual contrastive learning paradigm that aligns instances at the backbone and the head, respectively, which steadily improves the robustness of the detectors in cross-domain tasks. In summary, we present an unsupervised domain adaptive YOLO detector based on visual contrastive learning (CLDA-YOLO), which achieves highly competitive results across multiple domain adaptive datasets without any reduction in inference speed. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.11812v1-abstract-full').style.display = 'none'; document.getElementById('2412.11812v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.11771</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Point Cloud-Assisted Neural Image Compression </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Li%2C+Z">Ziqun Li</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Q">Qi Zhang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaofeng Huang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Z">Zhao Wang</a>, <a href="/search/cs?searchtype=author&query=Ma%2C+S">Siwei Ma</a>, <a href="/search/cs?searchtype=author&query=Yan%2C+W">Wei Yan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.11771v1-abstract-short" style="display: inline;"> High-efficient image compression is a critical requirement. 
In several scenarios where multiple modalities of data are captured by different sensors, the auxiliary information from other modalities is not fully leveraged by existing image-only codecs, leading to suboptimal compression efficiency. In this paper, we increase image compression performance with the assistance of point cloud, which is… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.11771v1-abstract-full').style.display = 'inline'; document.getElementById('2412.11771v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.11771v1-abstract-full" style="display: none;"> High-efficient image compression is a critical requirement. In several scenarios where multiple modalities of data are captured by different sensors, the auxiliary information from other modalities is not fully leveraged by existing image-only codecs, leading to suboptimal compression efficiency. In this paper, we increase image compression performance with the assistance of point cloud, which is widely adopted in the area of autonomous driving. We first unify the data representation for both modalities to facilitate data processing. Then, we propose the point cloud-assisted neural image codec (PCA-NIC) to enhance the preservation of image texture and structure by utilizing the high-dimensional point cloud information. We further introduce a multi-modal feature fusion transform module (MMFFT) to capture more representative image features and to remove redundant information between channels and modalities that is not relevant to the image content. Our work is the first to improve image compression performance using point cloud and achieves state-of-the-art performance. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.11771v1-abstract-full').style.display = 'none'; document.getElementById('2412.11771v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.11364</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Individual Bus Trip Chain Prediction and Pattern Identification Considering Similarities </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiannan Huang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Y">Yixin Chen</a>, <a href="/search/cs?searchtype=author&query=Yuan%2C+Q">Quan Yuan</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+C">Chao Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.11364v1-abstract-short" style="display: inline;"> Predicting future bus trip chains for an existing user is of great significance for operators of public transit systems. Existing methods always treat this task as a time-series prediction problem, but the 1-dimensional time series structure cannot express the complex relationship between trips. 
To better capture the inherent patterns in bus travel behavior, this paper proposes a novel approach th… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.11364v1-abstract-full').style.display = 'inline'; document.getElementById('2412.11364v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.11364v1-abstract-full" style="display: none;"> Predicting future bus trip chains for an existing user is of great significance for operators of public transit systems. Existing methods always treat this task as a time-series prediction problem, but the 1-dimensional time series structure cannot express the complex relationship between trips. To better capture the inherent patterns in bus travel behavior, this paper proposes a novel approach that synthesizes future bus trip chains based on those from similar days. Key similarity patterns are defined and tested using real-world data, and a similarity function is then developed to capture these patterns. Afterwards, a graph is constructed where each day is represented as a node and each edge weight reflects the similarity between days. The trips on a given day can then be regarded as labels for each node, transforming the bus trip chain prediction problem into a semi-supervised classification problem on a graph. To address this, we propose several methods and validate them on a real-world dataset of 10,000 bus users, achieving state-of-the-art prediction results. Analyzing the parameters of the similarity function reveals some interesting bus usage patterns, allowing us to cluster bus users into three types: repeat-dominated, evolve-dominated, and repeat-evolve balanced. In summary, our work demonstrates the effectiveness of similarity-based prediction for bus trip chains and provides a new perspective for analyzing individual bus travel patterns. The code for our prediction model is publicly available. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.11364v1-abstract-full').style.display = 'none'; document.getElementById('2412.11364v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.09723</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Xu%2C+X">Xiaohao Xu</a>, <a href="/search/cs?searchtype=author&query=Xue%2C+F">Feng Xue</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+S">Shibo Zhao</a>, <a href="/search/cs?searchtype=author&query=Pan%2C+Y">Yike Pan</a>, <a href="/search/cs?searchtype=author&query=Scherer%2C+S">Sebastian Scherer</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaonan Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.09723v1-abstract-short" style="display: inline;"> Real-time multi-agent collaboration for ego-motion estimation and high-fidelity 3D reconstruction is vital for scalable spatial intelligence. However, traditional methods produce sparse, low-detail maps, while recent dense mapping approaches struggle with high latency. 
To overcome these challenges, we present MAC-Ego3D, a novel framework for real-time collaborative photorealistic 3D reconstruction… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09723v1-abstract-full').style.display = 'inline'; document.getElementById('2412.09723v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.09723v1-abstract-full" style="display: none;"> Real-time multi-agent collaboration for ego-motion estimation and high-fidelity 3D reconstruction is vital for scalable spatial intelligence. However, traditional methods produce sparse, low-detail maps, while recent dense mapping approaches struggle with high latency. To overcome these challenges, we present MAC-Ego3D, a novel framework for real-time collaborative photorealistic 3D reconstruction via Multi-Agent Gaussian Consensus. MAC-Ego3D enables agents to independently construct, align, and iteratively refine local maps using a unified Gaussian splat representation. Through Intra-Agent Gaussian Consensus, it enforces spatial coherence among neighboring Gaussian splats within an agent. For global alignment, parallelized Inter-Agent Gaussian Consensus, which asynchronously aligns and optimizes local maps by regularizing multi-agent Gaussian splats, seamlessly integrates them into a high-fidelity 3D model. Leveraging Gaussian primitives, MAC-Ego3D supports efficient RGB-D rendering, enabling rapid inter-agent Gaussian association and alignment. MAC-Ego3D bridges local precision and global coherence, delivering higher efficiency, largely reducing localization error, and improving mapping fidelity. It establishes a new SOTA on synthetic and real-world benchmarks, achieving a 15x increase in inference speed, order-of-magnitude reductions in ego-motion estimation error for partial cases, and RGB PSNR gains of 4 to 10 dB. Our code will be made publicly available at . 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09723v1-abstract-full').style.display = 'none'; document.getElementById('2412.09723v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">27 pages, 25 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.09619</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Hu%2C+D">Dongting Hu</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+J">Jierun Chen</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xijie Huang</a>, <a href="/search/cs?searchtype=author&query=Coskun%2C+H">Huseyin Coskun</a>, <a href="/search/cs?searchtype=author&query=Sahni%2C+A">Arpit Sahni</a>, <a href="/search/cs?searchtype=author&query=Gupta%2C+A">Aarush Gupta</a>, <a href="/search/cs?searchtype=author&query=Goyal%2C+A">Anujraaj Goyal</a>, <a href="/search/cs?searchtype=author&query=Lahiri%2C+D">Dishani Lahiri</a>, <a href="/search/cs?searchtype=author&query=Singh%2C+R">Rajesh Singh</a>, <a href="/search/cs?searchtype=author&query=Idelbayev%2C+Y">Yerlan Idelbayev</a>, <a 
href="/search/cs?searchtype=author&query=Cao%2C+J">Junli Cao</a>, <a href="/search/cs?searchtype=author&query=Li%2C+Y">Yanyu Li</a>, <a href="/search/cs?searchtype=author&query=Cheng%2C+K">Kwang-Ting Cheng</a>, <a href="/search/cs?searchtype=author&query=Chan%2C+S+-+G">S. -H. Gary Chan</a>, <a href="/search/cs?searchtype=author&query=Gong%2C+M">Mingming Gong</a>, <a href="/search/cs?searchtype=author&query=Tulyakov%2C+S">Sergey Tulyakov</a>, <a href="/search/cs?searchtype=author&query=Kag%2C+A">Anil Kag</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+Y">Yanwu Xu</a>, <a href="/search/cs?searchtype=author&query=Ren%2C+J">Jian Ren</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.09619v1-abstract-short" style="display: inline;"> Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, w… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09619v1-abstract-full').style.display = 'inline'; document.getElementById('2412.09619v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.09619v1-abstract-full" style="display: none;"> Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. 
We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL). <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09619v1-abstract-full').style.display = 'none'; document.getElementById('2412.09619v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.09593</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=He%2C+Z">Zexin He</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+T">Tengfei Wang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xin Huang</a>, <a href="/search/cs?searchtype=author&query=Pan%2C+X">Xingang Pan</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Z">Ziwei Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.09593v1-abstract-short" style="display: inline;"> Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. 
Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-l… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09593v1-abstract-full').style.display = 'inline'; document.getElementById('2412.09593v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.09593v1-abstract-full" style="display: none;"> Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Code and dataset are available on our project page at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09593v1-abstract-full').style.display = 'none'; document.getElementById('2412.09593v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.09278</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaoshuang Huang</a>, <a href="/search/cs?searchtype=author&query=Shen%2C+L">Lingdong Shen</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jia Liu</a>, <a href="/search/cs?searchtype=author&query=Shang%2C+F">Fangxin Shang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+H">Hongxiang Li</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+H">Haifeng Huang</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+Y">Yehui Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.09278v1-abstract-short" style="display: inline;"> In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. 
In this paper, we introd… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09278v1-abstract-full').style.display = 'inline'; document.getElementById('2412.09278v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.09278v1-abstract-full" style="display: none;"> In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. 
More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.09278v1-abstract-full').style.display = 'none'; document.getElementById('2412.09278v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by AAAI2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.08841</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Structural Entropy Guided Probabilistic Coding </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiang Huang</a>, <a href="/search/cs?searchtype=author&query=Peng%2C+H">Hao Peng</a>, <a href="/search/cs?searchtype=author&query=Sun%2C+L">Li Sun</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+H">Hui Lin</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+C">Chunyang Liu</a>, <a href="/search/cs?searchtype=author&query=Cao%2C+J">Jiang Cao</a>, <a href="/search/cs?searchtype=author&query=Yu%2C+P+S">Philip S. 
Yu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.08841v2-abstract-short" style="display: inline;"> Probabilistic embeddings have several advantages over deterministic embeddings as they map each data point to a distribution, which better describes the uncertainty and complexity of data. Many works focus on adjusting the distribution constraint under the Information Bottleneck (IB) principle to enhance representation learning. However, these proposed regularization terms only consider the constr… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.08841v2-abstract-full').style.display = 'inline'; document.getElementById('2412.08841v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.08841v2-abstract-full" style="display: none;"> Probabilistic embeddings have several advantages over deterministic embeddings as they map each data point to a distribution, which better describes the uncertainty and complexity of data. Many works focus on adjusting the distribution constraint under the Information Bottleneck (IB) principle to enhance representation learning. However, these proposed regularization terms only consider the constraint of each latent variable, omitting the structural information between latent variables. In this paper, we propose a novel structural entropy-guided probabilistic coding model, named SEPC. Specifically, we incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss. Besides, as traditional structural information theory is not well-suited for regression tasks, we propose a probabilistic encoding tree, transferring regression tasks to classification tasks while diminishing the influence of the transformation. 
Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC compared to other state-of-the-art models in terms of effectiveness, generalization capability, and robustness to label noise. The codes and datasets are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.08841v2-abstract-full').style.display = 'none'; document.getElementById('2412.08841v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 11 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">This paper is accepted by AAAI 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.08109</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhang%2C+Y">Yuanliang Zhang</a>, <a href="/search/cs?searchtype=author&query=Xie%2C+Y">Yifan Xie</a>, <a href="/search/cs?searchtype=author&query=Li%2C+S">Shanshan Li</a>, <a 
href="/search/cs?searchtype=author&query=Liu%2C+K">Ke Liu</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+C">Chong Wang</a>, <a href="/search/cs?searchtype=author&query=Jia%2C+Z">Zhouyang Jia</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiangbing Huang</a>, <a href="/search/cs?searchtype=author&query=Song%2C+J">Jie Song</a>, <a href="/search/cs?searchtype=author&query=Luo%2C+C">Chaopeng Luo</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+Z">Zhizheng Zheng</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+R">Rulin Xu</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Y">Yitong Liu</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+S">Si Zheng</a>, <a href="/search/cs?searchtype=author&query=Liao%2C+X">Xiangke Liao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.08109v1-abstract-short" style="display: inline;"> Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. Accurately assessing the code generation capabilities of large language models has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to eva… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.08109v1-abstract-full').style.display = 'inline'; document.getElementById('2412.08109v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.08109v1-abstract-full" style="display: none;"> Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, there are still gaps before they can be fully applied in actual software development processes. 
Accurately assessing the code generation capabilities of large language models has become an important basis for evaluating and improving the models. Some existing works have constructed datasets to evaluate the capabilities of these models. However, the current evaluation process may encounter the illusion of "Specialist in Familiarity", primarily due to three gaps: the exposure of target code, case timeliness, and dependency availability. The fundamental reason for these gaps is that the code in current datasets may have been extensively exposed and exercised during the training phase, and due to the continuous training and development of LLMs, their timeliness has been severely compromised. The key to solving the problem is to evaluate the LLMs, as much as possible, using code that they have not encountered before. Thus, the fundamental idea in this paper is to draw on the concept of code obfuscation, changing code at different levels while preserving its functionality and output. To this end, we build a code-obfuscation-based benchmark, OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, including function descriptions and code. Then we use a three-level strategy (symbol, structure, and semantic) to obfuscate descriptions, code, and context dependencies. We evaluate four LLMs on OBFUSEVAL and compare the effectiveness of different obfuscation strategies. We use the official test suites of these projects to evaluate the generated code. The results show that after obfuscation, the average decrease ratio of test pass rate can reach up to 62.5%. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.08109v1-abstract-full').style.display = 'none'; document.getElementById('2412.08109v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Large Language Model,Code Generation Capability,Code Dataset</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.07958</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> PAFFA: Premeditated Actions For Fast Agents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Krishna%2C+S">Shambhavi Krishna</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Z">Zheng Chen</a>, <a href="/search/cs?searchtype=author&query=Kumar%2C+V">Vaibhav Kumar</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaojiang Huang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+Y">Yingjie Li</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+F">Fan Yang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+X">Xiang Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.07958v1-abstract-short" style="display: inline;"> Modern AI assistants have made significant progress in natural language understanding and 
API/tool integration, with emerging efforts to incorporate diverse interfaces (such as Web interfaces) for enhanced scalability and functionality. However, current approaches that heavily rely on repeated LLM-driven HTML parsing are computationally expensive and error-prone, particularly when handling dynamic… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07958v1-abstract-full').style.display = 'inline'; document.getElementById('2412.07958v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.07958v1-abstract-full" style="display: none;"> Modern AI assistants have made significant progress in natural language understanding and API/tool integration, with emerging efforts to incorporate diverse interfaces (such as Web interfaces) for enhanced scalability and functionality. However, current approaches that heavily rely on repeated LLM-driven HTML parsing are computationally expensive and error-prone, particularly when handling dynamic web interfaces and multi-step tasks. To overcome these challenges, we introduce PAFFA (Premeditated Actions For Fast Agents), a framework designed to enhance web interaction capabilities through an Action API Library of reusable, verified browser interaction functions. By pre-computing interaction patterns and employing two core methodologies - "Dist-Map" for task-agnostic element distillation and "Unravel" for incremental page-wise exploration - PAFFA reduces inference calls by 87% while maintaining robust performance even as website structures evolve. This framework accelerates multi-page task execution and offers a scalable solution to advance autonomous web agent research. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07958v1-abstract-full').style.display = 'none'; document.getElementById('2412.07958v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">9 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.07772</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> From Slow Bidirectional to Fast Causal Video Generators </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Yin%2C+T">Tianwei Yin</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Q">Qiang Zhang</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+R">Richard Zhang</a>, <a href="/search/cs?searchtype=author&query=Freeman%2C+W+T">William T. 
Freeman</a>, <a href="/search/cs?searchtype=author&query=Durand%2C+F">Fredo Durand</a>, <a href="/search/cs?searchtype=author&query=Shechtman%2C+E">Eli Shechtman</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xun Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.07772v1-abstract-short" style="display: inline;"> Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to a causal transformer that generates frames on-th… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07772v1-abstract-full').style.display = 'inline'; document.getElementById('2412.07772v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.07772v1-abstract-full" style="display: none;"> Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to a causal transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling a 50-step diffusion model into a 4-step generator. 
To enable stable and high-quality distillation, we introduce a student initialization scheme based on the teacher's ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model supports fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07772v1-abstract-full').style.display = 'none'; document.getElementById('2412.07772v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project Page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.07186</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Monte Carlo Tree Search based Space Transfer for Black-box Optimization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Wang%2C+S">Shukuan Wang</a>, <a href="/search/cs?searchtype=author&query=Xue%2C+K">Ke Xue</a>, <a href="/search/cs?searchtype=author&query=Song%2C+L">Lei Song</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaobin Huang</a>, <a href="/search/cs?searchtype=author&query=Qian%2C+C">Chao Qian</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.07186v1-abstract-short" style="display: inline;"> Bayesian optimization (BO) is a popular method for computationally expensive black-box optimization. However, traditional BO methods need to solve new problems from scratch, leading to slow convergence. 
Recent studies try to extend BO to a transfer learning setup to speed up the optimization, where search space transfer is one of the most promising approaches and has shown impressive performance o… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07186v1-abstract-full').style.display = 'inline'; document.getElementById('2412.07186v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.07186v1-abstract-full" style="display: none;"> Bayesian optimization (BO) is a popular method for computationally expensive black-box optimization. However, traditional BO methods need to solve new problems from scratch, leading to slow convergence. Recent studies try to extend BO to a transfer learning setup to speed up the optimization, where search space transfer is one of the most promising approaches and has shown impressive performance on many tasks. However, existing search space transfer methods either lack an adaptive mechanism or are not flexible enough, making it difficult to efficiently identify promising search space during the optimization process. In this paper, we propose a search space transfer learning method based on Monte Carlo tree search (MCTS), called MCTS-transfer, to iteratively divide, select, and optimize in a learned subspace. MCTS-transfer can not only provide a well-performing search space for warm-start but also adaptively identify and leverage the information of similar source tasks to reconstruct the search space during the optimization process. Experiments on synthetic functions, real-world problems, Design-Bench and hyper-parameter optimization show that MCTS-transfer can demonstrate superior performance compared to other search space transfer methods under different settings. Our code is available at \url{}. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07186v1-abstract-full').style.display = 'none'; document.getElementById('2412.07186v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2024 Spotlight</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.07163</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Fast Occupancy Network </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Lu%2C+M">Mingjie Lu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+Y">Yuanxian Huang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Ji Liu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xingliang Huang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+D">Dong Li</a>, <a href="/search/cs?searchtype=author&query=Peng%2C+J">Jinzhang Peng</a>, <a href="/search/cs?searchtype=author&query=Tian%2C+L">Lu Tian</a>, <a href="/search/cs?searchtype=author&query=Barsoum%2C+E">Emad Barsoum</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2412.07163v1-abstract-short" style="display: inline;"> Occupancy Network has recently attracted much attention in autonomous driving. Unlike monocular 3D detection and recent bird's-eye-view (BEV) models, which predict 3D bounding boxes of obstacles, an Occupancy Network predicts the category of each voxel in a specified 3D space around the ego vehicle by recasting the 3D detection task as a 3D voxel segmentation task, which offers clear advantages in tackling catego… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07163v1-abstract-full').style.display = 'inline'; document.getElementById('2412.07163v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.07163v1-abstract-full" style="display: none;"> Occupancy Network has recently attracted much attention in autonomous driving. Unlike monocular 3D detection and recent bird's-eye-view (BEV) models, which predict 3D bounding boxes of obstacles, an Occupancy Network predicts the category of each voxel in a specified 3D space around the ego vehicle by recasting the 3D detection task as a 3D voxel segmentation task, which offers clear advantages in tackling category outlier obstacles and providing a fine-grained 3D representation. However, existing Occupancy Network methods usually require far more computation than these earlier approaches, which hinders their deployment in intelligent driving systems. To address this problem, we analyze the bottleneck of Occupancy Network inference cost and present a simple and fast Occupancy Network model, which adopts a deformable 2D convolutional layer to lift BEV features to 3D voxel features and presents an efficient voxel feature pyramid network (FPN) module to improve performance at little computational cost. Further, we present a 2D segmentation branch in perspective view after the feature extractors that improves accuracy and is cost-free during the inference phase. 
Experimental results demonstrate that our method consistently outperforms existing methods in both accuracy and inference speed, surpassing the recent state-of-the-art (SOTA) OCCNet by 1.7% with a ResNet50 backbone while achieving about a 3X inference speedup. Furthermore, our method can be easily applied to existing BEV models to transform them into Occupancy Network models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.07163v1-abstract-full').style.display = 'none'; document.getElementById('2412.07163v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">10 pages, 5 figures,</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.06959</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Geophysics">physics.geo-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Geological and Well prior assisted full waveform inversion using conditional diffusion models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Wang%2C+F">Fu Wang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xinquan Huang</a>, <a href="/search/cs?searchtype=author&query=Alkhalifah%2C+T">Tariq Alkhalifah</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span 
class="abstract-short has-text-grey-dark mathjax" id="2412.06959v1-abstract-short" style="display: inline;"> Full waveform inversion (FWI) often faces challenges due to inadequate seismic observations, resulting in band-limited and geologically inaccurate inversion results. Incorporating prior information from potential velocity distributions, well-log information, and our geological knowledge and expectations can significantly improve FWI convergence to a realistic model. While diffusion-regularized FWI… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.06959v1-abstract-full').style.display = 'inline'; document.getElementById('2412.06959v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.06959v1-abstract-full" style="display: none;"> Full waveform inversion (FWI) often faces challenges due to inadequate seismic observations, resulting in band-limited and geologically inaccurate inversion results. Incorporating prior information from potential velocity distributions, well-log information, and our geological knowledge and expectations can significantly improve FWI convergence to a realistic model. While diffusion-regularized FWI has shown improved performance compared to conventional FWI by incorporating the velocity distribution prior, it can benefit even more by incorporating well-log information and other geological knowledge priors. To leverage this fact, we propose a geological class and well-information prior-assisted FWI using conditional diffusion models. This method seamlessly integrates multi-modal information into FWI, simultaneously achieving data fitting and universal geologic and geophysics prior matching, which is often not achieved with traditional regularization methods. 
Specifically, we propose to combine conditional diffusion models with FWI, where we integrate well-log data and geological class conditions into these conditional diffusion models using classifier-free guidance for multi-modal prior matching beyond the original velocity distribution prior. Numerical experiments on the OpenFWI datasets and field marine data demonstrate the effectiveness of our method compared to conventional FWI and the unconditional diffusion-regularized FWI. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.06959v1-abstract-full').style.display = 'none'; document.getElementById('2412.06959v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.06871</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Predicting Subway Passenger Flows under Incident Situation with Causality </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiannan Huang</a>, <a href="/search/cs?searchtype=author&query=Qiu%2C+S">Shuhan Qiu</a>, <a href="/search/cs?searchtype=author&query=Yuan%2C+Q">Quan Yuan</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+C">Chao Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short 
has-text-grey-dark mathjax" id="2412.06871v1-abstract-short" style="display: inline;"> In the context of rail transit operations, real-time passenger flow prediction is essential; however, most models primarily focus on normal conditions, with limited research addressing incident situations. There are several intrinsic challenges associated with prediction during incidents, such as a lack of interpretability and data scarcity. To address these challenges, we propose a two-stage meth… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.06871v1-abstract-full').style.display = 'inline'; document.getElementById('2412.06871v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.06871v1-abstract-full" style="display: none;"> In the context of rail transit operations, real-time passenger flow prediction is essential; however, most models primarily focus on normal conditions, with limited research addressing incident situations. There are several intrinsic challenges associated with prediction during incidents, such as a lack of interpretability and data scarcity. To address these challenges, we propose a two-stage method that separates predictions under normal conditions and the causal effects of incidents. First, a normal prediction model is trained using data from normal situations. Next, the synthetic control method is employed to identify the causal effects of incidents, combined with placebo tests to determine significant levels of these effects. The significant effects are then utilized to train a causal effect prediction model, which can forecast the impact of incidents based on features of the incidents and passenger flows. During the prediction phase, the results from both the normal situation model and the causal effect prediction model are integrated to generate final passenger flow predictions during incidents. 
Our approach is validated using real-world data, demonstrating improved accuracy. Furthermore, the two-stage methodology enhances interpretability. By analyzing the causal effect prediction model, we can identify key influencing factors related to the effects of incidents and gain insights into their underlying mechanisms. Our work can assist subway system managers in estimating passenger flow affected by incidents and enable them to take proactive measures. Additionally, it can deepen researchers' understanding of the impact of incidents on subway passenger flows. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.06871v1-abstract-full').style.display = 'none'; document.getElementById('2412.06871v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.06171</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhang%2C+H">Huaxin Zhang</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+X">Xiaohao Xu</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+X">Xiang Wang</a>, <a href="/search/cs?searchtype=author&query=Zuo%2C+J">Jialong Zuo</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaonan Huang</a>, <a href="/search/cs?searchtype=author&query=Gao%2C+C">Changxin Gao</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+S">Shanjun Zhang</a>, <a href="/search/cs?searchtype=author&query=Yu%2C+L">Li Yu</a>, <a href="/search/cs?searchtype=author&query=Sang%2C+N">Nong Sang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.06171v1-abstract-short" style="display: inline;"> How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. 
Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both sho… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.06171v1-abstract-full').style.display = 'inline'; document.getElementById('2412.06171v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.06171v1-abstract-full" style="display: none;"> How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. 
Our benchmark and model are publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.06171v1-abstract-full').style.display = 'none'; document.getElementById('2412.06171v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">21 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.04292</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+Z">Zhenglin Huang</a>, <a href="/search/cs?searchtype=author&query=Hu%2C+J">Jinwei Hu</a>, <a href="/search/cs?searchtype=author&query=Li%2C+X">Xiangtai Li</a>, <a href="/search/cs?searchtype=author&query=He%2C+Y">Yiwei He</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+X">Xingyu Zhao</a>, <a href="/search/cs?searchtype=author&query=Peng%2C+B">Bei Peng</a>, <a href="/search/cs?searchtype=author&query=Wu%2C+B">Baoyuan Wu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaowei Huang</a>, <a href="/search/cs?searchtype=author&query=Cheng%2C+G">Guangliang 
Cheng</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.04292v1-abstract-short" style="display: inline;"> The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dat… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.04292v1-abstract-full').style.display = 'inline'; document.getElementById('2412.04292v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.04292v1-abstract-full" style="display: none;"> The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. 
Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. The code, model, and dataset will be released. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.04292v1-abstract-full').style.display = 'none'; document.getElementById('2412.04292v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.03571</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Style3D: Attention-guided Multi-view Style Transfer for 3D Object Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Song%2C+B">Bingjie Song</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xin Huang</a>, <a href="/search/cs?searchtype=author&query=Xie%2C+R">Ruting Xie</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+X">Xue Wang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Q">Qing Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.03571v1-abstract-short" style="display: inline;"> We present Style3D, a novel approach for generating stylized 3D objects from a content image and a style image. Unlike most previous methods that require case- or style-specific training, Style3D supports instant 3D object stylization. Our key insight is that 3D object stylization can be decomposed into two interconnected processes: multi-view dual-feature alignment and sparse-view spatial reconst… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.03571v1-abstract-full').style.display = 'inline'; document.getElementById('2412.03571v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.03571v1-abstract-full" style="display: none;"> We present Style3D, a novel approach for generating stylized 3D objects from a content image and a style image. 
Unlike most previous methods that require case- or style-specific training, Style3D supports instant 3D object stylization. Our key insight is that 3D object stylization can be decomposed into two interconnected processes: multi-view dual-feature alignment and sparse-view spatial reconstruction. We introduce MultiFusion Attention, an attention-guided technique to achieve multi-view stylization from the content-style pair. Specifically, the query features from the content image preserve geometric consistency across multiple views, while the key and value features from the style image are used to guide the stylistic transfer. This dual-feature alignment ensures that spatial coherence and stylistic fidelity are maintained across multi-view images. Finally, a large 3D reconstruction model is introduced to generate coherent stylized 3D objects. By establishing an interplay between structural and stylistic features across multiple views, our approach enables a holistic 3D stylization process. Extensive experiments demonstrate that Style3D offers a more flexible and scalable solution for generating style-consistent 3D assets, surpassing existing methods in both computational efficiency and visual quality. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.03571v1-abstract-full').style.display = 'none'; document.getElementById('2412.03571v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.03563</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computers and Society">cs.CY</span> </div> </div> <p class="title is-5 mathjax"> From Individual to Society: A Survey on Social Simulation Driven by Large Language Model-based Agents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Mou%2C+X">Xinyi Mou</a>, <a href="/search/cs?searchtype=author&query=Ding%2C+X">Xuanwen Ding</a>, <a href="/search/cs?searchtype=author&query=He%2C+Q">Qi He</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+L">Liang Wang</a>, <a href="/search/cs?searchtype=author&query=Liang%2C+J">Jingcong Liang</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+X">Xinnong Zhang</a>, <a href="/search/cs?searchtype=author&query=Sun%2C+L">Libo Sun</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+J">Jiayu Lin</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+J">Jie Zhou</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xuanjing Huang</a>, <a href="/search/cs?searchtype=author&query=Wei%2C+Z">Zhongyu Wei</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.03563v1-abstract-short" style="display: inline;"> Traditional sociological research often relies on human participation, which, though effective, is expensive, challenging to scale, and with ethical concerns. 
Recent advancements in large language models (LLMs) highlight their potential to simulate human behavior, enabling the replication of individual responses and facilitating many interdisciplinary studies. In this paper, we conduct… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.03563v1-abstract-full').style.display = 'inline'; document.getElementById('2412.03563v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.03563v1-abstract-full" style="display: none;"> Traditional sociological research often relies on human participation, which, though effective, is expensive, challenging to scale, and with ethical concerns. Recent advancements in large language models (LLMs) highlight their potential to simulate human behavior, enabling the replication of individual responses and facilitating many interdisciplinary studies. In this paper, we conduct a comprehensive survey of this field, illustrating the recent progress in simulation driven by LLM-empowered agents. We categorize the simulations into three types: (1) Individual Simulation, which mimics specific individuals or demographic groups; (2) Scenario Simulation, where multiple agents collaborate to achieve goals within specific contexts; and (3) Society Simulation, which models interactions within agent societies to reflect the complexity and variety of real-world dynamics. These simulations follow a progression, ranging from detailed individual modeling to large-scale societal phenomena. We provide a detailed discussion of each simulation type, including the architecture or key components of the simulation, the classification of objectives or scenarios, and the evaluation method. Afterward, we summarize commonly used datasets and benchmarks. Finally, we discuss the trends across these three types of simulation. A repository for the related sources is at {\url{}}. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.03563v1-abstract-full').style.display = 'none'; document.getElementById('2412.03563v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.03175</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> </div> </div> <p class="title is-5 mathjax"> WMMSE-Based Joint Transceiver Design for Multi-RIS Assisted Cell-free Networks Using Hybrid CSI </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Pan%2C+X">Xuesong Pan</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+Z">Zhong Zheng</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xueqing Huang</a>, <a href="/search/cs?searchtype=author&query=Fei%2C+Z">Zesong Fei</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.03175v1-abstract-short" style="display: inline;"> In this paper, we consider cell-free communication systems with several access points (APs) serving terrestrial users (UEs) simultaneously. 
To enhance the uplink multi-user multiple-input multiple-output communications, we adopt a hybrid-CSI-based two-layer distributed multi-user detection scheme comprising the local minimum mean-squared error (MMSE) detection at APs and the one-shot weighted comb… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.03175v1-abstract-full').style.display = 'inline'; document.getElementById('2412.03175v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.03175v1-abstract-full" style="display: none;"> In this paper, we consider cell-free communication systems with several access points (APs) serving terrestrial users (UEs) simultaneously. To enhance the uplink multi-user multiple-input multiple-output communications, we adopt a hybrid-CSI-based two-layer distributed multi-user detection scheme comprising the local minimum mean-squared error (MMSE) detection at APs and the one-shot weighted combining at the central processing unit (CPU). Furthermore, to improve the propagation environment, we introduce multiple reconfigurable intelligent surfaces (RISs) to assist the transmissions from UEs to APs. Aiming to maximize the weighted sum rate, we formulate the weighted sum-MMSE (WMMSE) problem, where the UEs' beamforming matrices, the CPU's weighted combining matrix, and the RISs' phase-shifting matrices are alternately optimized. Considering the limited fronthaul capacity constraint in cell-free networks, we resort to the operator-valued free probability theory to derive the asymptotic alternating optimization (AO) algorithm to solve the WMMSE problem, which only depends on long-term channel statistics and thus reduces the interaction overhead. Numerical results demonstrate that the asymptotic AO algorithm can achieve a high communication rate as well as reduce the interaction overhead. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.03175v1-abstract-full').style.display = 'none'; document.getElementById('2412.03175v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.02211</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Liang%2C+Y">Yaxin Liang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+X">Xinshi Li</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xin Huang</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Z">Ziqi Zhang</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+Y">Yue Yao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.02211v1-abstract-short" style="display: inline;"> This study proposes an automated data mining framework based on autoencoders and experimentally verifies its effectiveness in feature extraction and data dimensionality reduction. 
Through the encoding-decoding structure, the autoencoder can capture the data's potential characteristics and achieve noise reduction and anomaly detection, providing an efficient and stable solution for the data mining… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.02211v1-abstract-full').style.display = 'inline'; document.getElementById('2412.02211v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.02211v1-abstract-full" style="display: none;"> This study proposes an automated data mining framework based on autoencoders and experimentally verifies its effectiveness in feature extraction and data dimensionality reduction. Through the encoding-decoding structure, the autoencoder can capture the data's potential characteristics and achieve noise reduction and anomaly detection, providing an efficient and stable solution for the data mining process. The experiment compared the performance of the autoencoder with traditional dimensionality reduction methods (such as PCA, FA, T-SNE, and UMAP). The results showed that the autoencoder performed best in terms of reconstruction error and root mean square error and could better retain data structure and enhance the generalization ability of the model. The autoencoder-based framework not only reduces manual intervention but also significantly improves the automation of data processing. In the future, with the advancement of deep learning and big data technology, the autoencoder method combined with a generative adversarial network (GAN) or graph neural network (GNN) is expected to be more widely used in the fields of complex data processing, real-time data analysis and intelligent decision-making. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.02211v1-abstract-full').style.display = 'none'; document.getElementById('2412.02211v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.02205</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Databases">cs.DB</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> DataLab: A Unified Platform for LLM-Powered Business Intelligence </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Weng%2C+L">Luoxuan Weng</a>, <a href="/search/cs?searchtype=author&query=Tang%2C+Y">Yinghao Tang</a>, <a href="/search/cs?searchtype=author&query=Feng%2C+Y">Yingchaojie Feng</a>, <a href="/search/cs?searchtype=author&query=Chang%2C+Z">Zhuo Chang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+P">Peng Chen</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+R">Ruiqin Chen</a>, <a href="/search/cs?searchtype=author&query=Feng%2C+H">Haozhe Feng</a>, <a href="/search/cs?searchtype=author&query=Hou%2C+C">Chen Hou</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+D">Danqing Huang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+Y">Yang Li</a>, <a href="/search/cs?searchtype=author&query=Rao%2C+H">Huaming Rao</a>, <a 
href="/search/cs?searchtype=author&query=Wang%2C+H">Haonan Wang</a>, <a href="/search/cs?searchtype=author&query=Wei%2C+C">Canshi Wei</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+X">Xiaofeng Yang</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Y">Yuhui Zhang</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+Y">Yifeng Zheng</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiuqi Huang</a>, <a href="/search/cs?searchtype=author&query=Zhu%2C+M">Minfeng Zhu</a>, <a href="/search/cs?searchtype=author&query=Ma%2C+Y">Yuxin Ma</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+B">Bin Cui</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+W">Wei Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.02205v2-abstract-short" style="display: inline;"> Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily fo… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.02205v2-abstract-full').style.display = 'inline'; document.getElementById('2412.02205v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.02205v2-abstract-full" style="display: none;"> Business intelligence (BI) transforms large volumes of data within modern organizations into actionable insights for informed decision-making. 
Recently, large language model (LLM)-based agents have streamlined the BI workflow by automatically performing task planning, reasoning, and actions in executable environments based on natural language (NL) queries. However, existing approaches primarily focus on individual BI tasks such as NL2SQL and NL2VIS. The fragmentation of tasks across different data roles and tools leads to inefficiencies and potential errors due to the iterative and collaborative nature of BI. In this paper, we introduce DataLab, a unified BI platform that integrates a one-stop LLM-based agent framework with an augmented computational notebook interface. DataLab supports a wide range of BI tasks for different data roles by seamlessly combining LLM assistance with user customization within a single environment. To achieve this unification, we design a domain knowledge incorporation module tailored for enterprise-specific BI tasks, an inter-agent communication mechanism to facilitate information sharing across the BI workflow, and a cell-based context management strategy to enhance context utilization efficiency in BI notebooks. Extensive experiments demonstrate that DataLab achieves state-of-the-art performance on various BI tasks across popular research benchmarks. Moreover, DataLab maintains high effectiveness and efficiency on real-world datasets from Tencent, achieving up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.02205v2-abstract-full').style.display = 'none'; document.getElementById('2412.02205v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 3 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.01930</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> PROFIT: A Specialized Optimizer for Deep Fine Tuning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Chakravarthy%2C+A+S">Anirudh S Chakravarthy</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+S+K">Shuai Kyle Zheng</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xin Huang</a>, <a href="/search/cs?searchtype=author&query=Hemachandra%2C+S">Sachithra Hemachandra</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+X">Xiao Zhang</a>, <a href="/search/cs?searchtype=author&query=Chai%2C+Y">Yuning Chai</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Z">Zhao Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.01930v2-abstract-short" style="display: inline;"> Fine-tuning pre-trained models has become invaluable in computer vision and robotics. 
Recent fine-tuning approaches focus on improving efficiency rather than accuracy by using a mixture of smaller learning rates or frozen backbones. To return the spotlight to model accuracy, we present PROFIT (Proximally Restricted Optimizer For Iterative Training), one of the first optimizers specifically designe… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.01930v2-abstract-full').style.display = 'inline'; document.getElementById('2412.01930v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.01930v2-abstract-full" style="display: none;"> Fine-tuning pre-trained models has become invaluable in computer vision and robotics. Recent fine-tuning approaches focus on improving efficiency rather than accuracy by using a mixture of smaller learning rates or frozen backbones. To return the spotlight to model accuracy, we present PROFIT (Proximally Restricted Optimizer For Iterative Training), one of the first optimizers specifically designed for incrementally fine-tuning converged models on new tasks or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions due to random initialization, PROFIT leverages the structure of a converged model to regularize the optimization process, leading to improved results. By employing a simple temporal gradient orthogonalization process, PROFIT outperforms traditional fine-tuning methods across various tasks: image classification, representation learning, and large-scale motion prediction. Moreover, PROFIT is encapsulated within the optimizer logic, making it easily integrated into any training pipeline with minimal engineering effort. A new class of fine-tuning optimizers like PROFIT can drive advancements as fine-tuning and incremental training become increasingly prevalent, reducing reliance on costly model training from scratch. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.01930v2-abstract-full').style.display = 'none'; document.getElementById('2412.01930v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 2 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">technical report</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.01650</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Dong%2C+W">Wenhan Dong</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+C">Chao Lin</a>, <a href="/search/cs?searchtype=author&query=He%2C+X">Xinlei He</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xinyi Huang</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+S">Shengmin Xu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2412.01650v2-abstract-short" style="display: inline;"> Privacy-preserving federated learning (PPFL) aims to train a global model for multiple clients while maintaining their data privacy. However, current PPFL protocols exhibit one or more of the following insufficiencies: considerable degradation in accuracy, the requirement for sharing keys, and cooperation during the key generation or decryption processes. As a mitigation, we develop the first prot… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.01650v2-abstract-full').style.display = 'inline'; document.getElementById('2412.01650v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.01650v2-abstract-full" style="display: none;"> Privacy-preserving federated learning (PPFL) aims to train a global model for multiple clients while maintaining their data privacy. However, current PPFL protocols exhibit one or more of the following insufficiencies: considerable degradation in accuracy, the requirement for sharing keys, and cooperation during the key generation or decryption processes. As a mitigation, we develop the first protocol that utilizes neural networks to implement PPFL, as well as incorporating an Aggregatable Hybrid Encryption scheme tailored to the needs of PPFL. We name these networks as Homomorphic Adversarial Networks (HANs) which demonstrate that neural networks are capable of performing tasks similar to multi-key homomorphic encryption (MK-HE) while solving the problems of key distribution and collaborative decryption. Our experiments show that HANs are robust against privacy attacks. Compared with non-private federated learning, experiments conducted on multiple datasets demonstrate that HANs exhibit a negligible accuracy loss (at most 1.35%). 
Compared to traditional MK-HE schemes, HANs increase encryption aggregation speed by 6,075 times while incurring a 29.2 times increase in communication overhead. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.01650v2-abstract-full').style.display = 'none'; document.getElementById('2412.01650v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 2 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.00719</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhao%2C+S">Shuling Zhao</a>, <a href="/search/cs?searchtype=author&query=Hong%2C+F">Fa-Ting Hong</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaoshui Huang</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+D">Dan Xu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.00719v1-abstract-short" style="display: inline;"> Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video. 
Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly co… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.00719v1-abstract-full').style.display = 'inline'; document.getElementById('2412.00719v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.00719v1-abstract-full" style="display: none;"> Talking head video generation aims to generate a realistic talking head video that preserves the person's identity from a source image and the motion from a driving video. Despite the promising progress made in the field, it remains a challenging and critical problem to generate videos with accurate poses and fine-grained facial details simultaneously. Essentially, facial motion is often highly complex to model precisely, and the one-shot source face image cannot provide sufficient appearance guidance during generation due to dynamic pose changes. To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. Specifically, the designed multi-scale motion and appearance codebooks are learned simultaneously in a unified framework to store representative global facial motion flow and appearance patterns. Then, we present a novel multi-scale motion and appearance compensation module, which utilizes a transformer-based codebook retrieval strategy to query complementary information from the two codebooks for joint motion and appearance compensation. 
The entire process produces motion flows of greater flexibility and appearance features with fewer distortions across different scales, resulting in a high-quality talking head video generation framework. Extensive experiments on various benchmarks validate the effectiveness of our approach and demonstrate superior generation results from both qualitative and quantitative perspectives when compared to state-of-the-art competitors. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.00719v1-abstract-full').style.display = 'none'; document.getElementById('2412.00719v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2412.00087</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computational Engineering, Finance, and Science">cs.CE</span> </div> </div> <p class="title is-5 mathjax"> ONION: Physics-Informed Deep Learning Model for Line Integral Diagnostics Across Fusion Devices </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Wang%2C+C">Cong Wang</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+W">Weizhe Yang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+H">Haiping Wang</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+R">Renjie 
Yang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+J">Jing Li</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Z">Zhijun Wang</a>, <a href="/search/cs?searchtype=author&query=Yu%2C+X">Xinyao Yu</a>, <a href="/search/cs?searchtype=author&query=Wei%2C+Y">Yixiong Wei</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xianli Huang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Z">Zhaoyang Liu</a>, <a href="/search/cs?searchtype=author&query=Zou%2C+C">Changqing Zou</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+Z">Zhifeng Zhao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2412.00087v1-abstract-short" style="display: inline;"> This paper introduces a Physics-Informed model architecture that can be adapted to various backbone networks. The model incorporates physical information as additional input and is constrained by a Physics-Informed loss function. Experimental results demonstrate that the additional input of physical information substantially improves the model's ability, with an increase in performance observed. Besi… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.00087v1-abstract-full').style.display = 'inline'; document.getElementById('2412.00087v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2412.00087v1-abstract-full" style="display: none;"> This paper introduces a Physics-Informed model architecture that can be adapted to various backbone networks. The model incorporates physical information as additional input and is constrained by a Physics-Informed loss function. Experimental results demonstrate that the additional input of physical information substantially improves the model's ability, with an increase in performance observed. 
Besides, the adoption of the Softplus activation function in the final two fully connected layers significantly enhances model performance. The incorporation of a Physics-Informed loss function has been shown to correct the model's predictions, bringing the back-projections closer to the actual inputs and reducing the errors associated with inversion algorithms. In this work, we have developed a Phantom Data Model to generate customized line integral diagnostic datasets and have also collected SXR diagnostic datasets from EAST and HL-2A. The code, models, and some datasets are publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2412.00087v1-abstract-full').style.display = 'none'; document.getElementById('2412.00087v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.19939</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> VLSBench: Unveiling Visual Leakage in Multimodal Safety </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Hu%2C+X">Xuhao Hu</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+D">Dongrui Liu</a>, <a href="/search/cs?searchtype=author&query=Li%2C+H">Hao Li</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xuanjing Huang</a>, <a href="/search/cs?searchtype=author&query=Shao%2C+J">Jing Shao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.19939v1-abstract-short" style="display: inline;"> Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs trained with image-text pairs. 
To explain such a counter-intuitive phenomenon, we discover a visual safety… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.19939v1-abstract-full').style.display = 'inline'; document.getElementById('2411.19939v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.19939v1-abstract-full" style="display: none;"> Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs trained with image-text pairs. To explain such a counter-intuitive phenomenon, we discover a visual safety information leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky and sensitive content in the image has been revealed in the textual query. In this way, MLLMs can easily refuse these sensitive text-image queries according to textual queries. However, image-text pairs without VSIL are common in real-world scenarios and are overlooked by existing multimodal safety benchmarks. To this end, we construct a multimodal visual leakless safety benchmark (VLSBench) with 2.4k image-text pairs, which prevents visual safety leakage from image to textual query. Experimental results indicate that VLSBench poses a significant challenge to both open-source and closed-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o. This study demonstrates that textual alignment is enough for multimodal safety scenarios with VSIL, while multimodal alignment is a more promising solution for multimodal safety scenarios without VSIL. 
Please see our code and data at: <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.19939v1-abstract-full').style.display = 'none'; document.getElementById('2411.19939v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.19530</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Quantized Delta Weight Is Safety Keeper </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Liu%2C+Y">Yule Liu</a>, <a href="/search/cs?searchtype=author&query=Sun%2C+Z">Zhen Sun</a>, <a href="/search/cs?searchtype=author&query=He%2C+X">Xinlei He</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xinyi Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.19530v1-abstract-short" style="display: inline;"> Recent advancements in fine-tuning proprietary language models enable customized applications across various domains but also introduce two major challenges: high resource demands and security risks. 
Regarding resource demands, recent work proposes novel partial compression, such as BitDelta, to quantize the delta weights between the fine-tuned model and base model. Regarding the security risks, u… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.19530v1-abstract-full').style.display = 'inline'; document.getElementById('2411.19530v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.19530v1-abstract-full" style="display: none;"> Recent advancements in fine-tuning proprietary language models enable customized applications across various domains but also introduce two major challenges: high resource demands and security risks. Regarding resource demands, recent work proposes novel partial compression, such as BitDelta, to quantize the delta weights between the fine-tuned model and base model. Regarding the security risks, user-defined fine-tuning can introduce security vulnerabilities, such as alignment issues, backdoor attacks, and hallucinations. However, most of the current efforts in security assessment focus on full-precision or fully compressed models, and how partial compression methods affect security concerns has not been well discussed. To bridge this gap, we evaluate the robustness of delta-weight quantization against these security threats. In this paper, we uncover a "free lunch" phenomenon: partial compression can enhance model security against fine-tuning-based attacks with bearable utility loss. Using Llama-2-7b-chat as a case study, we show that, with under 10% utility degradation, the partial compression mitigates alignment-breaking risks by up to 66.17%, harmful backdoor vulnerabilities by 64.46%, and targeted output manipulation risks by up to 90.53%. 
We further apply LogitLens to visualize internal state transformations during forward passes, suggesting mechanisms for both security failure and recovery in standard versus compressed fine-tuning. This work offers new insights into selecting effective delta compression methods for secure, resource-efficient multi-tenant services. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.19530v1-abstract-full').style.display = 'none'; document.getElementById('2411.19530v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.19083</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Fu%2C+Y">Yuqian Fu</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+R">Runze Wang</a>, <a href="/search/cs?searchtype=author&query=Fu%2C+Y">Yanwei Fu</a>, <a href="/search/cs?searchtype=author&query=Paudel%2C+D+P">Danda Pani Paudel</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xuanjing Huang</a>, <a href="/search/cs?searchtype=author&query=Van+Gool%2C+L">Luc Van Gool</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.19083v1-abstract-short" style="display: inline;"> In this paper, we focus on the Ego-Exo Object Correspondence task, an emerging challenge in the field of computer vision that aims to map objects across ego-centric and exo-centric views. We introduce ObjectRelator, a novel method designed to tackle this task, featuring two new modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse effectively f… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.19083v1-abstract-full').style.display = 'inline'; document.getElementById('2411.19083v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.19083v1-abstract-full" style="display: none;"> In this paper, we focus on the Ego-Exo Object Correspondence task, an emerging challenge in the field of computer vision that aims to map objects across ego-centric and exo-centric views. We introduce ObjectRelator, a novel method designed to tackle this task, featuring two new modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse effectively fuses language and visual conditions to enhance target object localization, while XObjAlign enforces consistency in object representations across views through a self-supervised alignment strategy. Extensive experiments demonstrate the effectiveness of ObjectRelator, achieving state-of-the-art performance on Ego2Exo and Exo2Ego tasks with minimal additional parameters. This work provides a foundation for future research in comprehensive cross-view object relation understanding, highlighting the potential of leveraging multimodal guidance and cross-view alignment. Code and models will be released to advance further research in this direction. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.19083v1-abstract-full').style.display = 'none'; document.getElementById('2411.19083v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.18329</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> </div> </div> <p class="title is-5 mathjax"> Two-Timescale Digital Twin Assisted Model Interference and Retraining over Wireless Network </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Cong%2C+J">Jiayi Cong</a>, <a href="/search/cs?searchtype=author&query=Cheng%2C+G">Guoliang Cheng</a>, <a href="/search/cs?searchtype=author&query=You%2C+C">Changsheng You</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xinyu Huang</a>, <a href="/search/cs?searchtype=author&query=Wu%2C+W">Wen Wu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.18329v1-abstract-short" style="display: inline;"> In this paper, we investigate a resource allocation and model retraining problem for dynamic wireless networks by utilizing incremental learning, in which the digital twin (DT) scheme is employed for decision making. 
A two-timescale framework is proposed for computation resource allocation, mobile user association, and incremental training of user models. To obtain an optimal resource allocation a… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.18329v1-abstract-full').style.display = 'inline'; document.getElementById('2411.18329v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.18329v1-abstract-full" style="display: none;"> In this paper, we investigate a resource allocation and model retraining problem for dynamic wireless networks by utilizing incremental learning, in which the digital twin (DT) scheme is employed for decision making. A two-timescale framework is proposed for computation resource allocation, mobile user association, and incremental training of user models. To obtain an optimal resource allocation and incremental learning policy, we propose an efficient two-timescale scheme based on hybrid DT-physical architecture with the objective to minimize long-term system delay. Specifically, in the large-timescale, base stations will update the user association and implement incremental learning decisions based on statistical state information from the DT system. Then, in the short timescale, an effective computation resource allocation and incremental learning data generated from the DT system is designed based on deep reinforcement learning (DRL), thus reducing the network system's delay in data transmission, data computation, and model retraining steps. Simulation results demonstrate the effectiveness of the proposed two-timescale scheme compared with benchmark schemes. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.18329v1-abstract-full').style.display = 'none'; document.getElementById('2411.18329v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">6 pages, 4 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.17453</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> </div> <p class="title is-5 mathjax"> PEFTGuard: Detecting Backdoor Attacks Against Parameter-Efficient Fine-Tuning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Sun%2C+Z">Zhen Sun</a>, <a href="/search/cs?searchtype=author&query=Cong%2C+T">Tianshuo Cong</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Y">Yule Liu</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+C">Chenhao Lin</a>, <a href="/search/cs?searchtype=author&query=He%2C+X">Xinlei He</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+R">Rongmao Chen</a>, <a href="/search/cs?searchtype=author&query=Han%2C+X">Xingshuo Han</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xinyi Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.17453v1-abstract-short" style="display: inline;"> Fine-tuning is an essential process to 
improve the performance of Large Language Models (LLMs) in specific domains, with Parameter-Efficient Fine-Tuning (PEFT) gaining popularity due to its capacity to reduce computational demands through the integration of low-rank adapters. These lightweight adapters, such as LoRA, can be shared and utilized on open-source platforms. However, adversaries could e… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.17453v1-abstract-full').style.display = 'inline'; document.getElementById('2411.17453v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.17453v1-abstract-full" style="display: none;"> Fine-tuning is an essential process to improve the performance of Large Language Models (LLMs) in specific domains, with Parameter-Efficient Fine-Tuning (PEFT) gaining popularity due to its capacity to reduce computational demands through the integration of low-rank adapters. These lightweight adapters, such as LoRA, can be shared and utilized on open-source platforms. However, adversaries could exploit this mechanism to inject backdoors into these adapters, resulting in malicious behaviors like incorrect or harmful outputs, which pose serious security risks to the community. Unfortunately, few of the current efforts concentrate on analyzing the backdoor patterns or detecting the backdoors in the adapters. To fill this gap, we first construct (and will release) PADBench, a comprehensive benchmark that contains 13,300 benign and backdoored adapters fine-tuned with various datasets, attack strategies, PEFT methods, and LLMs. Moreover, we propose PEFTGuard, the first backdoor detection framework against PEFT-based adapters. Extensive evaluation upon PADBench shows that PEFTGuard outperforms existing detection methods, achieving nearly perfect detection accuracy (100%) in most cases. 
Notably, PEFTGuard exhibits zero-shot transferability across three aspects, including different attacks, PEFT methods, and adapter ranks. In addition, we consider various adaptive attacks to demonstrate the high robustness of PEFTGuard. We further explore several possible backdoor mitigation defenses, finding fine-mixing to be the most effective method. We envision that our benchmark and method can shed light on future LLM backdoor detection research. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.17453v1-abstract-full').style.display = 'none'; document.getElementById('2411.17453v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">20 pages, 8 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.16807</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Atmospheric and Oceanic Physics">physics.ao-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> ADAF: An Artificial Intelligence Data Assimilation Framework for Weather Forecasting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Xiang%2C+Y">Yanfei Xiang</a>, <a href="/search/cs?searchtype=author&query=Jin%2C+W">Weixin Jin</a>, <a 
href="/search/cs?searchtype=author&query=Dong%2C+H">Haiyu Dong</a>, <a href="/search/cs?searchtype=author&query=Bai%2C+M">Mingliang Bai</a>, <a href="/search/cs?searchtype=author&query=Fang%2C+Z">Zuliang Fang</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+P">Pengcheng Zhao</a>, <a href="/search/cs?searchtype=author&query=Sun%2C+H">Hongyu Sun</a>, <a href="/search/cs?searchtype=author&query=Thambiratnam%2C+K">Kit Thambiratnam</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Q">Qi Zhang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaomeng Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.16807v1-abstract-short" style="display: inline;"> The forecasting skill of numerical weather prediction (NWP) models critically depends on the accurate initial conditions, also known as analysis, provided by data assimilation (DA). Traditional DA methods often face a trade-off between computational cost and accuracy due to complex linear algebra computations and the high dimensionality of the model, especially in nonlinear systems. Moreover, proc… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16807v1-abstract-full').style.display = 'inline'; document.getElementById('2411.16807v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.16807v1-abstract-full" style="display: none;"> The forecasting skill of numerical weather prediction (NWP) models critically depends on the accurate initial conditions, also known as analysis, provided by data assimilation (DA). Traditional DA methods often face a trade-off between computational cost and accuracy due to complex linear algebra computations and the high dimensionality of the model, especially in nonlinear systems. 
Moreover, processing massive data in real-time requires substantial computational resources. To address this, we introduce an artificial intelligence-based data assimilation framework (ADAF) to generate high-quality kilometer-scale analysis. This study is the pioneering work using real-world observations from varied locations and multiple sources to verify the AI method's efficacy in DA, including sparse surface weather observations and satellite imagery. We implemented ADAF for four near-surface variables in the Contiguous United States (CONUS). The results indicate that ADAF surpasses the High Resolution Rapid Refresh Data Assimilation System (HRRRDAS) in accuracy by 16% to 33% for near-surface atmospheric conditions, aligning more closely with actual observations, and can effectively reconstruct extreme events, such as tropical cyclone wind fields. Sensitivity experiments reveal that ADAF can generate high-quality analysis even with low-accuracy backgrounds and extremely sparse surface observations. ADAF can assimilate massive observations within a three-hour window at low computational cost, taking about two seconds on an AMD MI200 graphics processing unit (GPU). ADAF has been shown to be efficient and effective in real-world DA, underscoring its potential role in operational weather forecasting. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16807v1-abstract-full').style.display = 'none'; document.getElementById('2411.16807v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">29 pages, 15 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.16579</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Xi%2C+Z">Zhiheng Xi</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+D">Dingwen Yang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+J">Jixuan Huang</a>, <a href="/search/cs?searchtype=author&query=Tang%2C+J">Jiafu Tang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+G">Guanyu Li</a>, <a href="/search/cs?searchtype=author&query=Ding%2C+Y">Yiwen Ding</a>, <a href="/search/cs?searchtype=author&query=He%2C+W">Wei He</a>, <a href="/search/cs?searchtype=author&query=Hong%2C+B">Boyang Hong</a>, <a href="/search/cs?searchtype=author&query=Do%2C+S">Shihan Do</a>, <a href="/search/cs?searchtype=author&query=Zhan%2C+W">Wenyu Zhan</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+X">Xiao Wang</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+R">Rui Zheng</a>, <a href="/search/cs?searchtype=author&query=Ji%2C+T">Tao Ji</a>, <a href="/search/cs?searchtype=author&query=Shi%2C+X">Xiaowei Shi</a>, <a href="/search/cs?searchtype=author&query=Zhai%2C+Y">Yitao Zhai</a>, <a 
href="/search/cs?searchtype=author&query=Weng%2C+R">Rongxiang Weng</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+J">Jingang Wang</a>, <a href="/search/cs?searchtype=author&query=Cai%2C+X">Xunliang Cai</a>, <a href="/search/cs?searchtype=author&query=Gui%2C+T">Tao Gui</a>, <a href="/search/cs?searchtype=author&query=Wu%2C+Z">Zuxuan Wu</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Q">Qi Zhang</a>, <a href="/search/cs?searchtype=author&query=Qiu%2C+X">Xipeng Qiu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xuanjing Huang</a>, <a href="/search/cs?searchtype=author&query=Jiang%2C+Y">Yu-Gang Jiang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.16579v1-abstract-short" style="display: inline;"> Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors su… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16579v1-abstract-full').style.display = 'inline'; document.getElementById('2411.16579v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.16579v1-abstract-full" style="display: none;"> Training large language models (LLMs) to spend more time thinking and reflecting before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. 
However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model's capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we delve into a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model during both test-time and train-time. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data, resulting in a dataset of $76,321$ responses paired with step-level feedback. Fine-tuning language models with this dataset enables them to generate natural language feedback for mathematical reasoning. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time, especially when scaling up inference-time computation. Motivated by these findings, we introduce the critique-based supervision to the actor's self-training process, and propose a critique-in-the-loop self-improvement method. Experiments show that the method improves the actor's exploration efficiency and solution diversity, especially on challenging queries, leading to a stronger reasoning model. Lastly, we take the preliminary step to explore training self-talk reasoning models via critique supervision and showcase its potential. Our code and datasets are at \href{}{}. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16579v1-abstract-full').style.display = 'none'; document.getElementById('2411.16579v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Preprint</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.16308</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Qu%2C+W">Wentao Qu</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+J">Jing Wang</a>, <a href="/search/cs?searchtype=author&query=Gong%2C+Y">YongShun Gong</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaoshui Huang</a>, <a href="/search/cs?searchtype=author&query=Xiao%2C+L">Liang Xiao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.16308v2-abstract-short" style="display: inline;"> Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. 
This also results in longer training and inference time for DDPMs compared to non-DD… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16308v2-abstract-full').style.display = 'inline'; document.getElementById('2411.16308v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.16308v2-abstract-full" style="display: none;"> Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic \textbf{Seg}mentation \textbf{Net}work based on a \textbf{C}onditional-Noise Framework (CNF) of D\textbf{D}PMs, named \textbf{CDSegNet}. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single-step inference like non-DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.16308v2-abstract-full').style.display = 'none'; document.getElementById('2411.16308v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 25 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.15320</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Friedland%2C+G">Gerald Friedland</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xin Huang</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+Y">Yueying Cui</a>, <a href="/search/cs?searchtype=author&query=Kapoor%2C+V">Vishaal Kapoor</a>, <a href="/search/cs?searchtype=author&query=Khetan%2C+A">Ashish Khetan</a>, <a href="/search/cs?searchtype=author&query=Das%2C+S">Sanjiv Das</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.15320v1-abstract-short" style="display: inline;"> We propose PPLqa, an easy to compute, language independent, information-theoretic metric to measure 
the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enable users to rank generative language models for quality of responses, so as to make a selection of the best model fo… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15320v1-abstract-full').style.display = 'inline'; document.getElementById('2411.15320v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.15320v1-abstract-full" style="display: none;"> We propose PPLqa, an easy to compute, language independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enable users to rank generative language models for quality of responses, so as to make a selection of the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response) to the query. PPLqa performs as well as other related metrics, and works better with long-form Q&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15320v1-abstract-full').style.display = 'none'; document.getElementById('2411.15320v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">ACM Class:</span> I.2.m; E.4 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.15252</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Deng%2C+F">Fan Deng</a>, <a href="/search/cs?searchtype=author&query=Wu%2C+Y">Yaguang Wu</a>, <a href="/search/cs?searchtype=author&query=Yu%2C+X">Xinyang Yu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiangjun Huang</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+J">Jian Yang</a>, <a href="/search/cs?searchtype=author&query=Yan%2C+G">Guangyu Yan</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+Q">Qiang Xu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.15252v1-abstract-short" style="display: inline;"> Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. 
In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appea… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15252v1-abstract-full').style.display = 'inline'; document.getElementById('2411.15252v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.15252v1-abstract-full" style="display: none;"> Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15252v1-abstract-full').style.display = 'none'; document.getElementById('2411.15252v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.15138</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> </div> </div> <p class="title is-5 mathjax"> Material Anything: Generating Materials for Any 3D Object via Diffusion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xin Huang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+T">Tengfei Wang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Z">Ziwei Liu</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Q">Qing Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.15138v1-abstract-short" style="display: inline;"> We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. 
Our approach leverages a pre-trained image diffusion model, e… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15138v1-abstract-full').style.display = 'inline'; document.getElementById('2411.15138v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.15138v1-abstract-full" style="display: none;"> We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15138v1-abstract-full').style.display = 'none'; document.getElementById('2411.15138v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.14679</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Recursive Gaussian Process State Space Model </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zheng%2C+T">Tengjie Zheng</a>, <a href="/search/cs?searchtype=author&query=Cheng%2C+L">Lin Cheng</a>, <a href="/search/cs?searchtype=author&query=Gong%2C+S">Shengping Gong</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xu Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.14679v1-abstract-short" style="display: inline;"> Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. 
However, for online learning, the field lacks an effi… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14679v1-abstract-full').style.display = 'inline'; document.getElementById('2411.14679v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.14679v1-abstract-full" style="display: none;"> Learning dynamical models from data is not only fundamental but also holds great promise for advancing principle discovery, time-series prediction, and controller design. Among various approaches, Gaussian Process State-Space Models (GPSSMs) have recently gained significant attention due to their combination of flexibility and interpretability. However, for online learning, the field lacks an efficient method suitable for scenarios where prior information regarding data distribution and model function is limited. To address this issue, this paper proposes a recursive GPSSM method with adaptive capabilities for both operating domains and Gaussian process (GP) hyperparameters. Specifically, we first utilize first-order linearization to derive a Bayesian update equation for the joint distribution between the system state and the GP model, enabling closed-form and domain-independent learning. Second, an online selection algorithm for inducing points is developed based on informative criteria to achieve lightweight learning. Third, to support online hyperparameter optimization, we recover historical measurement information from the current filtering distribution. Comprehensive evaluations on both synthetic and real-world datasets demonstrate the superior accuracy, computational efficiency, and adaptability of our method compared to state-of-the-art online GPSSM techniques. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14679v1-abstract-full').style.display = 'none'; document.getElementById('2411.14679v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.14466</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1145/3555371 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Learning to Ask: Conversational Product Search via Representation Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zou%2C+J">Jie Zou</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+J+X">Jimmy Xiangji Huang</a>, <a href="/search/cs?searchtype=author&query=Ren%2C+Z">Zhaochun Ren</a>, <a href="/search/cs?searchtype=author&query=Kanoulas%2C+E">Evangelos Kanoulas</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.14466v1-abstract-short" 
style="display: inline;"> Online shopping platforms, such as Amazon and AliExpress, are increasingly prevalent in society, helping customers purchase products conveniently. With recent progress in natural language processing, researchers and practitioners shift their focus from traditional product search to conversational product search. Conversational product search enables user-machine conversations and through them coll… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14466v1-abstract-full').style.display = 'inline'; document.getElementById('2411.14466v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.14466v1-abstract-full" style="display: none;"> Online shopping platforms, such as Amazon and AliExpress, are increasingly prevalent in society, helping customers purchase products conveniently. With recent progress in natural language processing, researchers and practitioners shift their focus from traditional product search to conversational product search. Conversational product search enables user-machine conversations and through them collects explicit user feedback that allows to actively clarify the users' product preferences. Therefore, prospective research on an intelligent shopping assistant via conversations is indispensable. Existing publications on conversational product search either model conversations independently from users, queries, and products or lead to a vocabulary mismatch. In this work, we propose a new conversational product search model, ConvPS, to assist users in locating desirable items. The model is first trained to jointly learn the semantic representations of user, query, item, and conversation via a unified generative framework. After learning these representations, they are integrated to retrieve the target items in the latent semantic space. 
Meanwhile, we propose a set of greedy and explore-exploit strategies to learn to ask the user a sequence of high-performance questions for conversations. Our proposed ConvPS model can naturally integrate the representation learning of the user, query, item, and conversation into a unified generative framework, which provides a promising avenue for constructing accurate and robust conversational product search systems that are flexible and adaptive. Experimental results demonstrate that our ConvPS model significantly outperforms state-of-the-art baselines. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14466v1-abstract-full').style.display = 'none'; document.getElementById('2411.14466v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by ACM TOIS</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.14381</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> ETA-IK: Execution-Time-Aware Inverse Kinematics for Dual-Arm Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Tang%2C+Y">Yucheng Tang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xi Huang</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Y">Yongzhou Zhang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+T">Tao Chen</a>, <a href="/search/cs?searchtype=author&query=Mamaev%2C+I">Ilshat Mamaev</a>, <a href="/search/cs?searchtype=author&query=Hein%2C+B">Björn Hein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.14381v1-abstract-short" style="display: inline;"> This paper presents ETA-IK, a novel Execution-Time-Aware Inverse Kinematics method tailored for dual-arm robotic systems. The primary goal is to optimize motion execution time by leveraging the redundancy of both arms, specifically in tasks where only the relative pose of the robots is constrained, such as dual-arm scanning of unknown objects. 
Unlike traditional inverse kinematics methods that use… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14381v1-abstract-full').style.display = 'inline'; document.getElementById('2411.14381v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.14381v1-abstract-full" style="display: none;"> This paper presents ETA-IK, a novel Execution-Time-Aware Inverse Kinematics method tailored for dual-arm robotic systems. The primary goal is to optimize motion execution time by leveraging the redundancy of both arms, specifically in tasks where only the relative pose of the robots is constrained, such as dual-arm scanning of unknown objects. Unlike traditional inverse kinematics methods that use surrogate metrics such as joint configuration distance, our method incorporates direct motion execution time and implicit collisions into the optimization process, thereby finding target joints that allow subsequent trajectory generation to get more efficient and collision-free motion. A neural network based execution time approximator is employed to predict time-efficient joint configurations while accounting for potential collisions. Through experimental evaluation on a system composed of a UR5 and a KUKA iiwa robot, we demonstrate significant reductions in execution time. The proposed method outperforms conventional approaches, showing improved motion efficiency without sacrificing positioning accuracy. These results highlight the potential of ETA-IK to improve the performance of dual-arm systems in applications, where efficiency and safety are paramount. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14381v1-abstract-full').style.display = 'none'; document.getElementById('2411.14381v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.14166</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Optimization and Control">math.OC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhu%2C+S">Shuchen Zhu</a>, <a href="/search/cs?searchtype=author&query=Kong%2C+B">Boao Kong</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+S">Songtao Lu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xinmeng Huang</a>, <a href="/search/cs?searchtype=author&query=Yuan%2C+K">Kun Yuan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.14166v3-abstract-short" style="display: inline;"> This paper studies decentralized bilevel optimization, in which multiple agents collaborate to solve problems involving nested optimization structures with neighborhood communications. 
Most existing literature primarily utilizes gradient tracking to mitigate the influence of data heterogeneity, without exploring other well-known heterogeneity-correction techniques such as EXTRA or Exact Diffusion.… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14166v3-abstract-full').style.display = 'inline'; document.getElementById('2411.14166v3-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.14166v3-abstract-full" style="display: none;"> This paper studies decentralized bilevel optimization, in which multiple agents collaborate to solve problems involving nested optimization structures with neighborhood communications. Most existing literature primarily utilizes gradient tracking to mitigate the influence of data heterogeneity, without exploring other well-known heterogeneity-correction techniques such as EXTRA or Exact Diffusion. Additionally, these studies often employ identical decentralized strategies for both upper- and lower-level problems, neglecting to leverage distinct mechanisms across different levels. To address these limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the flexibility to incorporate various heterogeneity-correction strategies into the algorithm. Moreover, SPARKLE allows for different strategies to solve upper- and lower-level problems. We present a unified convergence analysis for SPARKLE, applicable to all its variants, with state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. Our results further reveal that EXTRA and Exact Diffusion are more suitable for decentralized bilevel optimization, and using mixed strategies in bilevel algorithms brings more benefits than relying solely on gradient tracking. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14166v3-abstract-full').style.display = 'none'; document.getElementById('2411.14166v3-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">74 pages, the Thirty-Eighth Annual Conference on Neural Information Processing Systems (2024)</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13902</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> PIORS: Personalized Intelligent Outpatient Reception based on Large Language Model with Multi-Agents Medical Scenario Simulation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Bao%2C+Z">Zhijie Bao</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Qingyun Liu</a>, <a href="/search/cs?searchtype=author&query=Guo%2C+Y">Ying Guo</a>, <a href="/search/cs?searchtype=author&query=Ye%2C+Z">Zhengqiang Ye</a>, <a href="/search/cs?searchtype=author&query=Shen%2C+J">Jun Shen</a>, <a href="/search/cs?searchtype=author&query=Xie%2C+S">Shirong Xie</a>, <a href="/search/cs?searchtype=author&query=Peng%2C+J">Jiajie Peng</a>, <a 
href="/search/cs?searchtype=author&query=Huang%2C+X">Xuanjing Huang</a>, <a href="/search/cs?searchtype=author&query=Wei%2C+Z">Zhongyu Wei</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13902v1-abstract-short" style="display: inline;"> In China, receptionist nurses face overwhelming workloads in outpatient settings, limiting their time and attention for each patient and ultimately reducing service quality. In this paper, we present the Personalized Intelligent Outpatient Reception System (PIORS). This system integrates an LLM-based reception nurse and a collaboration between LLM and hospital information system (HIS) into real ou… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13902v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13902v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13902v1-abstract-full" style="display: none;"> In China, receptionist nurses face overwhelming workloads in outpatient settings, limiting their time and attention for each patient and ultimately reducing service quality. In this paper, we present the Personalized Intelligent Outpatient Reception System (PIORS). This system integrates an LLM-based reception nurse and a collaboration between LLM and hospital information system (HIS) into real outpatient reception setting, aiming to deliver personalized, high-quality, and efficient reception services. Additionally, to enhance the performance of LLMs in real-world healthcare scenarios, we propose a medical conversational data generation framework named Service Flow aware Medical Scenario Simulation (SFMSS), aiming to adapt the LLM to the real-world environments and PIORS settings. 
We evaluate the effectiveness of PIORS and SFMSS through automatic and human assessments involving 15 users and 15 clinical experts. The results demonstrate that PIORS-Nurse outperforms all baselines, including the current state-of-the-art model GPT-4o, and aligns with human preferences and clinical needs. Further details and demo can be found at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13902v1-abstract-full').style.display = 'none'; document.getElementById('2411.13902v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13881</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Data Analysis, Statistics and Probability"></span> </div> </div> <p class="title is-5 mathjax"> Exploring applications of topological data analysis in stock index movement prediction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+D">Dazhi Huang</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+P">Pengcheng Xu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaocheng Huang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+J">Jiayi Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13881v1-abstract-short" style="display: inline;"> Topological Data 
Analysis (TDA) has recently gained significant attention in the field of financial prediction. However, the choice of point cloud construction methods, topological feature representations, and classification models has a substantial impact on prediction results. This paper addresses the classification problem of stock index movement. First, we construct point clouds for stock indi… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13881v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13881v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13881v1-abstract-full" style="display: none;"> Topological Data Analysis (TDA) has recently gained significant attention in the field of financial prediction. However, the choice of point cloud construction methods, topological feature representations, and classification models has a substantial impact on prediction results. This paper addresses the classification problem of stock index movement. First, we construct point clouds for stock indices using three different methods. Next, we apply TDA to extract topological structures from the point clouds. Four distinct topological features are computed to represent the patterns in the data, and 15 combinations of these features are enumerated and input into six different machine learning models. We evaluate the predictive performance of various TDA configurations by conducting index movement classification tasks on datasets such as CSI, DAX, HSI and FTSE providing insights into the efficiency of different TDA setups. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13881v1-abstract-full').style.display = 'none'; document.getElementById('2411.13881v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">20 pages, 10 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13291</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> DATAP-SfM: Dynamic-Aware Tracking Any Point for Robust Structure from Motion in the Wild </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Ye%2C+W">Weicai Ye</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+X">Xinyu Chen</a>, <a href="/search/cs?searchtype=author&query=Zhan%2C+R">Ruohao Zhan</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+D">Di Huang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaoshui Huang</a>, <a href="/search/cs?searchtype=author&query=Zhu%2C+H">Haoyi Zhu</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+H">Hujun Bao</a>, <a href="/search/cs?searchtype=author&query=Ouyang%2C+W">Wanli Ouyang</a>, <a href="/search/cs?searchtype=author&query=He%2C+T">Tong He</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+G">Guofeng Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13291v1-abstract-short" style="display: inline;"> This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild. Traditional frameworks, such as ParticleSfM~\cite{zhao2022particlesfm}, address this problem by sequentially computing the optical flow between adjacent frames to obtain point trajectories. They then remove dynamic trajectories through moti… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13291v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13291v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13291v1-abstract-full" style="display: none;"> This paper proposes a concise, elegant, and robust pipeline to estimate smooth camera trajectories and obtain dense point clouds for casual videos in the wild. Traditional frameworks, such as ParticleSfM~\cite{zhao2022particlesfm}, address this problem by sequentially computing the optical flow between adjacent frames to obtain point trajectories. They then remove dynamic trajectories through motion segmentation and perform global bundle adjustment. However, the process of estimating optical flow between two adjacent frames and chaining the matches can introduce cumulative errors. Additionally, motion segmentation combined with single-view depth estimation often faces challenges related to scale ambiguity. To tackle these challenges, we propose a dynamic-aware tracking any point (DATAP) method that leverages consistent video depth and point tracking. Specifically, our DATAP addresses these issues by estimating dense point tracking across the video sequence and predicting the visibility and dynamics of each point. 
By incorporating the consistent video depth prior, the performance of motion segmentation is enhanced. With the integration of DATAP, it becomes possible to estimate and optimize all camera poses simultaneously by performing global bundle adjustment over the point tracks classified as static and visible, rather than relying on incremental camera registration. Extensive experiments on dynamic sequences, e.g., Sintel and TUM RGBD dynamic sequences, and on in-the-wild videos, e.g., DAVIS, demonstrate that the proposed method achieves state-of-the-art performance in terms of camera pose estimation, even in complex and challenging dynamic scenes. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13291v1-abstract-full').style.display = 'none'; document.getElementById('2411.13291v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10831</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Neighboring Slice Noise2Noise: Self-Supervised Medical Image Denoising from Single Noisy Image Volume </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhou%2C+L">Langrui Zhou</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+Z">Ziteng Zhou</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xinyu Huang</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+X">Xiangyu Zhang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+H">Huiru Wang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+G">Guang Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10831v2-abstract-short" style="display: inline;"> In the last few years, with the rapid development of deep learning technologies, supervised methods based on convolutional neural networks have greatly enhanced the performance of medical image denoising. However, these methods require large quantities of noisy-clean image pairs for training, which greatly limits their practicality. 
Although some researchers have attempted to train denoising netwo… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10831v2-abstract-full').style.display = 'inline'; document.getElementById('2411.10831v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10831v2-abstract-full" style="display: none;"> In the last few years, with the rapid development of deep learning technologies, supervised methods based on convolutional neural networks have greatly enhanced the performance of medical image denoising. However, these methods require large quantities of noisy-clean image pairs for training, which greatly limits their practicality. Although some researchers have attempted to train denoising networks using only single noisy images, existing self-supervised methods, including blind-spot-based and data-splitting-based methods, heavily rely on the assumption that noise is pixel-wise independent. However, this assumption often does not hold in real-world medical images. Therefore, in the field of medical imaging, there remains a lack of simple and practical denoising methods that can achieve high-quality denoising performance using only single noisy images. In this paper, we propose a novel self-supervised medical image denoising method, Neighboring Slice Noise2Noise (NS-N2N). The proposed method utilizes neighboring slices within a single noisy image volume to construct weighted training data, and then trains the denoising network using a self-supervised scheme with regional consistency loss and inter-slice continuity loss. NS-N2N only requires a single noisy image volume obtained from one medical imaging procedure to achieve high-quality denoising of the image volume itself. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art self-supervised denoising methods in both denoising performance and processing efficiency. 
Furthermore, since NS-N2N operates solely in the image domain, it is free from device-specific issues such as reconstruction geometry, making it easier to apply in various clinical practices. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10831v2-abstract-full').style.display = 'none'; document.getElementById('2411.10831v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 December, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.09265</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhou%2C+Z">Zheng Zhou</a>, <a href="/search/cs?searchtype=author&query=Feng%2C+W">Wenquan Feng</a>, <a href="/search/cs?searchtype=author&query=Lyu%2C+S">Shuchang Lyu</a>, <a href="/search/cs?searchtype=author&query=Cheng%2C+G">Guangliang Cheng</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+X">Xiaowei Huang</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+Q">Qi Zhao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.09265v1-abstract-short" style="display: inline;"> Dataset Distillation (DD) is an emerging technique 
that compresses large-scale datasets into significantly smaller synthesized datasets while preserving high test performance and enabling the efficient training of large models. However, current research primarily focuses on enhancing evaluation accuracy under limited compression ratios, often overlooking critical security concerns such as adversar… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09265v1-abstract-full').style.display = 'inline'; document.getElementById('2411.09265v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.09265v1-abstract-full" style="display: none;"> Dataset Distillation (DD) is an emerging technique that compresses large-scale datasets into significantly smaller synthesized datasets while preserving high test performance and enabling the efficient training of large models. However, current research primarily focuses on enhancing evaluation accuracy under limited compression ratios, often overlooking critical security concerns such as adversarial robustness. A key challenge in evaluating this robustness lies in the complex interactions between distillation methods, model architectures, and adversarial attack strategies, which complicate standardized assessments. To address this, we introduce BEARD, an open and unified benchmark designed to systematically assess the adversarial robustness of DD methods, including DM, IDM, and BACON. BEARD encompasses a variety of adversarial attacks (e.g., FGSM, PGD, C&W) on distilled datasets like CIFAR-10/100 and TinyImageNet. Utilizing an adversarial game framework, it introduces three key metrics: Robustness Ratio (RR), Attack Efficiency Ratio (AE), and Comprehensive Robustness-Efficiency Index (CREI). Our analysis includes unified benchmarks, various Images Per Class (IPC) settings, and the effects of adversarial training. 
Results are available on the BEARD Leaderboard, along with a library providing model and dataset pools to support reproducible research. Access the code at BEARD. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09265v1-abstract-full').style.display = 'none'; document.getElementById('2411.09265v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">15 pages, 6 figures</span> </p> </li> </ol> <nav class="pagination is-small is-centered breathe-horizontal" role="navigation" aria-label="pagination"> <a href="" class="pagination-previous is-invisible">Previous </a> <a href="/search/?searchtype=author&query=Huang%2C+X&start=50" class="pagination-next" >Next </a> <ul class="pagination-list"> <li> <a 