class="pagination-ellipsis">&hellip;</span></li> </ul> </nav> <ol class="breathe-horizontal" start="1"> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.14401</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yiming Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Z">Zhuokai Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhaorun Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Ding%2C+Z">Zenghui Ding</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+X">Xianjun Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+Y">Yining Sun</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.14401v1-abstract-short" style="display: inline;"> Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. 
In contrast, training-free approaches, though&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14401v1-abstract-full').style.display = 'inline'; document.getElementById('2411.14401v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.14401v1-abstract-full" style="display: none;"> Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates a hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods and setting a new state-of-the-art for zero-shot video understanding. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14401v1-abstract-full').style.display = 'none'; document.getElementById('2411.14401v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13873</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Sli2Vol+: Segmenting 3D Medical Images Based on an Object Estimation Guided Correspondence Flow Network </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=An%2C+D">Delin An</a>, <a href="/search/cs?searchtype=author&amp;query=Gu%2C+P">Pengfei Gu</a>, <a href="/search/cs?searchtype=author&amp;query=Sonka%2C+M">Milan Sonka</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+C">Chaoli Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+D+Z">Danny Z. Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13873v1-abstract-short" style="display: inline;"> Deep learning (DL) methods have shown remarkable successes in medical image segmentation, often using large amounts of annotated data for model training. However, acquiring a large number of diverse labeled 3D medical image datasets is highly difficult and expensive. 
Recently, mask propagation DL methods were developed to reduce the annotation burden on 3D medical images. For example, Sli2Vol&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13873v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13873v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13873v1-abstract-full" style="display: none;"> Deep learning (DL) methods have shown remarkable successes in medical image segmentation, often using large amounts of annotated data for model training. However, acquiring a large number of diverse labeled 3D medical image datasets is highly difficult and expensive. Recently, mask propagation DL methods were developed to reduce the annotation burden on 3D medical images. For example, Sli2Vol proposed a self-supervised framework (SSF) to learn correspondences by matching neighboring slices via slice reconstruction in the training stage; the learned correspondences were then used to propagate a labeled slice to other slices in the test stage. However, these methods are still prone to error accumulation due to the inter-slice propagation of reconstruction errors. Also, they do not handle discontinuities well, which can occur between consecutive slices in 3D images, as they emphasize exploiting object continuity. To address these challenges, in this work, we propose a new SSF, called Sli2Vol+, for segmenting any anatomical structures in 3D medical images using only a single annotated slice per training and testing volume. Specifically, in the training stage, we first propagate an annotated 2D slice of a training volume to the other slices, generating pseudo-labels (PLs). Then, we develop a novel Object Estimation Guided Correspondence Flow Network to learn reliable correspondences between consecutive slices and corresponding PLs in a self-supervised manner. 
In the test stage, such correspondences are utilized to propagate a single annotated slice to the other slices of a test volume. We demonstrate the effectiveness of our method on various medical image segmentation tasks with different datasets, showing better generalizability across different organs, modalities, and modals. Code is available at \url{} <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13873v1-abstract-full').style.display = 'none'; document.getElementById('2411.13873v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13676</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Hymba: A Hybrid-head Architecture for Small Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Dong%2C+X">Xin Dong</a>, <a href="/search/cs?searchtype=author&amp;query=Fu%2C+Y">Yonggan Fu</a>, <a href="/search/cs?searchtype=author&amp;query=Diao%2C+S">Shizhe Diao</a>, <a href="/search/cs?searchtype=author&amp;query=Byeon%2C+W">Wonmin Byeon</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zijia Chen</a>, <a 
href="/search/cs?searchtype=author&amp;query=Mahabaleshwarkar%2C+A+S">Ameya Sunil Mahabaleshwarkar</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+S">Shih-Yang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Van+Keirsbilck%2C+M">Matthijs Van Keirsbilck</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+M">Min-Hung Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Suhara%2C+Y">Yoshi Suhara</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Y">Yingyan Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Kautz%2C+J">Jan Kautz</a>, <a href="/search/cs?searchtype=author&amp;query=Molchanov%2C+P">Pavlo Molchanov</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13676v1-abstract-short" style="display: inline;"> We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing criti&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13676v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13676v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13676v1-abstract-full" style="display: none;"> We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. 
Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the &#34;forced-to-attend&#34; burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13676v1-abstract-full').style.display = 'none'; document.getElementById('2411.13676v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">20 pages, models are available on huggingface</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13154</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> DMQR-RAG: Diverse Multi-Query Rewriting for RAG </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhicong Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Jiahao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+Z">Zhishu Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Mao%2C+H">Hangyu Mao</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhongxia Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+J">Jiazhen Du</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yuanxing Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+F">Fuzheng Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+D">Di Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yong Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13154v1-abstract-short" style="display: inline;"> Large language models often encounter challenges with static knowledge and hallucinations, which undermine their reliability. 
Retrieval-augmented generation (RAG) mitigates these issues by incorporating external information. However, user queries frequently contain noise and intent deviations, necessitating query rewriting to improve the relevance of retrieved documents. In this paper, we introduc&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13154v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13154v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13154v1-abstract-full" style="display: none;"> Large language models often encounter challenges with static knowledge and hallucinations, which undermine their reliability. Retrieval-augmented generation (RAG) mitigates these issues by incorporating external information. However, user queries frequently contain noise and intent deviations, necessitating query rewriting to improve the relevance of retrieved documents. In this paper, we introduce DMQR-RAG, a Diverse Multi-Query Rewriting framework designed to improve the performance of both document retrieval and final responses in RAG. Specifically, we investigate how queries with varying information quantities can retrieve a diverse array of documents, presenting four rewriting strategies that operate at different levels of information to enhance the performance of baseline approaches. Additionally, we propose an adaptive strategy selection method that minimizes the number of rewrites while optimizing overall performance. Our methods have been rigorously validated through extensive experiments conducted in both academic and industry settings. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13154v1-abstract-full').style.display = 'none'; document.getElementById('2411.13154v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13076</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+H">Hao Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+Z">Zhanning Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+M">Maosheng Ye</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhili Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Q">Qifeng Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Cao%2C+T">Tongyi Cao</a>, <a href="/search/cs?searchtype=author&amp;query=Qi%2C+H">Honggang Qi</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13076v1-abstract-short" style="display: inline;"> In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to represent driving-specific scenarios accurately, 
particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize in&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13076v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13076v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13076v1-abstract-full" style="display: none;"> In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to represent driving-specific scenarios accurately, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations and enhancing multimodal reasoning for autonomous driving VQA tasks. Extensive experiments confirm the effectiveness of the HoP framework, showing it significantly outperforms previous state-of-the-art methods across all key metrics. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13076v1-abstract-full').style.display = 'none'; document.getElementById('2411.13076v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12711</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> UBSoft: A Simulation Platform for Robotic Skill Learning in Unbounded Soft Environments </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lin%2C+C">Chunru Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+J">Jugang Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yian Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Z">Zeyuan Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhehuan Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Fang%2C+L">Lixing Fang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+T">Tsun-Hsuan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Xian%2C+Z">Zhou Xian</a>, <a href="/search/cs?searchtype=author&amp;query=Gan%2C+C">Chuang Gan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12711v1-abstract-short" style="display: inline;"> It is desired to equip robots with the capability of interacting with various soft materials as 
they are ubiquitous in the real world. While physics simulations are one of the predominant methods for data collection and robot training, simulating soft materials presents considerable challenges. Specifically, it is significantly more costly than simulating rigid objects in terms of simulation speed&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12711v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12711v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12711v1-abstract-full" style="display: none;"> It is desired to equip robots with the capability of interacting with various soft materials as they are ubiquitous in the real world. While physics simulations are one of the predominant methods for data collection and robot training, simulating soft materials presents considerable challenges. Specifically, it is significantly more costly than simulating rigid objects in terms of simulation speed and storage requirements. These limitations typically restrict the scope of studies on soft materials to small and bounded areas, thereby hindering the learning of skills in broader spaces. To address this issue, we introduce UBSoft, a new simulation platform designed to support unbounded soft environments for robot skill acquisition. Our platform utilizes spatially adaptive resolution scales, where simulation resolution dynamically adjusts based on proximity to active robotic agents. Our framework markedly reduces the demand for extensive storage space and computation costs required for large-scale scenarios involving soft materials. We also establish a set of benchmark tasks in our platform, including both locomotion and manipulation tasks, and conduct experiments to evaluate the efficacy of various reinforcement learning algorithms and trajectory optimization techniques, both gradient-based and sampling-based. 
Preliminary results indicate that sampling-based trajectory optimization generally achieves better results for obtaining one trajectory to solve the task. Additionally, we conduct experiments in real-world environments to demonstrate that advancements made in our UBSoft simulator could translate to improved robot interactions with large-scale soft material. More videos can be found at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12711v1-abstract-full').style.display = 'none'; document.getElementById('2411.12711v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">CoRL 2024. 
The first two authors contributed equally to this paper</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12530</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Contourlet Refinement Gate Framework for Thermal Spectrum Distribution Regularized Infrared Image Super-Resolution </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zou%2C+Y">Yang Zou</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhixin Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Z">Zhipeng Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+X">Xingyuan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+L">Long Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+J">Jinyuan Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+P">Peng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yanning Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12530v1-abstract-short" style="display: inline;"> Image super-resolution (SR) is a classical yet still active low-level vision problem that aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts, serving as a key technique for image enhancement. 
Current approaches to address SR tasks, such as transformer-based and diffusion-based methods, are either dedicated to extracting RGB image features or assuming simila&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12530v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12530v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12530v1-abstract-full" style="display: none;"> Image super-resolution (SR) is a classical yet still active low-level vision problem that aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts, serving as a key technique for image enhancement. Current approaches to address SR tasks, such as transformer-based and diffusion-based methods, are either dedicated to extracting RGB image features or assuming similar degradation patterns, neglecting the inherent modal disparities between infrared and visible images. When directly applied to infrared image SR tasks, these methods inevitably distort the infrared spectral distribution, compromising the machine perception in downstream tasks. In this work, we emphasize the infrared spectral distribution fidelity and propose a Contourlet refinement gate framework to restore infrared modal-specific features while preserving spectral distribution fidelity. Our approach captures high-pass subbands from multi-scale and multi-directional infrared spectral decomposition to recover infrared-degraded information through a gate architecture. The proposed Spectral Fidelity Loss regularizes the spectral frequency distribution during reconstruction, which ensures the preservation of both high- and low-frequency components and maintains the fidelity of infrared-specific features. We propose a two-stage prompt-learning optimization to guide the model in learning infrared HR characteristics from LR degradation. 
Extensive experiments demonstrate that our approach outperforms existing image SR models in both visual and perceptual tasks while notably enhancing machine perception in downstream tasks. Our code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12530v1-abstract-full').style.display = 'none'; document.getElementById('2411.12530v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">13 figures, 6 tables</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">MSC Class:</span> 68T45 <span class="has-text-black-bis has-text-weight-semibold">ACM Class:</span> I.4.3 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12426</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Motif Channel Opened in a White-Box: Stereo Matching via Motif Correlation Graph </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Ziyang Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yongjun Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Wenting Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+B">Bingshu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yong Zhao</a>, 
<a href="/search/cs?searchtype=author&amp;query=Chen%2C+C+L+P">C. L. Philip Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12426v1-abstract-short" style="display: inline;"> Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of d&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12426v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12426v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12426v1-abstract-full" style="display: none;"> Real-world applications of stereo matching, such as autonomous driving, place stringent demands on both safety and accuracy. However, learning-based stereo matching methods inherently suffer from the loss of geometric structures in certain feature channels, creating a bottleneck in achieving precise detail matching. Additionally, these methods lack interpretability due to the black-box nature of deep learning. In this paper, we propose MoCha-V2, a novel learning-based paradigm for stereo matching. MoCha-V2 introduces the Motif Correlation Graph (MCG) to capture recurring textures, which are referred to as &#34;motifs&#34; within feature channels. These motifs reconstruct geometric structures and are learned in a more interpretable way. Subsequently, we integrate features from multiple frequency domains through wavelet inverse transformation. 
The resulting motif features are utilized to restore geometric structures in the stereo matching process. Experimental results demonstrate the effectiveness of MoCha-V2. MoCha-V2 achieved 1st place on the Middlebury benchmark at the time of its release. Code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12426v1-abstract-full').style.display = 'none'; document.getElementById('2411.12426v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12363</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> DGSNA: prompt-based Dynamic Generative Scene-based Noise Addition method </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zihao Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhentao Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Zeng%2C+B">Bi Zeng</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+L">Linyi Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhi Li</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+J">Jia Cai</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12363v1-abstract-short" 
style="display: inline;"> This paper addresses the challenges of accurately enumerating and describing scenes and the labor-intensive process required to replicate acoustic environments using non-generative methods. We introduce the prompt-based Dynamic Generative Scene-based Noise Addition method (DGSNA), which innovatively combines the Dynamic Generation of Scene Information (DGSI) with Scene-based Noise Addition for Au&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12363v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12363v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12363v1-abstract-full" style="display: none;"> This paper addresses the challenges of accurately enumerating and describing scenes and the labor-intensive process required to replicate acoustic environments using non-generative methods. We introduce the prompt-based Dynamic Generative Scene-based Noise Addition method (DGSNA), which innovatively combines the Dynamic Generation of Scene Information (DGSI) with Scene-based Noise Addition for Audio (SNAA). Employing generative chat models structured within the Background-Examples-Task (BET) prompt framework, DGSI component facilitates the dynamic synthesis of tailored Scene Information (SI) for specific acoustic environments. Additionally, the SNAA component leverages Room Impulse Response (RIR) filters and Text-To-Audio (TTA) systems to generate realistic, scene-based noise that can be adapted for both indoor and outdoor environments. Through comprehensive experiments, the adaptability of DGSNA across different generative chat models was demonstrated. 
The results, assessed through both objective and subjective evaluations, show that DGSNA provides robust performance in dynamically generating precise SI and effectively enhancing scene-based noise addition capabilities, thus offering significant improvements over traditional methods in acoustic scene simulation. Our implementation and demos are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12363v1-abstract-full').style.display = 'none'; document.getElementById('2411.12363v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12083</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Human-Computer Interaction">cs.HC</span> </div> </div> <p class="title is-5 mathjax"> Extended-Use Designs on Very Large Online Platforms </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Y">Yixin Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Fu%2C+Y">Yue Fu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zeya Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Radesky%2C+J">Jenny Radesky</a>, <a href="/search/cs?searchtype=author&amp;query=Hiniker%2C+A">Alexis Hiniker</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12083v1-abstract-short" style="display: inline;"> In the attention economy, online platforms are 
incentivized to maximize user engagement through extended-use designs (EUDs), even when such practices conflict with users&#39; best interests. We conducted a structured content analysis of all Very Large Online Platforms (VLOPs) to identify the EUDs these influential apps and sites use. We conducted this analysis posing as a teenager to understand the EU&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12083v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12083v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12083v1-abstract-full" style="display: none;"> In the attention economy, online platforms are incentivized to maximize user engagement through extended-use designs (EUDs), even when such practices conflict with users&#39; best interests. We conducted a structured content analysis of all Very Large Online Platforms (VLOPs) to identify the EUDs these influential apps and sites use. We conducted this analysis posing as a teenager to understand the EUDs that young people are exposed to. We find that VLOPs use four strategies to promote extended use: pressuring, enticing, trapping, and lulling users. We report on a hierarchical taxonomy organizing the 63 designs that fall under these categories. Applying this taxonomy to all 17 VLOPs, we identify 583 instances of EUDs, with social media platforms using twice as many EUDs as other VLOPs. We present three vignettes illustrating how these designs reinforce one another in practice. We further contribute a graphical dataset of videos illustrating these features in the wild. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12083v1-abstract-full').style.display = 'none'; document.getElementById('2411.12083v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">29 pages, 23 figures, open source Github page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.11694</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+J">Jinhao Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhipeng Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Min%2C+Y">Yingqian Min</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jie Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+X">Xiaoxue Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Jiapeng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Y">Yiru Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+H">Haoxiang Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Deng%2C+J">Jia 
Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+W+X">Wayne Xin Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zheng Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+D">Dong Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+J">Jian Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhongyuan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wen%2C+J">Ji-Rong Wen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.11694v1-abstract-short" style="display: inline;"> Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models (LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accura&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11694v1-abstract-full').style.display = 'inline'; document.getElementById('2411.11694v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.11694v1-abstract-full" style="display: none;"> Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models (LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. 
However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. This framework is implemented by integrating the policy model, reward model, and search algorithm. It is primarily constructed around a tree search algorithm, where the policy model navigates a dynamically expanding tree guided by a specially trained reward model. We thoroughly explore various design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets, significantly enhancing the reasoning abilities of LLMs. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11694v1-abstract-full').style.display = 'none'; document.getElementById('2411.11694v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">LLM;Complex Reasoning;Math</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10936</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ou%2C+N">Ni Ou</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhuo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xinru Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Junzheng Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10936v1-abstract-short" style="display: inline;"> Cameras and LiDAR are essential sensors for autonomous vehicles. Camera-LiDAR data fusion compensates for deficiencies of stand-alone sensors but relies on precise extrinsic calibration. Many learning-based calibration methods predict extrinsic parameters in a single step. 
Driven by the growing demand for higher accuracy, a few approaches utilize multi-range models or integrate multiple methods to&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10936v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10936v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10936v1-abstract-full" style="display: none;"> Cameras and LiDAR are essential sensors for autonomous vehicles. Camera-LiDAR data fusion compensates for deficiencies of stand-alone sensors but relies on precise extrinsic calibration. Many learning-based calibration methods predict extrinsic parameters in a single step. Driven by the growing demand for higher accuracy, a few approaches utilize multi-range models or integrate multiple methods to improve extrinsic parameter predictions, but these strategies incur extended training times and require additional storage for separate models. To address these issues, we propose a single-model iterative approach based on surrogate diffusion to significantly enhance the capacity of individual calibration methods. By applying our proposed buffering technique, the inference time of our surrogate diffusion is 43.7% less than that of multi-range models. Additionally, we create a calibration network as our denoiser, featuring both projection-first and encoding-first branches for effective point feature extraction. Extensive experiments demonstrate that our diffusion model outperforms other single-model iterative methods and delivers competitive results compared to multi-range models. Our denoiser exceeds state-of-the-art calibration methods, reducing the rotation error by 24.5% compared to the second-best method. Furthermore, with the proposed diffusion applied, it achieves 20.4% less rotation error and 9.6% less translation error. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10936v1-abstract-full').style.display = 'none'; document.getElementById('2411.10936v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">11 pages, 4 figures, 3 tables</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10912</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Q+Z">Quan Ze Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+K+J+K">K. J. Kevin Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Park%2C+C+Y">Chan Young Park</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+A+X">Amy X. Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10912v1-abstract-short" style="display: inline;"> Alignment of large language models (LLMs) to societal values should account for pluralistic values from diverse groups. 
One technique uses in-context learning for inference-time alignment, but only considers similarity when drawing few-shot examples, not accounting for cross-group differences in value prioritization. We propose SPICA, a framework for pluralistic alignment that accounts for group-l&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10912v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10912v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10912v1-abstract-full" style="display: none;"> Alignment of large language models (LLMs) to societal values should account for pluralistic values from diverse groups. One technique uses in-context learning for inference-time alignment, but only considers similarity when drawing few-shot examples, not accounting for cross-group differences in value prioritization. We propose SPICA, a framework for pluralistic alignment that accounts for group-level differences during in-context example retrieval. SPICA introduces three designs to facilitate pluralistic alignment: scenario banks, group-informed metrics, and in-context alignment prompts. From an evaluation of SPICA on an alignment task collecting inputs from four demographic groups ($n = 544$), our metrics retrieve in-context examples that more closely match observed preferences, with the best prompt configuration using multiple contrastive responses to demonstrate examples. In an end-to-end evaluation ($n = 80$), we observe that SPICA-aligned models are higher rated than a baseline similarity-only retrieval approach, with groups seeing up to a +0.16 point improvement on a 5 point scale. Additionally, gains from SPICA were more uniform, with all groups benefiting from alignment rather than only some. 
Finally, we find that while a group-agnostic approach can effectively align to aggregated values, it is not most suited for aligning to divergent groups. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10912v1-abstract-full').style.display = 'none'; document.getElementById('2411.10912v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10819</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> An Oversampling-enhanced Multi-class Imbalanced Classification Framework for Patient Health Status Prediction Using Patient-reported Outcomes </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yan%2C+Y">Yang Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhong Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+C">Cai Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+X">Xinglei Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Shiao%2C+J">Jay Shiao</a>, <a href="/search/cs?searchtype=author&amp;query=Einck%2C+J">John Einck</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+R+C">Ronald C Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+H">Hao Gao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10819v1-abstract-short" 
style="display: inline;"> Patient-reported outcomes (PROs) directly collected from cancer patients being treated with radiation therapy play a vital role in assisting clinicians in counseling patients regarding likely toxicities. Precise prediction and evaluation of symptoms or health status associated with PROs are fundamental to enhancing decision-making and planning for the required services and support as patients tran&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10819v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10819v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10819v1-abstract-full" style="display: none;"> Patient-reported outcomes (PROs) directly collected from cancer patients being treated with radiation therapy play a vital role in assisting clinicians in counseling patients regarding likely toxicities. Precise prediction and evaluation of symptoms or health status associated with PROs are fundamental to enhancing decision-making and planning for the required services and support as patients transition into survivorship. However, the raw PRO data collected from hospitals exhibits some intrinsic challenges such as incomplete item reports and imbalanced patient toxicities. To this end, in this study, we explore various machine learning techniques to predict patient outcomes related to health status such as pain levels and sleep discomfort using PRO datasets from a cancer photon/proton therapy center. Specifically, we deploy six advanced machine learning classifiers -- Random Forest (RF), XGBoost, Gradient Boosting (GB), Support Vector Machine (SVM), Multi-Layer Perceptron with Bagging (MLP-Bagging), and Logistic Regression (LR) -- to tackle a multi-class imbalanced classification problem across three prevalent cancer types: head and neck, prostate, and breast cancers. 
To address the class imbalance issue, we employ an oversampling strategy, adjusting the training set sample sizes through interpolations of in-class neighboring samples, thereby augmenting minority classes without deviating from the original skewed class distribution. Our experimental findings across multiple PRO datasets indicate that the RF and XGB methods achieve robust generalization performance, evidenced by weighted AUC and detailed confusion matrices, in categorizing outcomes as mild, intermediate, and severe post-radiation therapy. These results underscore the models&#39; effectiveness and potential utility in clinical settings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10819v1-abstract-full').style.display = 'none'; document.getElementById('2411.10819v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">10 pages, 12 figures, 4 tables</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10534</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Human-Computer Interaction">cs.HC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computers and Society">cs.CY</span> </div> </div> <p class="title is-5 mathjax"> Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Konya%2C+A">Andrew Konya</a>, <a href="/search/cs?searchtype=author&amp;query=Ovadya%2C+A">Aviv Ovadya</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+K">Kevin Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Q+Z">Quan Ze Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Schirch%2C+L">Lisa Schirch</a>, <a href="/search/cs?searchtype=author&amp;query=Irwin%2C+C">Colin Irwin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+A+X">Amy X. Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10534v1-abstract-short" style="display: inline;"> We introduce a method to measure the alignment between public will and language model (LM) behavior that can be applied to fine-tuning, online oversight, and pre-release safety checks. 
Our &#39;chain of alignment&#39; (CoA) approach produces a rule-based reward (RBR) by creating model behavior $\textit{rules}$ aligned to normative $\textit{objectives}$ aligned to $\textit{public will}$. This factoring ena&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10534v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10534v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10534v1-abstract-full" style="display: none;"> We introduce a method to measure the alignment between public will and language model (LM) behavior that can be applied to fine-tuning, online oversight, and pre-release safety checks. Our &#39;chain of alignment&#39; (CoA) approach produces a rule-based reward (RBR) by creating model behavior $\textit{rules}$ aligned to normative $\textit{objectives}$ aligned to $\textit{public will}$. This factoring enables a nonexpert public to directly specify their will through the normative objectives, while expert intelligence is used to figure out rules entailing model behavior that best achieves those objectives. We validate our approach by applying it across three different domains of LM prompts related to mental health. We demonstrate a public input process built on collective dialogues and bridging-based ranking that reliably produces normative objectives supported by at least $96\% \pm 2\%$ of the US public. We then show that rules developed by mental health experts to achieve those objectives enable an RBR that evaluates an LM response&#39;s alignment with the objectives similarly to human experts (Pearson&#39;s $r=0.841$, $AUC=0.964$). By measuring alignment with objectives that have near-unanimous public support, these CoA RBRs provide an approximate measure of alignment between LM behavior and public will. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10534v1-abstract-full').style.display = 'none'; document.getElementById('2411.10534v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Pluralistic Alignment Workshop at NeurIPS 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10442</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Weiyun Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhe Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wenhai Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Cao%2C+Y">Yue Cao</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yangzhou Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+Z">Zhangwei Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+J">Jinguo Zhu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+X">Xizhou Zhu</a>, <a 
href="/search/cs?searchtype=author&amp;query=Lu%2C+L">Lewei Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Qiao%2C+Y">Yu Qiao</a>, <a href="/search/cs?searchtype=author&amp;query=Dai%2C+J">Jifeng Dai</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10442v1-abstract-short" style="display: inline;"> Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reaso&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10442v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10442v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10442v1-abstract-full" style="display: none;"> Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset, 
(2) On the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study can inspire further advancements in MLLMs. Code, data, and model will be publicly released. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10442v1-abstract-full').style.display = 'none'; document.getElementById('2411.10442v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10161</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zewen Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Juan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wen Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+S">Sunhan Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+H">Hang Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Zeng%2C+Y">Yun Zeng</a>, <a href="/search/cs?searchtype=author&amp;query=Guo%2C+J">Jian Guo</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Shuxun Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+C">Chunfeng Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+B">Bing Li</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+W">Weiming Hu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10161v1-abstract-short" style="display: inline;"> Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality. 
This paper proposes a novel network, SEAGULL, which can SE&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10161v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10161v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10161v1-abstract-full" style="display: none;"> Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality. This paper proposes a novel network, SEAGULL, which can SEe and Assess ROIs quality with GUidance from a Large vision-Language model. SEAGULL incorporates a vision-language model (VLM), masks generated by Segment Anything Model (SAM) to specify ROIs, and a meticulously designed Mask-based Feature Extractor (MFE) to extract global and local tokens for specified ROIs, enabling accurate fine-grained IQA for ROIs. Moreover, this paper constructs two ROI-based IQA datasets, SEAGULL-100w and SEAGULL-3k, for training and evaluating ROI-based IQA. SEAGULL-100w comprises about 100w synthetic distortion images with 33 million ROIs for pre-training to improve the model&#39;s ability of regional quality perception, and SEAGULL-3k contains about 3k authentic distortion ROIs to enhance the model&#39;s ability to perceive real world distortions. After pre-training on SEAGULL-100w and fine-tuning on SEAGULL-3k, SEAGULL shows remarkable performance on fine-grained ROI quality assessment. 
Code and datasets are publicly available at the <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10161v1-abstract-full').style.display = 'none'; document.getElementById('2411.10161v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10136</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> CoSAM: Self-Correcting SAM for Domain Generalization in 2D Medical Image Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fu%2C+Y">Yihang Fu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Ziyang Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+Y">Yiwen Ye</a>, <a href="/search/cs?searchtype=author&amp;query=Lei%2C+X">Xingliang Lei</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhisong Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Xia%2C+Y">Yong Xia</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10136v1-abstract-short" style="display: inline;"> Medical images often exhibit distribution shifts due to variations in imaging protocols and scanners across different medical centers. 
Domain Generalization (DG) methods aim to train models on source domains that can generalize to unseen target domains. Recently, the segment anything model (SAM) has demonstrated strong generalization capabilities due to its prompt-based design, and has gained sign&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10136v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10136v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10136v1-abstract-full" style="display: none;"> Medical images often exhibit distribution shifts due to variations in imaging protocols and scanners across different medical centers. Domain Generalization (DG) methods aim to train models on source domains that can generalize to unseen target domains. Recently, the segment anything model (SAM) has demonstrated strong generalization capabilities due to its prompt-based design, and has gained significant attention in image segmentation tasks. Existing SAM-based approaches attempt to address the need for manual prompts by introducing prompt generators that automatically generate these prompts. However, we argue that auto-generated prompts may not be sufficiently accurate under distribution shifts, potentially leading to incorrect predictions that still require manual verification and correction by clinicians. To address this challenge, we propose a method for 2D medical image segmentation called Self-Correcting SAM (CoSAM). Our approach begins by generating coarse masks using SAM in a prompt-free manner, providing prior prompts for the subsequent stages, and eliminating the need for prompt generators. To automatically refine these coarse masks, we introduce a generalized error decoder that simulates the correction process typically performed by clinicians. 
Furthermore, we generate diverse prompts as feedback based on the corrected masks, which are used to iteratively refine the predictions within a self-correcting loop, enhancing the generalization performance of our model. Extensive experiments on two medical image segmentation benchmarks across multiple scenarios demonstrate the superiority of CoSAM over state-of-the-art SAM-based methods. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10136v1-abstract-full').style.display = 'none'; document.getElementById('2411.10136v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10060</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+X">Xiaofei Zhu</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+J">Jiawei Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Z">Zhou Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhuo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q">Qingyang Wang</a>, <a 
href="/search/cs?searchtype=author&amp;query=Yao%2C+J">Jianfeng Yao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10060v1-abstract-short" style="display: inline;"> Multimodal emotion recognition in conversation (MER) aims to accurately identify emotions in conversational utterances by integrating multimodal information. Previous methods usually treat multimodal information as being of equal quality and employ symmetric architectures to conduct multimodal fusion. However, in reality, the quality of different modalities usually varies considerably, and utilizing a symm&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10060v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10060v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10060v1-abstract-full" style="display: none;"> Multimodal emotion recognition in conversation (MER) aims to accurately identify emotions in conversational utterances by integrating multimodal information. Previous methods usually treat multimodal information as being of equal quality and employ symmetric architectures to conduct multimodal fusion. However, in reality, the quality of different modalities usually varies considerably, and utilizing a symmetric architecture makes it difficult to accurately recognize conversational emotions when dealing with uneven modal information. Furthermore, fusing multi-modality information at a single granularity may fail to adequately integrate modal information, exacerbating the inaccuracy in emotion recognition. 
In this paper, we propose a novel Cross-Modality Augmented Transformer with Hierarchical Variational Distillation, called CMATH, which consists of two major components, i.e., Multimodal Interaction Fusion and Hierarchical Variational Distillation. The former comprises two submodules, Modality Reconstruction and Cross-Modality Augmented Transformer (CMA-Transformer), where Modality Reconstruction focuses on obtaining a high-quality compressed representation of each modality, and CMA-Transformer adopts an asymmetric fusion strategy which treats one modality as the central modality and takes others as auxiliary modalities. The latter first designs a variational fusion network to fuse the fine-grained representations learned by CMA-Transformer into a coarse-grained representation. Then, it introduces a hierarchical distillation framework to maintain the consistency between modality representations with different granularities. Experiments on the IEMOCAP and MELD datasets demonstrate that our proposed model outperforms previous state-of-the-art baselines. Implementation code is available at cjw-MER/CMATH. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10060v1-abstract-full').style.display = 'none'; document.getElementById('2411.10060v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.09823</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yian Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Qiu%2C+X">Xiaowen Qiu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+J">Jiageng Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhehuan Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+J">Jiting Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yufei Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+T">Tsun-Hsuan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Xian%2C+Z">Zhou Xian</a>, <a href="/search/cs?searchtype=author&amp;query=Gan%2C+C">Chuang Gan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.09823v1-abstract-short" style="display: inline;"> Creating large-scale interactive 3D environments is essential for the development of Robotics and Embodied AI research. 
Current methods, including manual design, procedural generation, diffusion-based scene generation, and large language model (LLM) guided scene design, are hindered by limitations such as excessive human effort, reliance on predefined rules or training datasets, and limited 3D spa&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09823v1-abstract-full').style.display = 'inline'; document.getElementById('2411.09823v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.09823v1-abstract-full" style="display: none;"> Creating large-scale interactive 3D environments is essential for the development of Robotics and Embodied AI research. Current methods, including manual design, procedural generation, diffusion-based scene generation, and large language model (LLM) guided scene design, are hindered by limitations such as excessive human effort, reliance on predefined rules or training datasets, and limited 3D spatial reasoning ability. Since pre-trained 2D image generative models better capture scene and object configuration than LLMs, we address these challenges by introducing Architect, a generative framework that creates complex and realistic 3D embodied environments leveraging diffusion-based 2D image inpainting. In detail, we utilize foundation visual perception models to obtain each generated object from the image and leverage pre-trained depth estimation models to lift the generated 2D image to 3D space. Our pipeline is further extended to a hierarchical and iterative inpainting process to continuously generate placement of large furniture and small objects to enrich the scene. This iterative structure brings the flexibility for our method to generate or refine scenes from various starting points, such as text, floor plans, or pre-arranged environments. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09823v1-abstract-full').style.display = 'none'; document.getElementById('2411.09823v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.08340</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> DyConfidMatch: Dynamic Thresholding and Re-sampling for 3D Semi-supervised Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhimin Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+B">Bing Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.08340v1-abstract-short" style="display: inline;"> Semi-supervised learning (SSL) leverages limited labeled and abundant unlabeled data but often faces challenges with data imbalance, especially in 3D contexts. This study investigates class-level confidence as an indicator of learning status in 3D SSL, proposing a novel method that utilizes dynamic thresholding to better use unlabeled data, particularly from underrepresented classes. 
A re-sampling&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.08340v1-abstract-full').style.display = 'inline'; document.getElementById('2411.08340v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.08340v1-abstract-full" style="display: none;"> Semi-supervised learning (SSL) leverages limited labeled and abundant unlabeled data but often faces challenges with data imbalance, especially in 3D contexts. This study investigates class-level confidence as an indicator of learning status in 3D SSL, proposing a novel method that utilizes dynamic thresholding to better use unlabeled data, particularly from underrepresented classes. A re-sampling strategy is also introduced to mitigate bias towards well-represented classes, ensuring equitable class representation. Through extensive experiments in 3D SSL, our method surpasses state-of-the-art counterparts in classification and detection tasks, highlighting its effectiveness in tackling data imbalance. This approach presents a significant advancement in SSL for 3D datasets, providing a robust solution for data imbalance issues. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.08340v1-abstract-full').style.display = 'none'; document.getElementById('2411.08340v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by Pattern Recognition Journal</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07591</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lu%2C+C">Chenbei Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+L">Laixi Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zaiwei Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+C">Chenye Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Wierman%2C+A">Adam Wierman</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07591v1-abstract-short" style="display: inline;"> Reinforcement Learning (RL) algorithms are known to suffer from the curse of dimensionality, which refers to the fact that large-scale problems often lead to exponentially high sample complexity. A common solution is to use deep neural networks for function approximation; however, such approaches typically lack theoretical guarantees. 
To provably address the curse of dimensionality, we observe tha&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07591v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07591v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07591v1-abstract-full" style="display: none;"> Reinforcement Learning (RL) algorithms are known to suffer from the curse of dimensionality, which refers to the fact that large-scale problems often lead to exponentially high sample complexity. A common solution is to use deep neural networks for function approximation; however, such approaches typically lack theoretical guarantees. To provably address the curse of dimensionality, we observe that many real-world problems exhibit task-specific model structures that, when properly leveraged, can improve the sample efficiency of RL. Building on this insight, we propose overcoming the curse of dimensionality by approximately factorizing the original Markov decision processes (MDPs) into smaller, independently evolving MDPs. This factorization enables the development of sample-efficient RL algorithms in both model-based and model-free settings, with the latter involving a variant of variance-reduced Q-learning. We provide improved sample complexity guarantees for both proposed algorithms. Notably, by leveraging model structure through the approximate factorization of the MDP, the dependence of sample complexity on the size of the state-action space can be exponentially reduced. Numerically, we demonstrate the practicality of our proposed methods through experiments on both synthetic MDP tasks and a wind farm-equipped storage control problem. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07591v1-abstract-full').style.display = 'none'; document.getElementById('2411.07591v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">61 pages, 10 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07569</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> </div> </div> <p class="title is-5 mathjax"> Towards Automated Model Design on Recommender Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+T">Tunhou Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+D">Dehua Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+Y">Yuchen He</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhengxing Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Dai%2C+X">Xiaoliang Dai</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+L">Liang Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yudong Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+F">Feng Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=Cao%2C+Y">Yufan Cao</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+F">Feng Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+H">Hai Li</a>, <a 
href="/search/cs?searchtype=author&amp;query=Chen%2C+Y">Yiran Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wen%2C+W">Wei Wen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07569v1-abstract-short" style="display: inline;"> The increasing popularity of deep learning models has created new opportunities for developing AI-based recommender systems. Designing recommender systems using deep neural networks requires careful architecture design, and further optimization demands extensive co-design efforts on jointly optimizing model architecture and hardware. Design automation, such as Automated Machine Learning (AutoML),&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07569v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07569v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07569v1-abstract-full" style="display: none;"> The increasing popularity of deep learning models has created new opportunities for developing AI-based recommender systems. Designing recommender systems using deep neural networks requires careful architecture design, and further optimization demands extensive co-design efforts on jointly optimizing model architecture and hardware. Design automation, such as Automated Machine Learning (AutoML), is necessary to fully exploit the potential of recommender model design, including model choices and model-hardware co-design strategies. We introduce a novel paradigm that utilizes weight sharing to explore abundant solution spaces. Our paradigm creates a large supernet to search for optimal architectures and co-design strategies to address the challenges of data multi-modality and heterogeneity in the recommendation domain. 
From a model perspective, the supernet includes a variety of operators, dense connectivity, and dimension search options. From a co-design perspective, it encompasses versatile Processing-In-Memory (PIM) configurations to produce hardware-efficient models. Our solution space&#39;s scale, heterogeneity, and complexity pose several challenges, which we address by proposing various techniques for training and evaluating the supernet. Our crafted models show promising results on three Click-Through Rate (CTR) prediction benchmarks, outperforming both manually designed and AutoML-crafted models with state-of-the-art performance when focusing solely on architecture search. From a co-design perspective, we achieve 2x FLOPs efficiency, 1.8x energy efficiency, and 1.5x performance improvements in recommender models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07569v1-abstract-full').style.display = 'none'; document.getElementById('2411.07569v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted in ACM Transactions on Recommender Systems. 
arXiv admin note: substantial text overlap with arXiv:2207.07187</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> ACM Transactions on Recommender Systems (TORS) 2024 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07510</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> </div> <p class="title is-5 mathjax"> An Attack Traffic Identification Method Based on Temporal Spectrum </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xie%2C+W">Wenwei Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Yin%2C+J">Jie Yin</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zihao Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07510v1-abstract-short" style="display: inline;"> To address the issues of insufficient robustness, unstable features, and data noise interference in existing network attack detection and identification models, this paper proposes an attack traffic detection and identification method based on temporal spectrum. 
First, traffic data is segmented by a sliding window to construct a feature sequence and a corresponding label sequence for network traff&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07510v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07510v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07510v1-abstract-full" style="display: none;"> To address the issues of insufficient robustness, unstable features, and data noise interference in existing network attack detection and identification models, this paper proposes an attack traffic detection and identification method based on temporal spectrum. First, traffic data is segmented by a sliding window to construct a feature sequence and a corresponding label sequence for network traffic. Next, the proposed spectral label generation methods, SSPE and COAP, are applied to transform the label sequence into spectral labels and the feature sequence into temporal features. Spectral labels and temporal features are used to capture and represent behavioral patterns of attacks. Finally, the constructed temporal features and spectral labels are used to train models, which subsequently detect and identify network attack behaviors. Experimental results demonstrate that, compared to traditional methods, models trained with the SSPE or COAP method improve identification accuracy by 10% and exhibit strong robustness, particularly in noisy environments. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07510v1-abstract-full').style.display = 'none'; document.getElementById('2411.07510v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">20 pages, 7 figures, 7 tables, 8 formulas</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07228</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computational Engineering, Finance, and Science">cs.CE</span> </div> </div> <p class="title is-5 mathjax"> Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yu%2C+B">Botao Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Baker%2C+F+N">Frazier N. 
Baker</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Ziru Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Herb%2C+G">Garrett Herb</a>, <a href="/search/cs?searchtype=author&amp;query=Gou%2C+B">Boyu Gou</a>, <a href="/search/cs?searchtype=author&amp;query=Adu-Ampratwum%2C+D">Daniel Adu-Ampratwum</a>, <a href="/search/cs?searchtype=author&amp;query=Ning%2C+X">Xia Ning</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+H">Huan Sun</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07228v1-abstract-short" style="display: inline;"> To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent over ChemCrow, and c&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07228v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07228v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07228v1-abstract-full" style="display: none;"> To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. 
Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents&#39; ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07228v1-abstract-full').style.display = 'none'; document.getElementById('2411.07228v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07025</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Scaling Mesh Generation via Compressive Tokenization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Weng%2C+H">Haohan Weng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Z">Zibo Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Lei%2C+B">Biwen Lei</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+X">Xianghui Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+J">Jian Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Lai%2C+Z">Zeqiang Lai</a>, <a 
href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhuo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yuhong Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+J">Jie Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Guo%2C+C">Chunchao Guo</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+T">Tong Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+S">Shenghua Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+C+L+P">C. L. Philip Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07025v1-abstract-short" style="display: inline;"> We propose a compressive yet effective mesh representation, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75\% compared to the original sequences. This compression milestone unlocks the potential to utilize mesh data wit&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07025v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07025v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07025v1-abstract-full" style="display: none;"> We propose a compressive yet effective mesh representation, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75\% compared to the original sequences. 
This compression milestone unlocks the potential to utilize mesh data with significantly more faces, thereby enhancing detail richness and improving generation robustness. Empowered with the BPT, we have built a foundation mesh generative model trained on scaled mesh data to support flexible control for point clouds and images. Our model demonstrates the capability to generate meshes with intricate details and accurate topology, achieving SoTA performance on mesh generation and reaching the level for direct product usage. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07025v1-abstract-full').style.display = 'none'; document.getElementById('2411.07025v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Homepage: , Code:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07019</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> UniHR: Hierarchical Representation Learning for Unified Knowledge Graph Link Prediction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zhiqiang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+M">Mingyang Chen</a>, <a 
href="/search/cs?searchtype=author&amp;query=Hua%2C+Y">Yin Hua</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhuo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Ziqi Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liang%2C+L">Lei Liang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+H">Huajun Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+W">Wen Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07019v1-abstract-short" style="display: inline;"> Beyond-triple fact representations including hyper-relational facts with auxiliary key-value pairs, temporal facts with additional timestamps, and nested facts implying relationships between facts, are gaining significant attention. However, existing link prediction models are usually designed for one specific type of facts, making it difficult to generalize to other fact representations. To overc&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07019v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07019v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07019v1-abstract-full" style="display: none;"> Beyond-triple fact representations including hyper-relational facts with auxiliary key-value pairs, temporal facts with additional timestamps, and nested facts implying relationships between facts, are gaining significant attention. However, existing link prediction models are usually designed for one specific type of facts, making it difficult to generalize to other fact representations. To overcome this limitation, we propose a Unified Hierarchical Representation learning framework (UniHR) for unified knowledge graph link prediction. 
It consists of a unified Hierarchical Data Representation (HiDR) module and a unified Hierarchical Structure Learning (HiSL) module as the graph encoder. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested factual KGs into triple-based representations. Then HiSL incorporates intra-fact and inter-fact message passing, focusing on enhancing the semantic information within individual facts and enriching the structural information between facts. Experimental results across 7 datasets from 3 types of KGs demonstrate that our UniHR outperforms baselines designed for one specific kind of KG, indicating the strong generalization capability of the HiDR form and the effectiveness of the HiSL module. Code and data are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07019v1-abstract-full').style.display = 'none'; document.getElementById('2411.07019v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06908</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> EVQAScore: Efficient Video Question Answering Data Evaluation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liang%2C+H">Hao Liang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zirong Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+W">Wentao Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06908v1-abstract-short" style="display: inline;"> Video question-answering (QA) is a core task in video understanding. Evaluating the quality of video QA and video caption data quality for training video large language models (VideoLLMs) is an essential challenge. Although various methods have been proposed for assessing video caption quality, there remains a lack of dedicated evaluation methods for Video QA. To address this gap, we introduce EVQ&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06908v1-abstract-full').style.display = 'inline'; document.getElementById('2411.06908v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06908v1-abstract-full" style="display: none;"> Video question-answering (QA) is a core task in video understanding. 
Evaluating the quality of video QA and video caption data quality for training video large language models (VideoLLMs) is an essential challenge. Although various methods have been proposed for assessing video caption quality, there remains a lack of dedicated evaluation methods for Video QA. To address this gap, we introduce EVQAScore, a reference-free method that leverages keyword extraction to assess both video caption and video QA data quality. Additionally, we incorporate frame sampling and rescaling techniques to enhance the efficiency and robustness of our evaluation, which enables our score to evaluate the quality of extremely long videos. Our approach achieves state-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for Spearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on the VATEX-EVAL benchmark for video caption evaluation. Furthermore, by using EVQAScore for data selection, we achieved SOTA results with only 12.5\% of the original data volume, outperforming the previous SOTA method PAC-S and 100\% of data. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06908v1-abstract-full').style.display = 'none'; document.getElementById('2411.06908v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06770</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhijie Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Q">Qiaobo Li</a>, <a href="/search/cs?searchtype=author&amp;query=Banerjee%2C+A">Arindam Banerjee</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06770v2-abstract-short" style="display: inline;"> Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential benefits on both fewer communication rounds and less per-round communication. 
In spite of the preliminary empirical success of sketched adaptive methods, existing convergence analyses show the communication cost to hav&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06770v2-abstract-full').style.display = 'inline'; document.getElementById('2411.06770v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06770v2-abstract-full" style="display: none;"> Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential benefits on both fewer communication rounds and less per-round communication. In spite of the preliminary empirical success of sketched adaptive methods, existing convergence analyses show the communication cost to have a linear dependence on the ambient dimension, i.e., number of parameters, which is prohibitively high for modern deep learning models. In this work, we introduce specific sketched adaptive federated learning (SAFL) algorithms and, as our main contribution, provide theoretical convergence analyses in different FL settings with guarantees on communication cost depending only logarithmically (instead of linearly) on the ambient dimension. Unlike existing analyses, we show that the entry-wise sketching noise existent in the preconditioners and the first moments of SAFL can be implicitly addressed by leveraging the recently-popularized anisotropic curvatures in deep learning losses, e.g., fast decaying loss Hessian eigen-values. In the i.i.d. client setting of FL, we show that SAFL achieves asymptotic $O(1/\sqrt{T})$ convergence, and converges faster in the initial epochs. In the non-i.i.d. client setting, where non-adaptive methods lack convergence guarantees, we show that SACFL (SAFL with clipping) algorithms can provably converge in spite of the additional heavy-tailed noise. 
Our theoretical claims are supported by empirical studies on vision and language tasks, and in both fine-tuning and training-from-scratch regimes. Surprisingly, as a by-product of our analysis, the proposed SAFL methods are competitive with the state-of-the-art communication-efficient federated learning algorithms based on error feedback. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06770v2-abstract-full').style.display = 'none'; document.getElementById('2411.06770v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06702</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lim%2C+J+S">Jia Syuen Lim</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+Y">Yadan Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhi Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wei%2C+T">Tianqi Wei</a>, <a href="/search/cs?searchtype=author&amp;query=Chapman%2C+S">Scott Chapman</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+Z">Zi Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06702v1-abstract-short" style="display: inline;"> In the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP) - a weakly supervised ensemble technique for sweet peppers tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models like Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These p&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06702v1-abstract-full').style.display = 'inline'; document.getElementById('2411.06702v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06702v1-abstract-full" style="display: none;"> In the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP) - a weakly supervised ensemble technique for sweet peppers tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models like Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These pseudo-labels, refined when necessary, are used to train a YOLOv8 segmentation network. To enhance detection accuracy under challenging conditions, we incorporate pre-processing techniques such as relighting adjustments and apply depth-based filtering during post-inference. For object tracking, we integrate the Matching by Segment Anything (MASA) adapter with the BoT-SORT algorithm. Our approach achieves a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%, demonstrating effective tracking of sweet peppers without extensive manual effort. 
This work highlights the potential of foundation models for efficient and accurate object detection and tracking in agricultural settings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06702v1-abstract-full').style.display = 'none'; document.getElementById('2411.06702v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06681</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> </div> </div> <p class="title is-5 mathjax"> WDMoE: Wireless Distributed Mixture of Experts for Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xue%2C+N">Nan Xue</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+Y">Yaping Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhiyong Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Tao%2C+M">Meixia Tao</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+X">Xiaodong Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Qian%2C+L">Liang Qian</a>, <a href="/search/cs?searchtype=author&amp;query=Cui%2C+S">Shuguang 
Cui</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+W">Wenjun Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+P">Ping Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06681v1-abstract-short" style="display: inline;"> Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wirel&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06681v1-abstract-full').style.display = 'inline'; document.getElementById('2411.06681v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06681v1-abstract-full" style="display: none;"> Large Language Models (LLMs) have achieved significant success in various natural language processing tasks, but the role of wireless networks in supporting LLMs has not been thoroughly explored. In this paper, we propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks. Specifically, we decompose the MoE layer in LLMs by placing the gating network and the preceding neural network layer at BS, while distributing the expert networks among the devices. This deployment leverages the parallel inference capabilities of expert networks on mobile devices, effectively utilizing the limited computing and caching resources of these devices. 
Accordingly, we develop a performance metric for WDMoE-based LLMs, which accounts for both model capability and latency. To minimize the latency while maintaining accuracy, we jointly optimize expert selection and bandwidth allocation based on the performance metric. Moreover, we build a hardware testbed using NVIDIA Jetson kits to validate the effectiveness of WDMoE. Both theoretical simulations and practical hardware experiments demonstrate that the proposed method can significantly reduce the latency without compromising LLM performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06681v1-abstract-full').style.display = 'none'; document.getElementById('2411.06681v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06558</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhennan Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yajie Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Haofan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhibo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+Z">Zhengkai Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+J">Jun Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q">Qian Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+J">Jian Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Tai%2C+Y">Ying Tai</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06558v2-abstract-short" style="display: inline;"> Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. 
However, previous methods either introduce additional trainable modules, thus only applicable to specific models, or manipulate score maps within cross-attention layers using attention masks, resulting in limited control st&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06558v2-abstract-full').style.display = 'inline'; document.getElementById('2411.06558v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06558v2-abstract-full" style="display: none;"> Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, thus only applicable to specific models, or manipulate score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAG decouples the multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures that each regional prompt is properly executed, and the overall detail refinement (Regional Soft Refinement) over regions, which dismisses the visual boundaries and enhances adjacent interactions. Furthermore, RAG novelly makes repainting feasible, where users can modify specific unsatisfied regions in the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement to the prompt following property. 
Quantitative and qualitative experiments demonstrate that RAG achieves better performance on attribute binding and object relationships than previous tuning-free methods. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06558v2-abstract-full').style.display = 'none'; document.getElementById('2411.06558v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 10 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Code is available at</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06207</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Z">Zhen Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+X">Xinyu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+Y">Yong Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhuo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Mu%2C+F">Feiteng Mu</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+M">Mengting Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+P">Pengjun Xie</a>, <a 
href="/search/cs?searchtype=author&amp;query=Huang%2C+F">Fei Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06207v1-abstract-short" style="display: inline;"> Large Language Models (LLMs) are increasingly recognized for their practical applications. However, these models often encounter challenges in dynamically changing knowledge, as well as in managing unknown static knowledge. Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs. Actually, we find that the impact of RAG on the question answering capab&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06207v1-abstract-full').style.display = 'inline'; document.getElementById('2411.06207v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06207v1-abstract-full" style="display: none;"> Large Language Models (LLMs) are increasingly recognized for their practical applications. However, these models often encounter challenges in dynamically changing knowledge, as well as in managing unknown static knowledge. Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs. Actually, we find that the impact of RAG on the question answering capabilities of LLMs can be categorized into three groups: beneficial, neutral, and harmful. By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs, while also improving the overall performance of LLMs. This insight motivates us to differentiate between types of questions using certain metrics as indicators, to decrease the retrieval ratio without compromising performance. 
In our work, we propose a method that is able to identify different types of questions from this view by training a Knowledge Boundary Model (KBM). Experiments conducted on 11 English and Chinese datasets illustrate that the KBM effectively delineates the knowledge boundary, significantly decreasing the proportion of retrievals required for optimal end-to-end performance. Specifically, we evaluate the effectiveness of KBM in three complex scenarios: dynamic knowledge, long-tail static knowledge, and multi-hop problems, as well as its functionality as an external LLM plug-in. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06207v1-abstract-full').style.display = 'none'; document.getElementById('2411.06207v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06174</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> State Chrono Representation for Enhancing Generalization in Reinforcement Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jianda Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Ng%2C+W+Z+T">Wen Zheng Terence Ng</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zichen Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Pan%2C+S+J">Sinno Jialin Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+T">Tianwei Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06174v1-abstract-short" style="display: inline;"> In reinforcement learning with image-based inputs, it is crucial to establish a robust and generalizable state representation. 
Recent advancements in metric learning, such as deep bisimulation metric approaches, have shown promising results in learning structured low-dimensional representation space from pixel observations, where the distance between states is measured based on task-relevant featu&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06174v1-abstract-full').style.display = 'inline'; document.getElementById('2411.06174v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06174v1-abstract-full" style="display: none;"> In reinforcement learning with image-based inputs, it is crucial to establish a robust and generalizable state representation. Recent advancements in metric learning, such as deep bisimulation metric approaches, have shown promising results in learning structured low-dimensional representation space from pixel observations, where the distance between states is measured based on task-relevant features. However, these approaches face challenges in demanding generalization tasks and scenarios with non-informative rewards. This is because they fail to capture sufficient long-term information in the learned representations. To address these challenges, we propose a novel State Chrono Representation (SCR) approach. SCR augments state metric-based representations by incorporating extensive temporal information into the update step of bisimulation metric learning. It learns state distances within a temporal framework that considers both future dynamics and cumulative rewards over current and long-term future states. Our learning strategy effectively incorporates future behavioral information into the representation space without introducing a significant number of additional parameters for modeling dynamics. 
Extensive experiments conducted in DeepMind Control and Meta-World environments demonstrate that SCR achieves better performance compared to other recent metric-based methods in demanding generalization tasks. The code for SCR is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06174v1-abstract-full').style.display = 'none'; document.getElementById('2411.06174v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> 38th Conference on Neural Information Processing Systems (NeurIPS 2024) </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06173</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ma%2C+W">Weijie Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+J">Jingwei Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zehui Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+H">Hao Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark 
mathjax" id="2411.06173v2-abstract-short" style="display: inline;"> With the attention gained by camera-only 3D object detection in autonomous driving, methods based on Bird-Eye-View (BEV) representation especially derived from the forward view transformation paradigm, i.e., lift-splat-shoot (LSS), have recently seen significant progress. The BEV representation formulated by the frustum based on depth distribution prediction is ideal for learning the road structur&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06173v2-abstract-full').style.display = 'inline'; document.getElementById('2411.06173v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06173v2-abstract-full" style="display: none;"> With the attention gained by camera-only 3D object detection in autonomous driving, methods based on Bird-Eye-View (BEV) representation especially derived from the forward view transformation paradigm, i.e., lift-splat-shoot (LSS), have recently seen significant progress. The BEV representation formulated by the frustum based on depth distribution prediction is ideal for learning the road structure and scene layout from multi-view images. However, to retain computational efficiency, the compressed BEV representation such as in resolution and axis is inevitably weak in retaining the individual geometric details, undermining the methodological generality and applicability. With this in mind, to compensate for the missing details and utilize multi-view geometry constraints, we propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem. The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks. 
Having said that, due to the inherent gap between two representation spaces, we design the instance adaptor for the BEV-to-instance semantic coherence rather than pass the proposal naively. Extensive experiments demonstrated that our proposed framework is of excellent generalization ability and performance, which boosts the performances of modern LSS-based BEV perception methods without bells and whistles and outperforms current LSS-based state-of-the-art works on the large-scale nuScenes benchmark. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06173v2-abstract-full').style.display = 'none'; document.getElementById('2411.06173v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 9 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by 3DV 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05945</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Multiagent Systems">cs.MA</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Y">Yen-Ting Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+C+H">Chao-Han Huck Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhehuai Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zelasko%2C+P">Piotr Zelasko</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+X">Xuesong Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zih-Ching Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Puvvada%2C+K+C">Krishna C Puvvada</a>, <a href="/search/cs?searchtype=author&amp;query=Fu%2C+S">Szu-Wei Fu</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+K">Ke Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Chiu%2C+J+W">Jun Wei Chiu</a>, <a 
href="/search/cs?searchtype=author&amp;query=Balam%2C+J">Jagadeesh Balam</a>, <a href="/search/cs?searchtype=author&amp;query=Ginsburg%2C+B">Boris Ginsburg</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y+F">Yu-Chiang Frank Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05945v1-abstract-short" style="display: inline;"> Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in pa&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05945v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05945v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05945v1-abstract-full" style="display: none;"> Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become an ``expert&#39;&#39; of speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset&#39;s tokens to its mapped expert. 
Experiments on the Open ASR Leaderboard show that we explore a new state-of-the-art performance by achieving an average relative $5.0$% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with $15.5$% to $27.6$% relative WER reduction in the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05945v1-abstract-full').style.display = 'none'; document.getElementById('2411.05945v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeKo work has been done in June 2024. 
NeKo LMs will be open source on under the MIT license</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05879</a> <span>&nbsp;&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Smile upon the Face but Sadness in the Eyes: Emotion Recognition based on Facial Expressions and Eye Behaviors </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yuanyuan Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Wei%2C+L">Lin Wei</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Kejun Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhan%2C+Y">Yibing Zhan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zijing Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhe Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Shan%2C+S">Shiguang Shan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05879v2-abstract-short" style="display: inline;"> Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field heavily relies on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, it is important to note that facial expressions may not always precisely reflect genuine emotions and FER-based results may yield misleading ER. 
To understand and bridge thi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05879v2-abstract-full').style.display = 'inline'; document.getElementById('2411.05879v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05879v2-abstract-full" style="display: none;"> Emotion Recognition (ER) is the process of identifying human emotions from given data. Currently, the field heavily relies on facial expression recognition (FER) because facial expressions contain rich emotional cues. However, it is important to note that facial expressions may not always precisely reflect genuine emotions and FER-based results may yield misleading ER. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue for the creation of a new Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. Different from existing multimodal ER datasets, the EMER dataset employs a stimulus material-induced spontaneous emotion generation method to integrate non-invasive eye behavior data, like eye movements and eye fixation maps, with facial videos, aiming to obtain natural and accurate human emotions. Notably, for the first time, we provide annotations for both ER and FER in the EMER, enabling a comprehensive analysis to better illustrate the gap between both tasks. Furthermore, we specifically design a new EMERT architecture to concurrently enhance performance in both ER and FER by efficiently identifying and bridging the emotion gap between the two. Specifically, our EMERT employs modality-adversarial feature decoupling and a multi-task Transformer to augment the modeling of eye behaviors, thus providing an effective complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. 
The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05879v2-abstract-full').style.display = 'none'; document.getElementById('2411.05879v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 7 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">The paper is part of ongoing work and we request to withdraw it from arXiv to revise it further. 
The paper was also submitted without agreement from all co-authors</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05875</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhuotong Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+F">Fang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+J">Jennifer Zhu</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+W">Wanyu Du</a>, <a href="/search/cs?searchtype=author&amp;query=Qi%2C+Y">Yanjun Qi</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05875v1-abstract-short" style="display: inline;"> Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from unstable preference optimization. 
In this work, we aim to improve the preference optimization pipeline by taking a closer look at preference data generation an&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05875v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05875v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05875v1-abstract-full" style="display: none;"> Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals. However, DPO requires high-quality preference data and suffers from unstable preference optimization. In this work, we aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques. For preference data generation, we demonstrate that existing scoring-based reward models produce unsatisfactory preference data and perform poorly on out-of-distribution tasks. This significantly impacts the LLM alignment performance when using these data for preference tuning. To ensure high-quality preference data generation, we propose an iterative pairwise ranking mechanism that derives preference ranking of completions using pairwise comparison signals. For training regularization, we observe that preference optimization tends to achieve better convergence when the LLM predicted likelihood of preferred samples gets slightly reduced. However, the widely used supervised next-word prediction regularization strictly prevents any likelihood reduction of preferred samples. This observation motivates our design of a budget-controlled regularization formulation. Empirically we show that combining the two designs leads to aligned models that surpass existing SOTA across two popular benchmarks. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05875v1-abstract-full').style.display = 'none'; document.getElementById('2411.05875v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">15 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05508</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zijian Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Pradeep%2C+R">Ronak Pradeep</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+J">Jimmy Lin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05508v2-abstract-short" style="display: inline;"> Recent advances have demonstrated that large language models (LLMs) excel as listwise rerankers, but their high computational demands remain a barrier to widespread adoption. 
Further, the traditional language modeling (LM) objective is not ideally suited for reranking tasks. FIRST is a novel approach that addresses these challenges by integrating a learning-to-rank objective and leveraging the log&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05508v2-abstract-full').style.display = 'inline'; document.getElementById('2411.05508v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05508v2-abstract-full" style="display: none;"> Recent advances have demonstrated that large language models (LLMs) excel as listwise rerankers, but their high computational demands remain a barrier to widespread adoption. Further, the traditional language modeling (LM) objective is not ideally suited for reranking tasks. FIRST is a novel approach that addresses these challenges by integrating a learning-to-rank objective and leveraging the logits of only the first generated token, thereby significantly reducing inference latency compared to traditional LLM rerankers. In this study, we extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains. We investigate the influence of different first-stage retrievers on FIRST rerankers, observing diminishing returns and patterns consistent with traditional LLM rerankers. Through applying the FIRST objective to a broader range of backbone models, we achieve effectiveness surpassing the original implementation. Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality. To better quantify the computational savings in the original study, we measure and compare latency to find a 21%-42% gain across various models and benchmarks. 
Moreover, while LM training implicitly improves zero-shot single-token reranking, our experiments also raise questions about whether LM pre-training may hinder subsequent fine-tuning with the FIRST objective. These findings pave the way for more efficient and effective listwise reranking in future applications. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05508v2-abstract-full').style.display = 'none'; document.getElementById('2411.05508v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 8 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05361</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Huang%2C+C">Chien-yu Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+W">Wei-Chih Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+S">Shu-wen Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+A+T">Andy T. 
Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chen-An Li</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Y">Yu-Xiang Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Tseng%2C+W">Wei-Cheng Tseng</a>, <a href="/search/cs?searchtype=author&amp;query=Diwan%2C+A">Anuj Diwan</a>, <a href="/search/cs?searchtype=author&amp;query=Shih%2C+Y">Yi-Jen Shih</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+J">Jiatong Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+W">William Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+X">Xuanjun Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Hsiao%2C+C">Chi-Yuan Hsiao</a>, <a href="/search/cs?searchtype=author&amp;query=Peng%2C+P">Puyuan Peng</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Shih-Heng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kuan%2C+C">Chun-Yi Kuan</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+K">Ke-Han Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Chang%2C+K">Kai-Wei Chang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+C">Chih-Kai Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Ritter-Gutierrez%2C+F">Fabian Ritter-Gutierrez</a>, <a href="/search/cs?searchtype=author&amp;query=Chuang%2C+M+T">Ming To Chuang</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+K">Kuan-Po Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Arora%2C+S">Siddhant Arora</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Y">You-Kuan Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Yeo%2C+E">Eunjung Yeo</a> , et al. 
(53 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05361v1-abstract-short" style="display: inline;"> Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05361v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05361v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05361v1-abstract-full" style="display: none;"> Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. 
While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that none of the models performed well universally. SALMONN-13B excelled in English ASR, while WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05361v1-abstract-full').style.display = 'none'; document.getElementById('2411.05361v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05317</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Databases">cs.DB</span> </div> </div> <p class="title is-5 mathjax"> SeqRFM: Fast RFM Analysis in Sequence Data </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+Y">Yanxin Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Gan%2C+W">Wensheng Gan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zefeng Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+P">Pinlyu Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Fournier-Viger%2C+P">Philippe Fournier-Viger</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05317v1-abstract-short" style="display: inline;"> In recent years, data mining technologies have been well applied to many domains, including e-commerce. In customer relationship management (CRM), the RFM analysis model is one of the most effective approaches to increase the profits of major enterprises. However, with the rapid development of e-commerce, the diversity and abundance of e-commerce data pose a challenge to mining efficiency. Moreove&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05317v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05317v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05317v1-abstract-full" style="display: none;"> In recent years, data mining technologies have been well applied to many domains, including e-commerce. 
In customer relationship management (CRM), the RFM analysis model is one of the most effective approaches to increase the profits of major enterprises. However, with the rapid development of e-commerce, the diversity and abundance of e-commerce data pose a challenge to mining efficiency. Moreover, in actual market transactions, the chronological order of transactions reflects customer behavior and preferences. To address these challenges, we develop an effective algorithm called SeqRFM, which combines sequential pattern mining with RFM models. SeqRFM considers each customer&#39;s recency (R), frequency (F), and monetary (M) scores to represent the significance of the customer and identifies sequences with high recency, high frequency, and high monetary value. A series of experiments demonstrate the superiority and effectiveness of the SeqRFM algorithm compared to the most advanced RFM algorithms based on sequential pattern mining. The source code and datasets are available at GitHub <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05317v1-abstract-full').style.display = 'none'; document.getElementById('2411.05317v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Preprint. 
5 figures, 5 tables</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.04928</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> </div> </div> <p class="title is-5 mathjax"> DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Sun%2C+W">Wenqiang Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+S">Shuo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+F">Fangfu Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zilong Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Duan%2C+Y">Yueqi Duan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jun Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yikai Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.04928v1-abstract-short" style="display: inline;"> In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. 
While recent video diffusion models have shown re&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04928v1-abstract-full').style.display = 'inline'; document.getElementById('2411.04928v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.04928v1-abstract-full" style="display: none;"> In this paper, we introduce DimensionX, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames with the combination of spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04928v1-abstract-full').style.display = 'none'; document.getElementById('2411.04928v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project Page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.04907</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Enhancing Missing Data Imputation through Combined Bipartite Graph and Complete Directed Graph </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Z">Zhaoyang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+H">Hongtu Zhu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Ziqi Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yingjie Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Shu%2C+H">Hai Shu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.04907v1-abstract-short" style="display: inline;"> In this paper, we aim to address a significant challenge in the field of missing data imputation: identifying and leveraging the interdependencies among features to enhance missing data imputation for tabular 
data. We introduce a novel framework named the Bipartite and Complete Directed Graph Neural Network (BCGNN). Within BCGNN, observations and features are differentiated as two distinct node ty&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04907v1-abstract-full').style.display = 'inline'; document.getElementById('2411.04907v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.04907v1-abstract-full" style="display: none;"> In this paper, we aim to address a significant challenge in the field of missing data imputation: identifying and leveraging the interdependencies among features to enhance missing data imputation for tabular data. We introduce a novel framework named the Bipartite and Complete Directed Graph Neural Network (BCGNN). Within BCGNN, observations and features are differentiated as two distinct node types, and the values of observed features are converted into attributed edges linking them. The bipartite segment of our framework inductively learns embedding representations for nodes, efficiently utilizing the comprehensive information encapsulated in the attributed edges. In parallel, the complete directed graph segment adeptly outlines and communicates the complex interdependencies among features. When compared to contemporary leading imputation methodologies, BCGNN consistently outperforms them, achieving a noteworthy average reduction of 15% in mean absolute error for feature imputation tasks under different missing mechanisms. Our extensive experimental investigation confirms that an in-depth grasp of the interdependence structure substantially enhances the model&#39;s feature embedding ability. We also highlight the model&#39;s superior performance in label prediction tasks involving missing data, and its formidable ability to generalize to unseen data points. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04907v1-abstract-full').style.display = 'none'; document.getElementById('2411.04907v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.04899</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Sampling-guided Heterogeneous Graph Neural Network with Temporal Smoothing for Scalable Longitudinal Data Imputation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Z">Zhaoyang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Ziqi Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Q">Qiao Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+J">Jinhan Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+H">Hongtu Zhu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.04899v1-abstract-short" style="display: inline;"> In this paper, we propose a novel framework, the Sampling-guided Heterogeneous Graph Neural Network (SHT-GNN), to effectively tackle the challenge of missing data imputation in longitudinal studies. 
Unlike traditional methods, which often require extensive preprocessing to handle irregular or inconsistent missing data, our approach accommodates arbitrary missing data patterns while maintaining com&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04899v1-abstract-full').style.display = 'inline'; document.getElementById('2411.04899v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.04899v1-abstract-full" style="display: none;"> In this paper, we propose a novel framework, the Sampling-guided Heterogeneous Graph Neural Network (SHT-GNN), to effectively tackle the challenge of missing data imputation in longitudinal studies. Unlike traditional methods, which often require extensive preprocessing to handle irregular or inconsistent missing data, our approach accommodates arbitrary missing data patterns while maintaining computational efficiency. SHT-GNN models both observations and covariates as distinct node types, connecting observation nodes at successive time points through subject-specific longitudinal subnetworks, while covariate-observation interactions are represented by attributed edges within bipartite graphs. By leveraging subject-wise mini-batch sampling and a multi-layer temporal smoothing mechanism, SHT-GNN efficiently scales to large datasets, while effectively learning node representations and imputing missing data. Extensive experiments on both synthetic and real-world datasets, including the Alzheimer&#39;s Disease Neuroimaging Initiative (ADNI) dataset, demonstrate that SHT-GNN significantly outperforms existing imputation methods, even with high missing data rates. The empirical results highlight SHT-GNN&#39;s robust imputation capabilities and superior performance, particularly in the context of complex, large-scale longitudinal data. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04899v1-abstract-full').style.display = 'none'; document.getElementById('2411.04899v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.04097</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Varma%2C+M">Maya Varma</a>, <a href="/search/cs?searchtype=author&amp;query=Delbrouck%2C+J">Jean-Benoit Delbrouck</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhihong Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chaudhari%2C+A">Akshay Chaudhari</a>, <a href="/search/cs?searchtype=author&amp;query=Langlotz%2C+C">Curtis Langlotz</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.04097v1-abstract-short" style="display: inline;"> Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. 
Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04097v1-abstract-full').style.display = 'inline'; document.getElementById('2411.04097v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.04097v1-abstract-full" style="display: none;"> Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. 
Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.04097v1-abstract-full').style.display = 'none'; document.getElementById('2411.04097v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.03776</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Discrete Mathematics">cs.DM</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> </div> </div> <p class="title is-5 mathjax"> Reconstruction of multiple strings of constant weight from prefix-suffix compositions </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoyu Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zitan Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.03776v1-abstract-short" style="display: inline;"> Motivated by studies of data retrieval in polymer-based storage systems, we consider the problem of reconstructing a multiset of binary strings that have the same length and the same weight from the compositions of their prefixes 
and suffixes of every possible length. We provide necessary and sufficient conditions for which unique reconstruction up to reversal of the strings is possible. Additiona&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03776v1-abstract-full').style.display = 'inline'; document.getElementById('2411.03776v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.03776v1-abstract-full" style="display: none;"> Motivated by studies of data retrieval in polymer-based storage systems, we consider the problem of reconstructing a multiset of binary strings that have the same length and the same weight from the compositions of their prefixes and suffixes of every possible length. We provide necessary and sufficient conditions for which unique reconstruction up to reversal of the strings is possible. Additionally, we present two algorithms for reconstructing strings from the compositions of prefixes and suffixes of constant-length constant-weight strings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03776v1-abstract-full').style.display = 'none'; document.getElementById('2411.03776v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.03743</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Quantitative Methods">q-bio.QM</span> </div> </div> <p class="title is-5 mathjax"> Automating Exploratory Proteomics Research via Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ding%2C+N">Ning Ding</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+S">Shang Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+L">Linhai Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yifei Li</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zaoqu Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+K">Kaiyan Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+Y">Yibai Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Zuo%2C+Y">Yuxin Zuo</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhangren Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Hua%2C+E">Ermo Hua</a>, <a href="/search/cs?searchtype=author&amp;query=Lv%2C+X">Xingtai Lv</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+Y">Youbang Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+D">Dong Li</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+F">Fuchu He</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+B">Bowen Zhou</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2411.03743v1-abstract-short" style="display: inline;"> With the development of artificial intelligence, its contribution to science is evolving from simulating a complex problem to automating entire research processes and producing novel discoveries. Achieving this advancement requires both specialized general models grounded in real-world scientific data and iterative, exploratory frameworks that mirror human scientific methodologies. In this paper,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03743v1-abstract-full').style.display = 'inline'; document.getElementById('2411.03743v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.03743v1-abstract-full" style="display: none;"> With the development of artificial intelligence, its contribution to science is evolving from simulating a complex problem to automating entire research processes and producing novel discoveries. Achieving this advancement requires both specialized general models grounded in real-world scientific data and iterative, exploratory frameworks that mirror human scientific methodologies. In this paper, we present PROTEUS, a fully automated system for scientific discovery from raw proteomics data. PROTEUS uses large language models (LLMs) to perform hierarchical planning, execute specialized bioinformatics tools, and iteratively refine analysis workflows to generate high-quality scientific hypotheses. The system takes proteomics datasets as input and produces a comprehensive set of research objectives, analysis results, and novel biological hypotheses without human intervention. We evaluated PROTEUS on 12 proteomics datasets collected from various biological samples (e.g. immune cells, tumors) and different sample types (single-cell and bulk), generating 191 scientific hypotheses. 
These were assessed using both automatic LLM-based scoring on 5 metrics and detailed reviews from human experts. Results demonstrate that PROTEUS consistently produces reliable, logically coherent results that align well with existing literature while also proposing novel, evaluable hypotheses. The system&#39;s flexible architecture facilitates seamless integration of diverse analysis tools and adaptation to different proteomics data types. By automating complex proteomics analysis workflows and hypothesis generation, PROTEUS has the potential to considerably accelerate the pace of scientific discovery in proteomics research, enabling researchers to efficiently explore large-scale datasets and uncover biological insights. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03743v1-abstract-full').style.display = 'none'; document.getElementById('2411.03743v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.03670</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Bassi%2C+P+R+A+S">Pedro R. A. S. 
Bassi</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Wenxuan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Y">Yucheng Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Isensee%2C+F">Fabian Isensee</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zifu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jieneng Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chou%2C+Y">Yu-Cheng Chou</a>, <a href="/search/cs?searchtype=author&amp;query=Kirchhoff%2C+Y">Yannick Kirchhoff</a>, <a href="/search/cs?searchtype=author&amp;query=Rokuss%2C+M">Maximilian Rokuss</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+Z">Ziyan Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+J">Jin Ye</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+J">Junjun He</a>, <a href="/search/cs?searchtype=author&amp;query=Wald%2C+T">Tassilo Wald</a>, <a href="/search/cs?searchtype=author&amp;query=Ulrich%2C+C">Constantin Ulrich</a>, <a href="/search/cs?searchtype=author&amp;query=Baumgartner%2C+M">Michael Baumgartner</a>, <a href="/search/cs?searchtype=author&amp;query=Roy%2C+S">Saikat Roy</a>, <a href="/search/cs?searchtype=author&amp;query=Maier-Hein%2C+K+H">Klaus H. Maier-Hein</a>, <a href="/search/cs?searchtype=author&amp;query=Jaeger%2C+P">Paul Jaeger</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+Y">Yiwen Ye</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+Y">Yutong Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jianpeng Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Ziyang Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Xia%2C+Y">Yong Xia</a>, <a href="/search/cs?searchtype=author&amp;query=Xing%2C+Z">Zhaohu Xing</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+L">Lei Zhu</a> , et al. 
(28 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.03670v1-abstract-short" style="display: inline;"> How can we test AI performance? This question seems trivial, but it isn&#39;t. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03670v1-abstract-full').style.display = 'inline'; document.getElementById('2411.03670v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.03670v1-abstract-full" style="display: none;"> How can we test AI performance? This question seems trivial, but it isn&#39;t. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. 
In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03670v1-abstract-full').style.display = 'none'; document.getElementById('2411.03670v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to NeurIPS-2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.03497</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Uncertainty Quantification for Clinical Outcome Predictions with (Large) Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zizhang Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+P">Peizhao Li</a>, <a href="/search/cs?searchtype=author&amp;query=Dong%2C+X">Xiaomeng Dong</a>, <a href="/search/cs?searchtype=author&amp;query=Hong%2C+P">Pengyu Hong</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short 
has-text-grey-dark mathjax" id="2411.03497v1-abstract-short" style="display: inline;"> To facilitate healthcare delivery, language models (LMs) have significant potential for clinical prediction tasks using electronic health records (EHRs). However, in these high-stakes applications, unreliable decisions can result in high costs due to compromised patient safety and ethical concerns, thus increasing the need for good uncertainty modeling of automated clinical predictions. To address&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03497v1-abstract-full').style.display = 'inline'; document.getElementById('2411.03497v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.03497v1-abstract-full" style="display: none;"> To facilitate healthcare delivery, language models (LMs) have significant potential for clinical prediction tasks using electronic health records (EHRs). However, in these high-stakes applications, unreliable decisions can result in high costs due to compromised patient safety and ethical concerns, thus increasing the need for good uncertainty modeling of automated clinical predictions. To address this, we consider the uncertainty quantification of LMs for EHR tasks in white- and black-box settings. We first quantify uncertainty in white-box models, where we can access model parameters and output logits. We show that an effective reduction of model uncertainty can be achieved by using the proposed multi-tasking and ensemble methods in EHRs. Continuing with this idea, we extend our approach to black-box settings, including popular proprietary LMs such as GPT-4. We validate our framework using longitudinal clinical data from more than 6,000 patients in ten clinical prediction tasks. Results show that ensembling methods and multi-task prediction prompts reduce uncertainty across different scenarios. 
These findings increase the transparency of the model in white-box and black-box settings, thus advancing reliable AI healthcare. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03497v1-abstract-full').style.display = 'none'; document.getElementById('2411.03497v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.03413</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Data Structures and Algorithms">cs.DS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Probability">math.PR</span> </div> </div> <p class="title is-5 mathjax"> Rapid Mixing at the Uniqueness Threshold </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+X">Xiaoyu Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zongchen Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Yin%2C+Y">Yitong Yin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xinyuan Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.03413v1-abstract-short" style="display: inline;"> Over the past decades, a fascinating computational phase transition has been identified in sampling from Gibbs distributions. 
Though, the computational complexity at the critical point remains poorly understood, as previous algorithmic and hardness results all required a constant slack from this threshold. In this paper, we resolve this open question at the critical phase transition threshold, t&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03413v1-abstract-full').style.display = 'inline'; document.getElementById('2411.03413v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.03413v1-abstract-full" style="display: none;"> Over the past decades, a fascinating computational phase transition has been identified in sampling from Gibbs distributions. Though, the computational complexity at the critical point remains poorly understood, as previous algorithmic and hardness results all required a constant slack from this threshold. In this paper, we resolve this open question at the critical phase transition threshold, thus completing the picture of the computational phase transition. We show that for the hardcore model on graphs with maximum degree $\Delta\ge 3$ at the uniqueness threshold $\lambda = \lambda_c(\Delta)$, the mixing time of Glauber dynamics is upper bounded by a polynomial in $n$, but is not nearly linear in the worst case. For the Ising model (either antiferromagnetic or ferromagnetic), we establish similar results. For the Ising model on graphs with maximum degree $\Delta\ge 3$ at the critical temperature $\beta$ where $|\beta| = \beta_c(\Delta)$, with the tree-uniqueness threshold $\beta_c(\Delta)$, we show that the mixing time of Glauber dynamics is upper bounded by $\tilde{O}\left(n^{2 + O(1/\Delta)}\right)$ and lower bounded by $\Omega\left(n^{3/2}\right)$ in the worst case. 
For the Ising model specified by a critical interaction matrix $J$ with $\left \lVert J \right \rVert_2=1$, we obtain an upper bound $\tilde{O}(n^{3/2})$ for the mixing time, matching the lower bound $\Omega\left(n^{3/2}\right)$ on the complete graph up to a logarithmic factor. Our mixing time upper bounds are derived from a new interpretation and analysis of the localization scheme method introduced by Chen and Eldan (2022), applied to the field dynamics for the hardcore model and the proximal sampler for the Ising model. As key steps in both our upper and lower bounds, we establish sub-linear upper and lower bounds for spectral independence at the critical point for worst-case instances. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03413v1-abstract-full').style.display = 'none'; document.getElementById('2411.03413v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> </ol> <nav class="pagination is-small is-centered breathe-horizontal" role="navigation" aria-label="pagination"> <a href="" class="pagination-previous is-invisible">Previous </a> <a href="/search/?searchtype=author&amp;query=Chen%2C+Z&amp;start=50" class="pagination-next" >Next </a> <ul class="pagination-list"> <li> <a 