arXiv:2411.18531
Statistic Maximal Leakage
Authors: Shuaiqi Wang, Zinan Lin, Giulia Fanti
Submitted 27 November, 2024; originally announced November 2024.
We introduce a privacy measure called statistic maximal leakage that quantifies how much a privacy mechanism leaks about a specific secret, relative to the adversary's prior information about that secret. Statistic maximal leakage is an extension of the well-known maximal leakage. Unlike maximal leakage, which protects an arbitrary, unknown secret, statistic maximal leakage protects a single, known secret. We show that statistic maximal leakage satisfies composition and post-processing properties. Additionally, we show how to efficiently compute it in the special case of deterministic data release mechanisms. We analyze two important mechanisms under statistic maximal leakage: the quantization mechanism and randomized response.
arXiv:2411.17864
Generative Image Layer Decomposition with Visual Effects
Authors: Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou
Submitted 26 November, 2024; originally announced November 2024.
Comments: The project page:
Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose an image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose $\textbf{LayerDecomp}$, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss which forces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing.
arXiv:2411.17761
OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection
Authors: Zhongyu Xia, Jishuo Li, Zhiwei Lin, Xinhao Wang, Yongtao Wang, Ming-Hsuan Yang
Submitted 25 November, 2024; originally announced November 2024.
Open-world autonomous driving encompasses domain generalization and open-vocabulary. Domain generalization refers to the capabilities of autonomous driving systems across different scenarios and sensor parameter configurations. Open vocabulary pertains to the ability to recognize various semantic categories not encountered during training. In this paper, we introduce OpenAD, the first real-world open-world autonomous driving benchmark for 3D object detection. OpenAD is built on a corner case discovery and annotation pipeline integrating with a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various 2D and 3D open-world and specialized models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark.
arXiv:2411.17454
FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval
Authors: Jingyou Xie, Jiayi Kuang, Zhenzhou Lin, Jiarui Ouyang, Zishuo Zhao, Ying Shen
Submitted 26 November, 2024; originally announced November 2024.
Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality with the target domain including classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods like CLIP have shown great few-shot or zero-shot learning performance. However, they still suffer challenges due to (1) the feature degradation encountered in the target domain and (2) the extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In multimodal feature generation, we propose a composite multimodal VAE-GAN network to capture real feature distribution patterns and generate pseudo samples based on CLIP features, addressing data imbalance. For common space projection, we develop a gate residual network to fuse CLIP features with projected features, reducing feature degradation in X-shot scenarios.
arXiv:2411.17217
Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning
Authors: Hui-Yue Yang, Hui Chen, Ao Wang, Kai Chen, Zijia Lin, Yongliang Tang, Pengcheng Gao, Yuming Quan, Jungong Han, Guiguang Ding
Submitted 27 November, 2024; v1 submitted 26 November, 2024; originally announced November 2024.
Segment Anything Model (SAM) has made great progress in anomaly segmentation tasks due to its impressive generalization ability. However, existing methods that directly apply SAM through prompting often overlook the domain shift issue, where SAM performs well on natural images but struggles in industrial scenarios. Parameter-Efficient Fine-Tuning (PEFT) offers a promising solution, but it may yield suboptimal performance by not adequately addressing the perception challenges during adaptation to anomaly images. In this paper, we propose a novel Self-Perception Tuning (SPT) method, aiming to enhance SAM's perception capability for anomaly segmentation. The SPT method incorporates a self-drafting tuning strategy, which generates an initial coarse draft of the anomaly mask, followed by a refinement process. Additionally, a visual-relation-aware adapter is introduced to improve the perception of discriminative relational information for mask generation. Extensive experimental results on several benchmark datasets demonstrate that our SPT method can significantly outperform baseline methods, validating its effectiveness.
arXiv:2411.16034
VisualLens: Personalization through Visual History
Authors: Wang Bill Zhu, Deqing Fu, Kai Sun, Yi Lu, Zhaojiang Lin, Seungwhan Moon, Kanika Narang, Mustafa Canim, Yue Liu, Anuj Kumar, Xin Luna Dong
Submitted 24 November, 2024; originally announced November 2024.
We hypothesize that a user's visual history with images reflecting their daily life, offers valuable insights into their interests and preferences, and can be leveraged for personalization. Among the many challenges to achieve this goal, the foremost is the diversity and noises in the visual history, containing images not necessarily related to a recommendation task, not necessarily reflecting the user's interest, or even not necessarily preference-relevant. Existing recommendation systems either rely on task-specific user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. We propose a novel approach, VisualLens, that extracts, filters, and refines image representations, and leverages these signals for personalization. We created two new benchmarks with task-agnostic visual histories, and show that our method improves over state-of-the-art recommendations by 5-10% on Hit@3, and improves over GPT-4o by 2-5%.
arXiv:2411.14847
Dynamics-Aware Gaussian Splatting Streaming Towards Fast On-the-Fly Training for 4D Reconstruction
Authors: Zhening Liu, Yingdong Hu, Xinjie Zhang, Jiawei Shao, Zehong Lin, Jun Zhang
Submitted 22 November, 2024; originally announced November 2024.
Comments: Project page:
The recent development of 3D Gaussian Splatting (3DGS) has led to great interest in 4D dynamic spatial reconstruction from multi-view visual inputs. While existing approaches mainly rely on processing full-length multi-view videos for 4D reconstruction, there has been limited exploration of iterative online reconstruction methods that enable on-the-fly training and per-frame streaming. Current 3DGS-based streaming methods treat the Gaussian primitives uniformly and constantly renew the densified Gaussians, thereby overlooking the difference between dynamic and static features and also neglecting the temporal continuity in the scene. To address these limitations, we propose a novel three-stage pipeline for iterative streamable 4D dynamic spatial reconstruction. Our pipeline comprises a selective inheritance stage to preserve temporal continuity, a dynamics-aware shift stage for distinguishing dynamic and static primitives and optimizing their movements, and an error-guided densification stage to accommodate emerging objects. Our method achieves state-of-the-art performance in online 4D reconstruction, demonstrating a 20% improvement in on-the-fly training speed, superior representation quality, and real-time rendering capability.
arXiv:2411.14384
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation
Authors: Yuanhao Cai, He Zhang, Kai Zhang, Yixun Liang, Mengwei Ren, Fujun Luan, Qing Liu, Soo Ye Kim, Jianming Zhang, Zhifei Zhang, Yuqian Zhou, Zhe Lin, Alan Yuille
Submitted 25 November, 2024; v1 submitted 21 November, 2024; originally announced November 2024.
Comments: A novel one-stage 3DGS-based diffusion generates objects and scenes from a single view in ~6 seconds
Existing feed-forward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric prompt images. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object and scene generation from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any directions, beyond object-centric inputs. Plus, to improve the capability and generalization ability of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that our method enjoys better generation quality (2.20 dB higher in PSNR and 23.25 lower in FID) and over 5x faster speed (~6s on an A100 GPU) than SOTA methods. The user study and text-to-3D applications also reveal the practical values of our method.
arXiv:2411.12363
DGSNA: prompt-based Dynamic Generative Scene-based Noise Addition method
Authors: Zihao Chen, Zhentao Lin, Bi Zeng, Linyi Huang, Zhi Li, Jia Cai
Submitted 19 November, 2024; originally announced November 2024.
This paper addresses the challenges of accurately enumerating and describing scenes and the labor-intensive process required to replicate acoustic environments using non-generative methods. We introduce the prompt-based Dynamic Generative Scene-based Noise Addition method (DGSNA), which innovatively combines the Dynamic Generation of Scene Information (DGSI) with Scene-based Noise Addition for Audio (SNAA). Employing generative chat models structured within the Background-Examples-Task (BET) prompt framework, the DGSI component facilitates the dynamic synthesis of tailored Scene Information (SI) for specific acoustic environments. Additionally, the SNAA component leverages Room Impulse Response (RIR) filters and Text-To-Audio (TTA) systems to generate realistic, scene-based noise that can be adapted for both indoor and outdoor environments. Through comprehensive experiments, the adaptability of DGSNA across different generative chat models was demonstrated. The results, assessed through both objective and subjective evaluations, show that DGSNA provides robust performance in dynamically generating precise SI and effectively enhancing scene-based noise addition capabilities, thus offering significant improvements over traditional methods in acoustic scene simulation.
arXiv:2411.11576
Hybrid Data-Driven SSM for Interpretable and Label-Free mmWave Channel Prediction
Authors: Yiyong Sun, Jiajun He, Zhidi Lin, Wenqiang Pu, Feng Yin, Hing Cheung So
Submitted 18 November, 2024; originally announced November 2024.
Accurate prediction of mmWave time-varying channels is essential for mitigating the issue of channel aging in complex scenarios owing to high user mobility. Existing channel prediction methods have limitations: classical model-based methods often struggle to track highly nonlinear channel dynamics due to limited expert knowledge, while emerging data-driven methods typically require substantial labeled data for effective training and often lack interpretability. To address these issues, this paper proposes a novel hybrid method that integrates a data-driven neural network into a conventional model-based workflow based on a state-space model (SSM), implicitly tracking complex channel dynamics from data without requiring precise expert knowledge. Additionally, a novel unsupervised learning strategy is developed to train the embedded neural network solely with unlabeled data. Theoretical analyses and ablation studies are conducted to interpret the enhanced benefits gained from the hybrid integration. Numerical simulations based on the 3GPP mmWave channel model corroborate the superior prediction accuracy of the proposed method, compared to state-of-the-art methods that are either purely model-based or data-driven.
arXiv:2411.10169
Definition and Detection of Centralization Defects in Smart Contracts
Authors: Zewei Lin, Jiachi Chen, Jiajing Wu, Weizhe Zhang, Zibin Zheng
Submitted 15 November, 2024; originally announced November 2024.
In recent years, security incidents stemming from centralization defects in smart contracts have led to substantial financial losses. A centralization defect refers to any error, flaw, or fault in a smart contract's design or development stage that introduces a single point of failure. Such defects allow a specific account or user to disrupt the normal operations of smart contracts, potentially causing malfunctions or even complete project shutdowns. Despite the significance of this issue, most current smart contract analyses overlook centralization defects, focusing primarily on other types of defects. To address this gap, our paper introduces six types of centralization defects in smart contracts by manually analyzing 597 Stack Exchange posts and 117 audit reports. For each defect, we provide a detailed description and code examples to illustrate its characteristics and potential impacts. Additionally, we introduce a tool named CDRipper (Centralization Defects Ripper) designed to identify the defined centralization defects. Specifically, CDRipper constructs a permission dependency graph (PDG) and extracts the permission dependencies of functions from the source code of smart contracts. It then detects the sensitive operations in functions and identifies centralization defects based on predefined patterns. We conduct a large-scale experiment using CDRipper on 244,424 real-world smart contracts and evaluate the results based on a manually labeled dataset.
arXiv:2411.09133
Computational metaoptics for imaging
Authors: Charles Roques-Carmes, Kai Wang, Yuanmu Yang, Arka Majumdar, Zin Lin
Submitted 13 November, 2024; originally announced November 2024.
Metasurfaces -- ultrathin structures composed of subwavelength optical elements -- have revolutionized light manipulation by enabling precise control over electromagnetic waves' amplitude, phase, polarization, and spectral properties. Concurrently, computational imaging leverages algorithms to reconstruct images from optically processed signals, overcoming limitations of traditional imaging systems. This review explores the synergistic integration of metaoptics and computational imaging, "computational metaoptics," which combines the physical wavefront shaping ability of metasurfaces with advanced computational algorithms to enhance imaging performance beyond conventional limits. We discuss how computational metaoptics addresses the inherent limitations of single-layer metasurfaces in achieving multifunctionality without compromising efficiency. By treating metasurfaces as physical preconditioners and co-designing them with reconstruction algorithms through end-to-end (inverse) design, it is possible to jointly optimize the optical hardware and computational software. This holistic approach allows for the automatic discovery of optimal metasurface designs and reconstruction methods that significantly improve imaging capabilities. Advanced applications enabled by computational metaoptics are highlighted, including phase imaging and quantum state measurement, which benefit from the metasurfaces' ability to manipulate complex light fields and the computational algorithms' capacity to reconstruct high-dimensional information. We also examine performance evaluation challenges, emphasizing the need for new metrics that account for the combined optical and computational nature of these systems.
arXiv:2411.07863
CDXFormer: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory
Authors: Zhenkai Wu, Xiaowen Ma, Rongrong Lian, Zhentao Lin, Wei Zhang
Submitted 12 November, 2024; originally announced November 2024.
In complex scenes and varied conditions, effectively integrating spatial-temporal context is crucial for accurately identifying changes. However, current RS-CD methods lack a balanced consideration of performance and efficiency. CNNs lack global context, Transformers have quadratic computational complexity, and Mambas are restricted by CUDA acceleration. In this paper, we propose CDXFormer, with a core component that is a powerful XLSTM-based feature enhancement layer, integrating the advantages of linear computational complexity, global context perception, and strong interpretability. Specifically, we introduce a scale-specific Feature Enhancer layer, incorporating a Cross-Temporal Global Perceptron customized for semantic-accurate deep features, and a Cross-Temporal Spatial Refiner customized for detail-rich shallow features. Additionally, we propose a Cross-Scale Interactive Fusion module to progressively interact global change representations with spatial responses. Extensive experimental results demonstrate that CDXFormer achieves state-of-the-art performance across three benchmark datasets, offering a compelling balance between efficiency and accuracy.
arXiv:2411.07781
RedCode: Risky Code Execution and Generation Benchmark for Code Agents
Authors: Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, Bo Li
Submitted 12 November, 2024; originally announced November 2024.
Comments: Accepted by NeurIPS 2024 Datasets and Benchmarks Track
With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents.
arXiv:2411.07724
Convergence Rate Analysis of LION
Authors: Yiming Dong, Huan Li, Zhouchen Lin
Submitted 12 November, 2024; originally announced November 2024.
The LION (evoLved sIgn mOmeNtum) optimizer for deep neural network training was found by Google via program search, with the simple sign update yet showing impressive performance in training large scale networks. Although previous studies have investigated its convergence properties, a comprehensive analysis, especially the convergence rate, is still desirable. Recognizing that LION can be regarded as solving a specific constrained problem, this paper focuses on demonstrating its convergence to the Karush-Kuhn-Tucker (KKT) point at the rate of $\mathcal{O}(\sqrt{d}K^{-1/4})$ measured by gradient $\ell_1$ norm, where $d$ is the problem dimension and $K$ is the number of iteration steps. Going a step further, we remove the constraint and establish that LION converges to the critical point of the general unconstrained problem at the same rate. This rate not only delivers the currently optimal dependence on the problem dimension $d$ but also tightly matches the theoretical lower bound for nonconvex stochastic optimization algorithms, which is typically measured using the gradient $\ell_2$ norm, with respect to the number of iterations $K$.
arXiv:2411.07140
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Authors: Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Weixun Wang, Hui Huang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Zhuoran Lin, Xuepeng Liu, Dekai Sun, Shirong Lin, Zhicheng Zheng, Xiaoyong Zhu, Wenbo Su, Bo Zheng
Submitted 13 November, 2024; v1 submitted 11 November, 2024; originally announced November 2024.
New LLM evaluation benchmarks are important to align with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions, and Chinese SimpleQA mainly has five properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language over 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality control process to achieve high-quality questions and answers, where the reference answers are static and cannot be changed over time. Third, following SimpleQA, the questions and answers are very short, and the grading process is easy-to-evaluate based on OpenAI API.
arXiv:2411.06928
Electroencephalogram-based Multi-class Decoding of Attended Speakers' Direction with Audio Spatial Spectrum
Authors: Yuanming Zhang, Jing Lu, Zhibin Lin, Fei Chen, Haoliang Du, Xia Gao
Submitted 11 November, 2024; originally announced November 2024.
Decoding the directional focus of an attended speaker from listeners' electroencephalogram (EEG) signals is essential for developing brain-computer interfaces to improve the quality of life for individuals with hearing impairment. Previous works have concentrated on binary directional focus decoding, i.e., determining whether the attended speaker is on the left or right side of the listener. However, a more precise decoding of the exact direction of the attended speaker is necessary for effective speech processing. Additionally, audio spatial information has not been effectively leveraged, resulting in suboptimal decoding results. In this paper, we observe that, on our recently presented dataset with 15-class directional focus, models relying exclusively on EEG inputs exhibit significantly lower accuracy when decoding the directional focus in both leave-one-subject-out and leave-one-trial-out scenarios. By integrating audio spatial spectra with EEG features, the decoding accuracy can be effectively improved. We employ the CNN, LSM-CNN, and EEG-Deformer models to decode the directional focus from listeners' EEG signals with the auxiliary audio spatial spectra.
arXiv:2411.06272
Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
Authors: Xiaojun Wu, Junxi Liu, Huanyi Su, Zhouchi Lin, Yiyan Qi, Chengjin Xu, Jiajun Su, Jiajie Zhong, Fuwei Wang, Saizhuo Wang, Fengrui Hua, Jia Li, Jian Guo
Submitted 9 November, 2024; originally announced November 2024.
Comments: 26 pages, 9 tables, 3 figures
As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing finance benchmarks often suffer from limited language and task coverage, as well as challenges such as low-quality datasets and inadequate adaptability for LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative datasets from both Chinese and English across eight core financial NLP tasks. Developed from extensive open source data collection and industry-specific demands, this benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o, Llama3, FinGPT and FinMA, we reveal their strengths and limitations in processing complex financial information.
Additionally, we open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research not only provides the financial large language models with a practical evaluation tool but also guides the development and optimization of future research.
arXiv:2411.05569
The Framework of NAVIS: Navigating Virtual Spaces with Immersive Scooters
Authors: Zhixun Lin, Wei He, Xinyi Liu, Mingchen Ye, Xiang Li, Ge Lin Kan
Submitted 8 November, 2024; originally announced November 2024.
Journal ref: International Conference on Mobile and Ubiquitous Multimedia 2024
Virtual reality (VR) environments have greatly expanded opportunities for immersive exploration, yet physically navigating these digital spaces remains a significant challenge. In this paper, we present the conceptual framework of NAVIS (Navigating Virtual Spaces with Immersive Scooters), a novel system that utilizes a scooter-based interface to enhance both navigation and interaction within virtual environments. NAVIS combines real-time physical mobility, haptic feedback, and CAVE-like (Cave Automatic Virtual Environment) technology to create a realistic sense of travel and movement, improving both spatial awareness and the overall immersive experience. By offering a more natural and physically engaging method of exploration, NAVIS addresses key limitations found in traditional VR locomotion techniques, such as teleportation or joystick control, which can detract from immersion and realism.
arXiv:2411.05504
LBPE: Long-token-first Tokenization to Improve Large Language Models
Authors: Haoran Lian, Yizhe Xiong, Zijia Lin, Jianwei Niu, Shasha Mo, Hui Chen, Peng Liu, Guiguang Ding
Submitted 8 November, 2024; originally announced November 2024.
Comments: arXiv admin note: text overlap with arXiv:2404.17808
The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich in semantic information, have fewer occurrences in tokenized datasets compared to short tokens, which can result in imbalanced learning issue across different tokens. To address that, we propose LBPE, which prioritizes long tokens during the encoding process. LBPE generates tokens according to their reverse ranks of token length rather than their ranks in the vocabulary, granting longer tokens higher priority during the encoding process. Consequently, LBPE smooths the frequency differences between short and long tokens, and thus mitigates the learning imbalance.
arXiv:2411.05214
STAND-Guard: A Small Task-Adaptive Content Moderation Model
Authors: Minjia Wang, Pingping Lin, Siqi Cai, Shengnan An, Shengjie Ma, Zeqi Lin, Congrui Huang, Bixiong Xu
Submitted 7 November, 2024; originally announced November 2024.
Comments: 20 pages, 1 figure
Content moderation, the process of reviewing and monitoring the safety of generated content, is important for development of welcoming online platforms and responsible large language models. Content moderation contains various tasks, each with its unique requirements tailored to specific scenarios. Therefore, it is crucial to develop a model that can be easily adapted to novel or customized content moderation tasks accurately without extensive model tuning. This paper presents STAND-GUARD, a Small Task-Adaptive coNtent moDeration model. The basic motivation is: by performing instruct tuning on various content moderation tasks, we can unleash the power of small language models (SLMs) on unseen (out-of-distribution) content moderation tasks. We also carefully study the effects of training tasks and model size on the efficacy of cross-task fine-tuning mechanism. Experiments demonstrate STAND-Guard is comparable to GPT-3.5-Turbo across over 40 public datasets, as well as proprietary datasets derived from real-world business scenarios.
arXiv:2411.03766
Number Cookbook: Number Understanding of Language Models and How to Improve It
Authors: Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, Muhan Zhang
Submitted 6 November, 2024; originally announced November 2024.
Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing (such as 9.11 > 9.9). The latter ability is essential for tackling complex arithmetic and mathematical problems and serves as a foundation for most reasoning tasks, but previous work paid little attention to it or only discussed several restricted tasks (like integer addition). In this paper, we comprehensively investigate the numerical understanding and processing ability (NUPA) of LLMs. Firstly, we introduce a benchmark covering four common numerical representations and 17 distinct numerical tasks in four major categories, resulting in 41 meaningful combinations in total. These tasks are derived from primary and secondary education curricula, encompassing nearly all everyday numerical understanding and processing scenarios, and the rules of these tasks are very simple and clear. Through the benchmark, we find that current LLMs fail frequently in many of the tasks. To study the problem, we train small models with existing and potential techniques for enhancing NUPA (such as special tokenizers, PEs, and number formats), comprehensively evaluating their effectiveness using our testbed.
We also finetune practical-scale LLMs on our proposed NUPA tasks and find that 1) naive finetuning can improve NUPA a lot on many but not all tasks, and 2) surprisingly, techniques designed to enhance NUPA prove ineffective for finetuning pretrained models. We further explore the impact of chain-of-thought techniques on NUPA.
arXiv:2411.03671
Energy-based physics-informed neural network for frictionless contact problems under large deformation
Authors: Jinshuai Bai, Zhongya Lin, Yizheng Wang, Jiancong Wen, Yinghua Liu, Timon Rabczuk, YuanTong Gu, Xi-Qiao Feng
Submitted 6 November, 2024; originally announced November 2024.
Comments: 22 pages, 9 figures
Numerical methods for contact mechanics are of great importance in engineering applications, enabling the prediction and analysis of complex surface interactions under various conditions. In this work, we propose an energy-based physics-informed neural network (PINNs) framework for solving frictionless contact problems under large deformation. Inspired by microscopic Lennard-Jones potential, a surface contact energy is used to describe the contact phenomena. To ensure the robustness of the proposed PINN framework, relaxation, gradual loading and output scaling techniques are introduced. In the numerical examples, the well-known Hertz contact benchmark problem is conducted, demonstrating the effectiveness and robustness of the proposed PINNs framework. Moreover, challenging contact problems with the consideration of geometrical and material nonlinearities are tested. It has been shown that the proposed PINNs framework provides a reliable and powerful tool for nonlinear contact mechanics. More importantly, the proposed PINNs framework exhibits competitive computational efficiency to the commercial FEM software when dealing with those complex contact problems.
arXiv:2411.02457
A Multi-Task Role-Playing Agent Capable of Imitating Character Linguistic Styles
Authors: Siyuan Chen, Qingyi Si, Chenxu Yang, Yunzhi Liang, Zheng Lin, Huan Liu, Weiping Wang
Submitted 3 November, 2024; originally announced November 2024.
The advent of large language models (LLMs) has significantly propelled the advancement of Role-Playing Agents (RPAs). However, current Role-Playing Agents predominantly focus on mimicking a character's fundamental attributes while neglecting the replication of linguistic style, and they are incapable of effectively replicating characters when performing tasks beyond multi-turn dialogues, which results in generated responses that lack authenticity. The reason current RPAs lack this capability is due to the nature of existing character datasets, which lack collections of character quotations and are limited to multi-turn dialogue tasks, constraining the RPA's performance across other task domains and failing to mimic a character's linguistic style. To address this gap, we developed a multi-task role-playing dataset named MRstyle, which encompasses a substantial number of real individuals along with their quotations and covers seven different tasks. On this basis, we develop StyleRPA, a Multi-Task Role-Playing Agent (MRPA) that significantly outperforms recent open-source LLMs and RPAs baselines on 7 tasks including Dialogue, Dictionary, Composition, Story Generation, Product Description, Music Commentary, and Open Question Answering.
arXiv:2411.02394
AutoVFX: Physically Realistic Video Editing from Natural Language Instructions
Authors: Hao-Yu Hsu, Zhi-Hao Lin, Albert Zhai, Hongchi Xia, Shenlong Wang
Submitted 4 November, 2024; originally announced November 2024.
Comments: Project page:
Modern visual effects (VFX) software has made it possible for skilled artists to create imagery of virtually anything. However, the creation process remains laborious, complex, and largely inaccessible to everyday users. In this work, we present AutoVFX, a framework that automatically creates realistic and dynamic VFX videos from a single video and natural language instructions. By carefully integrating neural scene modeling, LLM-based code generation, and physical simulation, AutoVFX is able to provide physically-grounded, photorealistic editing effects that can be controlled directly using natural language instructions. We conduct extensive experiments to validate AutoVFX's efficacy across a diverse spectrum of videos and instructions. Quantitative and qualitative results suggest that AutoVFX outperforms all competing methods by a large margin in generative quality, instruction alignment, editing versatility, and physical plausibility.
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.02394v1-abstract-full').style.display = 'none'; document.getElementById('2411.02394v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.02385</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> How Far is Video Generation from World Model: A Physical Law Perspective </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+B">Bingyi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Yue%2C+Y">Yang Yue</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+R">Rui Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhijie Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Y">Yang Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+K">Kaixin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+G">Gao Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+J">Jiashi Feng</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: 
<span class="abstract-short has-text-grey-dark mathjax" id="2411.02385v1-abstract-short" style="display: inline;"> OpenAI&#39;s Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. arXiv:2411.02385
How Far is Video Generation from World Model: A Physical Law Perspective
Authors: Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit &#34;case-based&#34; generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color &gt; size &gt; velocity &gt; shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora&#39;s broader success. 
See our project page at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.02385v1-abstract-full').style.display = 'none'; document.getElementById('2411.02385v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">preprint</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.01578</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Materials Science">cond-mat.mtrl-sci</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Neural and Evolutionary Computing">cs.NE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Chemical Physics">physics.chem-ph</span> </div> </div> <p class="title is-5 mathjax"> Integrating Graph Neural Networks and Many-Body Expansion Theory for Potential Energy Surfaces </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+S">Siqi Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhiqiang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Deng%2C+X">Xianqi Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+Y">Yili Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Ju%2C+C">Cheng-Wei Ju</a>, <a href="/search/cs?searchtype=author&amp;query=Yi%2C+J">Jun Yi</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+L">Lin Xiong</a>, <a 
href="/search/cs?searchtype=author&amp;query=Ling%2C+G">Guo Ling</a>, <a href="/search/cs?searchtype=author&amp;query=Alhmoud%2C+D">Dieaa Alhmoud</a>, <a href="/search/cs?searchtype=author&amp;query=Guan%2C+H">Hui Guan</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhou Lin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.01578v1-abstract-short" style="display: inline;"> Rational design of next-generation functional materials relied on quantitative predictions of their electronic structures beyond single building blocks. First-principles quantum mechanical (QM) modeling became infeasible as the size of a material grew beyond hundreds of atoms. In this study, we developed a new computational tool integrating fragment-based graph neural networks (FBGNN) into the fra&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.01578v1-abstract-full').style.display = 'inline'; document.getElementById('2411.01578v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.01578v1-abstract-full" style="display: none;"> Rational design of next-generation functional materials relied on quantitative predictions of their electronic structures beyond single building blocks. First-principles quantum mechanical (QM) modeling became infeasible as the size of a material grew beyond hundreds of atoms. In this study, we developed a new computational tool integrating fragment-based graph neural networks (FBGNN) into the fragment-based many-body expansion (MBE) theory, referred to as FBGNN-MBE, and demonstrated its capacity to reproduce full-dimensional potential energy surfaces (FD-PES) for hierarchic chemical systems with manageable accuracy, complexity, and interpretability. 
In particular, we divided the entire system into basic building blocks (fragments), evaluated their single-fragment energies using a first-principles QM model and attacked many-fragment interactions using the structure-property relationships trained by FBGNNs. Our development of FBGNN-MBE demonstrated the potential of a new framework integrating deep learning models into fragment-based QM methods, and marked a significant step towards computationally aided design of large functional materials. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.01578v1-abstract-full').style.display = 'none'; document.getElementById('2411.01578v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted as a Spotlight paper to NeurIPS 2024 AI4Mat Workshop. 
See</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.00418</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Self-Evolved Reward Learning for LLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Huang%2C+C">Chenghua Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhizhen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+L">Lu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+F">Fangkai Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+P">Pu Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zeqi Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Q">Qingwei Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+D">Dongmei Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Rajmohan%2C+S">Saravan Rajmohan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Q">Qi Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.00418v1-abstract-short" style="display: inline;"> Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. 
A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI system.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00418v1-abstract-full').style.display = 'inline'; document.getElementById('2411.00418v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.00418v1-abstract-full" style="display: none;"> Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI system. These methods can be costly and may introduce biases that affect the language model&#39;s responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compare SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs). 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00418v1-abstract-full').style.display = 'none'; document.getElementById('2411.00418v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">19 pages, 6 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.22952</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Dong%2C+W">Wei Dong</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+Y">Yuan Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yiting Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xing Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhijun Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+Q">Qingsen Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+H">Haokui Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+P">Peng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yang Yang</a>, <a 
href="/search/cs?searchtype=author&amp;query=Shen%2C+H">Hengtao Shen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.22952v1-abstract-short" style="display: inline;"> A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by pre&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22952v1-abstract-full').style.display = 'inline'; document.getElementById('2410.22952v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.22952v1-abstract-full" style="display: none;"> A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. 
We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, requiring only a vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22952v1-abstract-full').style.display = 'none'; document.getElementById('2410.22952v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.22373</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Analytic Continual Test-Time Adaptation for Multi-Modality Corruption </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yufei Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+Y">Yicheng Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Wei%2C+H">Hongxin Wei</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhiping Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhuang%2C+H">Huiping Zhuang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.22373v1-abstract-short" style="display: inline;"> Test-Time Adaptation (TTA) aims to help a pre-trained model bridge the gap between source and target datasets using only the pre-trained model and unlabelled test data. A key objective of TTA is to address domain shifts in test data caused by corruption, such as weather changes, noise, or sensor malfunctions. 
Multi-Modal Continual Test-Time Adaptation (MM-CTTA), an extension of TTA with better real-&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22373v1-abstract-full').style.display = 'inline'; document.getElementById('2410.22373v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.22373v1-abstract-full" style="display: none;"> Test-Time Adaptation (TTA) aims to help a pre-trained model bridge the gap between source and target datasets using only the pre-trained model and unlabelled test data. A key objective of TTA is to address domain shifts in test data caused by corruption, such as weather changes, noise, or sensor malfunctions. Multi-Modal Continual Test-Time Adaptation (MM-CTTA), an extension of TTA with better real-world applications, further allows pre-trained models to handle multi-modal inputs and adapt to continuously-changing target domains. MM-CTTA typically faces challenges including error accumulation, catastrophic forgetting, and reliability bias, with few existing approaches effectively addressing these issues in multi-modal corruption scenarios. In this paper, we propose a novel approach, Multi-modality Dynamic Analytic Adapter (MDAA), for MM-CTTA tasks. We innovatively introduce analytic learning into TTA, using the Analytic Classifiers (ACs) to prevent model forgetting. Additionally, we develop a Dynamic Selection Mechanism (DSM) and a Soft Pseudo-label Strategy (SPS), which enable MDAA to dynamically filter reliable samples and integrate information from different modalities. Extensive experiments demonstrate that MDAA achieves state-of-the-art performance on MM-CTTA tasks while ensuring reliable model adaptation. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22373v1-abstract-full').style.display = 'none'; document.getElementById('2410.22373v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.20199</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Beigi%2C+M">Mohammad Beigi</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Sijia Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+Y">Ying Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zihao Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Kulkarni%2C+A">Adithya Kulkarni</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+J">Jianfeng He</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+F">Feng Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Jin%2C+M">Ming Jin</a>, <a href="/search/cs?searchtype=author&amp;query=Cho%2C+J">Jin-Hee Cho</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+D">Dawei Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+C">Chang-Tien Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+L">Lifu Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.20199v1-abstract-short" style="display: inline;"> In recent years, Large Language Models (LLMs) have become fundamental to a broad spectrum of artificial intelligence applications. As the use of LLMs expands, precisely estimating the uncertainty in their predictions has become crucial. Current methods often struggle to accurately identify, measure, and address the true uncertainty, with many focusing primarily on estimating model confidence. This&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20199v1-abstract-full').style.display = 'inline'; document.getElementById('2410.20199v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.20199v1-abstract-full" style="display: none;"> In recent years, Large Language Models (LLMs) have become fundamental to a broad spectrum of artificial intelligence applications. As the use of LLMs expands, precisely estimating the uncertainty in their predictions has become crucial. Current methods often struggle to accurately identify, measure, and address the true uncertainty, with many focusing primarily on estimating model confidence. This discrepancy is largely due to an incomplete understanding of where, when, and how uncertainties are injected into models. This paper introduces a comprehensive framework specifically designed to identify and understand the types and sources of uncertainty, aligned with the unique characteristics of LLMs. Our framework enhances the understanding of the diverse landscape of uncertainties by systematically categorizing and defining each type, establishing a solid foundation for developing targeted methods that can precisely quantify these uncertainties. 
We also provide a detailed introduction to key related concepts and examine the limitations of current methods in mission-critical and safety-sensitive applications. The paper concludes with a perspective on future directions aimed at enhancing the reliability and practical adoption of these methods in real-world scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20199v1-abstract-full').style.display = 'none'; document.getElementById('2410.20199v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.20132</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Biomolecules">q-bio.BM</span> </div> </div> <p class="title is-5 mathjax"> On-Site Precise Screening of SARS-CoV-2 Systems Using a Channel-Wise Attention-Based PLS-1D-CNN Model with Limited Infrared Signatures </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+W">Wenwen Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Z">Zhouzhuo Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+Y">Yingmei Feng</a>, <a 
href="/search/cs?searchtype=author&amp;query=Yu%2C+X">Xia Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q+J">Qi Jie Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhiping Lin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.20132v1-abstract-short" style="display: inline;"> During the early stages of respiratory virus outbreaks, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the efficient use of limited nasopharyngeal swabs for rapid and accurate screening is crucial for public health. In this study, we present a methodology that integrates attenuated total reflection-Fourier transform infrared spectroscopy (ATR-FTIR) with the adaptive iter&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20132v1-abstract-full').style.display = 'inline'; document.getElementById('2410.20132v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.20132v1-abstract-full" style="display: none;"> During the early stages of respiratory virus outbreaks, such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the efficient use of limited nasopharyngeal swabs for rapid and accurate screening is crucial for public health. In this study, we present a methodology that integrates attenuated total reflection-Fourier transform infrared spectroscopy (ATR-FTIR) with the adaptive iteratively reweighted penalized least squares (airPLS) preprocessing algorithm and a channel-wise attention-based partial least squares one-dimensional convolutional neural network (PLS-1D-CNN) model, enabling accurate screening of infected individuals within 10 minutes. 
Two cohorts of nasopharyngeal swab samples, comprising 126 and 112 samples from suspected SARS-CoV-2 Omicron variant cases, were collected at Beijing You&#39;an Hospital for verification. Given that ATR-FTIR spectra are highly sensitive to variations in experimental conditions, which can affect their quality, we propose a biomolecular importance (BMI) evaluation method to assess signal quality across different conditions, validated by comparing BMI with PLS-GBM and PLS-RF results. For the ATR-FTIR signals in cohort 2, which exhibited a higher BMI, airPLS was utilized for signal preprocessing, followed by the application of the channel-wise attention-based PLS-1D-CNN model for screening. The experimental results demonstrate that our model outperforms recently reported methods in the field of respiratory virus spectrum detection, achieving a recognition screening accuracy of 96.48%, a sensitivity of 96.24%, a specificity of 97.14%, an F1-score of 96.12%, and an AUC of 0.99. It meets the World Health Organization (WHO) recommended criteria for an acceptable product: sensitivity of 95.00% or greater and specificity of 97.00% or greater for testing prior SARS-CoV-2 infection in moderate to high volume scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20132v1-abstract-full').style.display = 'none'; document.getElementById('2410.20132v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.19843</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Artificial intelligence for partial differential equations in computational mechanics: A review </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yizheng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Bai%2C+J">Jinshuai Bai</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhongya Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q">Qimin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Anitescu%2C+C">Cosmin Anitescu</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+J">Jia Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Eshaghi%2C+M+S">Mohammad Sadegh Eshaghi</a>, <a href="/search/cs?searchtype=author&amp;query=Gu%2C+Y">Yuantong Gu</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+X">Xi-Qiao Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhuang%2C+X">Xiaoying Zhuang</a>, <a href="/search/cs?searchtype=author&amp;query=Rabczuk%2C+T">Timon Rabczuk</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yinghua Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.19843v2-abstract-short" style="display: inline;"> In recent years, Artificial intelligence (AI) has become ubiquitous, empowering various fields, especially integrating artificial intelligence and traditional science (AI for 
Science: Artificial intelligence for science), which has attracted widespread attention. In AI for Science, using artificial intelligence algorithms to solve partial differential equations (AI for PDEs: Artificial intelligenc&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.19843v2-abstract-full').style.display = 'inline'; document.getElementById('2410.19843v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.19843v2-abstract-full" style="display: none;"> In recent years, Artificial intelligence (AI) has become ubiquitous, empowering various fields, especially integrating artificial intelligence and traditional science (AI for Science: Artificial intelligence for science), which has attracted widespread attention. In AI for Science, using artificial intelligence algorithms to solve partial differential equations (AI for PDEs: Artificial intelligence for partial differential equations) has become a focal point in computational mechanics. The core of AI for PDEs is the fusion of data and partial differential equations (PDEs), which can solve almost any PDE. Therefore, this article provides a comprehensive review of the research on AI for PDEs, summarizing the existing algorithms and theories. The article discusses the applications of AI for PDEs in computational mechanics, including solid mechanics, fluid mechanics, and biomechanics. The existing AI for PDEs algorithms include those based on Physics-Informed Neural Networks (PINNs), Deep Energy Methods (DEM), Operator Learning, and Physics-Informed Neural Operator (PINO). AI for PDEs represents a new method of scientific simulation that obtains approximate solutions to specific problems from large amounts of data and then fine-tunes them according to specific physical laws, avoiding the need to compute from scratch as traditional algorithms do. 
Thus, AI for PDEs is the prototype for future foundation models in computational mechanics, capable of significantly accelerating traditional numerical algorithms. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.19843v2-abstract-full').style.display = 'none'; document.getElementById('2410.19843v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 21 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.18974</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> 3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+H">Hansheng Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+B">Bokui Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yulin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+R">Ruoxi Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+L">Linqi Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+C+Z">Connor Z. 
Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Gu%2C+J">Jiayuan Gu</a>, <a href="/search/cs?searchtype=author&amp;query=Su%2C+H">Hao Su</a>, <a href="/search/cs?searchtype=author&amp;query=Wetzstein%2C+G">Gordon Wetzstein</a>, <a href="/search/cs?searchtype=author&amp;query=Guibas%2C+L">Leonidas Guibas</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.18974v1-abstract-short" style="display: inline;"> Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to ou&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18974v1-abstract-full').style.display = 'inline'; document.getElementById('2410.18974v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.18974v1-abstract-full" style="display: none;"> Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is the idea of 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. 
We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18974v1-abstract-full').style.display = 'none'; document.getElementById('2410.18974v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.18808</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Delving into the Reversal Curse: How Far Can Large Language Models Generalize? 
</p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhengkai Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Fu%2C+Z">Zhihang Fu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Kai Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+L">Liang Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+B">Binbin Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wenxiao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+D">Deng Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+Y">Yue Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+J">Jieping Ye</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.18808v2-abstract-short" style="display: inline;"> While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. A prime example is the recently debated &#34;reversal curse&#34;, which surfaces when models, having been trained on the fact &#34;A is B&#34;, struggle to generalize this knowledge to infer that &#34;B is A&#34;. In this paper, we examine the manifestation of the rev&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18808v2-abstract-full').style.display = 'inline'; document.getElementById('2410.18808v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.18808v2-abstract-full" style="display: none;"> While large language models (LLMs) showcase unprecedented capabilities, they also exhibit certain inherent limitations when facing seemingly trivial tasks. 
A prime example is the recently debated &#34;reversal curse&#34;, which surfaces when models, having been trained on the fact &#34;A is B&#34;, struggle to generalize this knowledge to infer that &#34;B is A&#34;. In this paper, we examine the manifestation of the reversal curse across various tasks and delve into both the generalization abilities and the problem-solving mechanisms of LLMs. This investigation leads to a series of significant insights: (1) LLMs are able to generalize to &#34;B is A&#34; when both A and B are presented in the context as in the case of a multiple-choice question. (2) This generalization ability is highly correlated to the structure of the fact &#34;A is B&#34; in the training documents. For example, this generalization only applies to biographies structured in &#34;[Name] is [Description]&#34; but not to &#34;[Description] is [Name]&#34;. (3) We propose and verify the hypothesis that LLMs possess an inherent bias in fact recalling during knowledge application, which explains and underscores the importance of the document structure to successful learning. (4) The negative impact of this bias on the downstream performance of LLMs can hardly be mitigated through training alone. These findings offer a novel perspective on interpreting LLMs&#39; generalization through their intrinsic mechanisms and provide insights for developing more effective learning methods. 
Our code and data are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18808v2-abstract-full').style.display = 'none'; document.getElementById('2410.18808v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted at NeurIPS 2024. Our code and data are available at</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.18406</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Databases">cs.DB</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhisheng Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yifu Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+Z">Zhiling Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+J">Jinyang Gao</a>, 
<a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yu Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.18406v1-abstract-short" style="display: inline;"> The improvement in translating natural language to structured query language (SQL) can be attributed to the advancements in large language models (LLMs). Open-source LLMs, tailored for specific database dialects such as MySQL, have shown great performance. However, cloud service providers are looking for a unified database manager service (e.g., Cosmos DB from Azure, Amazon Aurora from AWS, Lindor&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18406v1-abstract-full').style.display = 'inline'; document.getElementById('2410.18406v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.18406v1-abstract-full" style="display: none;"> The improvement in translating natural language to structured query language (SQL) can be attributed to the advancements in large language models (LLMs). Open-source LLMs, tailored for specific database dialects such as MySQL, have shown great performance. However, cloud service providers are looking for a unified database manager service (e.g., Cosmos DB from Azure, Amazon Aurora from AWS, Lindorm from AlibabaCloud) that can support multiple dialects. This requirement has led to the concept of multi-dialect query generation, which presents challenges to LLMs. These challenges include syntactic differences among dialects and imbalanced data distribution across multiple dialects. To tackle these challenges, we propose MoMQ, a novel Mixture-of-Experts-based multi-dialect query generation framework across both relational and non-relational databases. 
MoMQ employs a dialect expert group for each dialect and a multi-level routing strategy to handle dialect-specific knowledge, reducing interference during query generation. Additionally, a shared expert group is introduced to address data imbalance, facilitating the transfer of common knowledge from high-resource dialects to low-resource ones. Furthermore, we have developed a high-quality multi-dialect query generation benchmark that covers relational and non-relational databases such as MySQL, PostgreSQL, Cypher for Neo4j, and nGQL for NebulaGraph. Extensive experiments have shown that MoMQ performs effectively and robustly even in resource-imbalanced scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18406v1-abstract-full').style.display = 'none'; document.getElementById('2410.18406v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.17744</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Learning Versatile Skills with Curriculum Masking </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Y">Yao Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+Z">Zhihui Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zichuan Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+D">Deheng Ye</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+S">Shuai Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.17744v2-abstract-short" style="display: inline;"> Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. 
To address this, we propose CurrMask,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17744v2-abstract-full').style.display = 'inline'; document.getElementById('2410.17744v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.17744v2-abstract-full" style="display: none;"> Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17744v2-abstract-full').style.display = 'none'; document.getElementById('2410.17744v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 23 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2024 poster, 21 pages, 8 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.17700</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Scalable Random Feature Latent Variable Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Ying Li</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhidi Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yuhao Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+M+M">Michael Minyi Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Olmos%2C+P+M">Pablo M. Olmos</a>, <a href="/search/cs?searchtype=author&amp;query=Djuri%C4%87%2C+P+M">Petar M. Djurić</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.17700v1-abstract-short" style="display: inline;"> Random feature latent variable models (RFLVMs) represent the state-of-the-art in latent variable models, capable of handling non-Gaussian likelihoods and effectively uncovering patterns in high-dimensional data. However, their heavy reliance on Monte Carlo sampling results in scalability issues which makes it difficult to use these models for datasets with a massive number of observations. 
To scal&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17700v1-abstract-full').style.display = 'inline'; document.getElementById('2410.17700v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.17700v1-abstract-full" style="display: none;"> Random feature latent variable models (RFLVMs) represent the state-of-the-art in latent variable models, capable of handling non-Gaussian likelihoods and effectively uncovering patterns in high-dimensional data. However, their heavy reliance on Monte Carlo sampling results in scalability issues which makes it difficult to use these models for datasets with a massive number of observations. To scale up RFLVMs, we turn to the optimization-based variational Bayesian inference (VBI) algorithm which is known for its scalability compared to sampling-based methods. However, implementing VBI for RFLVMs poses challenges, such as the lack of explicit probability distribution functions (PDFs) for the Dirichlet process (DP) in the kernel learning component, and the incompatibility of existing VBI algorithms with RFLVMs. To address these issues, we introduce a stick-breaking construction for DP to obtain an explicit PDF and a novel VBI algorithm called ``block coordinate descent variational inference&#34; (BCD-VI). This enables the development of a scalable version of RFLVMs, or in short, SRFLVM. Our proposed method shows scalability, computational efficiency, superior performance in generating informative latent representations and the ability of imputing missing data across various real-world datasets, outperforming state-of-the-art competitors. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17700v1-abstract-full').style.display = 'none'; document.getElementById('2410.17700v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.17610</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> ImDy: Human Inverse Dynamics from Imitated Observations </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xinpeng Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liang%2C+J">Junxuan Liang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zili Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Hou%2C+H">Haowen Hou</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yong-Lu Li</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+C">Cewu Lu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.17610v1-abstract-short" style="display: inline;"> Inverse dynamics (ID), which aims 
at reproducing the driven torques from human kinematic observations, has been a critical tool for gait analysis. However, it is hindered from wider application to general motion due to its limited scalability. Conventional optimization-based ID requires expensive laboratory setups, restricting its availability. To alleviate this problem, we propose to exploit the&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17610v1-abstract-full').style.display = 'inline'; document.getElementById('2410.17610v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.17610v1-abstract-full" style="display: none;"> Inverse dynamics (ID), which aims at reproducing the driven torques from human kinematic observations, has been a critical tool for gait analysis. However, it is hindered from wider application to general motion due to its limited scalability. Conventional optimization-based ID requires expensive laboratory setups, restricting its availability. To alleviate this problem, we propose to exploit the recently progressive human motion imitation algorithms to learn human inverse dynamics in a data-driven manner. The key insight is that the human ID knowledge is implicitly possessed by motion imitators, though not directly applicable. In light of this, we devise an efficient data collection pipeline with state-of-the-art motion imitation algorithms and physics simulators, resulting in a large-scale human inverse dynamics benchmark as Imitated Dynamics (ImDy). ImDy contains over 150 hours of motion with joint torque and full-body ground reaction force data. With ImDy, we train a data-driven human inverse dynamics solver ImDyS(olver) in a fully supervised manner, which conducts ID and ground reaction force estimation simultaneously. 
Experiments on ImDy and real-world data demonstrate the impressive competency of ImDyS in human inverse dynamics and ground reaction force estimation. Moreover, the potential of ImDy(-S) as a fundamental motion analysis tool is exhibited with downstream applications. The project page is <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17610v1-abstract-full').style.display = 'none'; document.getElementById('2410.17610v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Yong-Lu Li and Cewu Lu are the corresponding authors</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.17095</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> </div> <p class="title is-5 mathjax"> Inferentially-Private Private Information </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Shuaiqi Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+S">Shuran Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zinan Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Fanti%2C+G">Giulia Fanti</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+Z+S">Zhiwei Steven Wu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short 
has-text-grey-dark mathjax" id="2410.17095v1-abstract-short" style="display: inline;"> Information disclosure can compromise privacy when revealed information is correlated with private information. We consider the notion of inferential privacy, which measures privacy leakage by bounding the inferential power a Bayesian adversary can gain by observing a released signal. Our goal is to devise an inferentially-private private information structure that maximizes the informativeness of&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17095v1-abstract-full').style.display = 'inline'; document.getElementById('2410.17095v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.17095v1-abstract-full" style="display: none;"> Information disclosure can compromise privacy when revealed information is correlated with private information. We consider the notion of inferential privacy, which measures privacy leakage by bounding the inferential power a Bayesian adversary can gain by observing a released signal. Our goal is to devise an inferentially-private private information structure that maximizes the informativeness of the released signal, following the Blackwell ordering principle, while adhering to inferential privacy constraints. To achieve this, we devise an efficient release mechanism that achieves the inferentially-private Blackwell optimal private information structure for the setting where the private information is binary. Additionally, we propose a programming approach to compute the optimal structure for general cases given the utility function. The design of our mechanisms builds on our geometric characterization of the Blackwell-optimal disclosure mechanisms under privacy constraints, which may be of independent interest. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17095v1-abstract-full').style.display = 'none'; document.getElementById('2410.17095v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.16077</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Su%2C+Z">Zhenpeng Su</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+X">Xing Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zijia Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+Y">Yizhe Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Lv%2C+M">Minxuan Lv</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+G">Guangyuan Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+H">Hui Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+S">Songlin Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Ding%2C+G">Guiguang Ding</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2410.16077v2-abstract-short" style="display: inline;"> Large language models (LLM) have been attracting much attention from the community recently, due to their remarkable performance in all kinds of downstream tasks. According to the well-known scaling law, scaling up a dense LLM enhances its capabilities, but also significantly increases the computational complexity. Mixture-of-Experts (MoE) models address that by allowing the model size to grow wit&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16077v2-abstract-full').style.display = 'inline'; document.getElementById('2410.16077v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.16077v2-abstract-full" style="display: none;"> Large language models (LLM) have been attracting much attention from the community recently, due to their remarkable performance in all kinds of downstream tasks. According to the well-known scaling law, scaling up a dense LLM enhances its capabilities, but also significantly increases the computational complexity. Mixture-of-Experts (MoE) models address that by allowing the model size to grow without substantially raising training or inference costs. Yet MoE models face challenges regarding knowledge sharing among experts, making their performance somehow sensitive to routing accuracy. To tackle that, previous works introduced shared experts and combined their outputs with those of the top $K$ routed experts in an ``addition&#39;&#39; manner. In this paper, inspired by collective matrix factorization to learn shared knowledge among data, we propose CartesianMoE, which implements more effective knowledge sharing among experts in more like a ``multiplication&#39;&#39; manner. Extensive experimental results indicate that CartesianMoE outperforms previous MoE models for building LLMs, in terms of both perplexity and downstream task performance. 
And we also find that CartesianMoE achieves better expert routing robustness. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16077v2-abstract-full').style.display = 'none'; document.getElementById('2410.16077v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 21 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.15702</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xu%2C+D">Derong Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Z">Ziheng Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+Z">Zhihong Zhu</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhenxi Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Q">Qidong Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+X">Xian Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+T">Tong Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+X">Xiangyu Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+Y">Yefeng Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+E">Enhong Chen</a> </p> <p class="abstract mathjax"> <span 
class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.15702v1-abstract-short" style="display: inline;"> The impressive capabilities of large language models (LLMs) have attracted extensive interests of applying LLMs to medical field. However, the complex nature of clinical environments presents significant hallucination challenges for LLMs, hindering their widespread adoption. In this paper, we address these hallucination issues in the context of Medical Information Extraction (MIE) tasks by introdu&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.15702v1-abstract-full').style.display = 'inline'; document.getElementById('2410.15702v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.15702v1-abstract-full" style="display: none;"> The impressive capabilities of large language models (LLMs) have attracted extensive interests of applying LLMs to medical field. However, the complex nature of clinical environments presents significant hallucination challenges for LLMs, hindering their widespread adoption. In this paper, we address these hallucination issues in the context of Medical Information Extraction (MIE) tasks by introducing ALternate Contrastive Decoding (ALCD). We begin by redefining MIE tasks as an identify-and-classify process. We then separate the identification and classification functions of LLMs by selectively masking the optimization of tokens during fine-tuning. During the inference stage, we alternately contrast output distributions derived from sub-task models. This approach aims to selectively enhance the identification and classification capabilities while minimizing the influence of other inherent abilities in LLMs. 
Additionally, we propose an alternate adaptive constraint strategy to more effectively adjust the scale and scope of contrastive tokens. Through comprehensive experiments on two different backbones and six diverse medical information extraction tasks, ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.15702v1-abstract-full').style.display = 'none'; document.getElementById('2410.15702v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by EMNLP 2024 Findings</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.15355</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> LAC: Graph Contrastive Learning with Learnable Augmentation in Continuous Space </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhenyu Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+H">Hongzheng Li</a>, <a href="/search/cs?searchtype=author&amp;query=Shao%2C+Y">Yingxia Shao</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+G">Guanhua Ye</a>, <a 
href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yawen Li</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+Q">Quanqing Xu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.15355v1-abstract-short" style="display: inline;"> Graph Contrastive Learning frameworks have demonstrated success in generating high-quality node representations. The existing research on efficient data augmentation methods and ideal pretext tasks for graph contrastive learning remains limited, resulting in suboptimal node representation in the unsupervised setting. In this paper, we introduce LAC, a graph contrastive learning framework with&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.15355v1-abstract-full').style.display = 'inline'; document.getElementById('2410.15355v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.15355v1-abstract-full" style="display: none;"> Graph Contrastive Learning frameworks have demonstrated success in generating high-quality node representations. The existing research on efficient data augmentation methods and ideal pretext tasks for graph contrastive learning remains limited, resulting in suboptimal node representation in the unsupervised setting. In this paper, we introduce LAC, a graph contrastive learning framework with learnable data augmentation in an orthogonal continuous space. To capture the representative information in the graph data during augmentation, we introduce a continuous view augmenter, that applies both a masked topology augmentation module and a cross-channel feature augmentation module to adaptively augment the topological information and the feature information within an orthogonal continuous space, respectively. 
The orthogonal nature of continuous space ensures that the augmentation process avoids dimension collapse. To enhance the effectiveness of pretext tasks, we propose an information-theoretic principle named InfoBal and introduce corresponding pretext tasks. These tasks enable the continuous view augmenter to maintain consistency in the representative information across views while maximizing diversity between views, and allow the encoder to fully utilize the representative information in the unsupervised setting. Our experimental results show that LAC significantly outperforms the state-of-the-art frameworks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.15355v1-abstract-full').style.display = 'none'; document.getElementById('2410.15355v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.14669</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+B">Baiqi Li</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhiqiu Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Peng%2C+W">Wenxuan Peng</a>, <a href="/search/cs?searchtype=author&amp;query=Nyandwi%2C+J+d+D">Jean de Dieu Nyandwi</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+D">Daniel Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+Z">Zixian Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Khanuja%2C+S">Simran Khanuja</a>, <a href="/search/cs?searchtype=author&amp;query=Krishna%2C+R">Ranjay Krishna</a>, <a href="/search/cs?searchtype=author&amp;query=Neubig%2C+G">Graham Neubig</a>, <a href="/search/cs?searchtype=author&amp;query=Ramanan%2C+D">Deva Ramanan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.14669v2-abstract-short" style="display: inline;"> Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? 
In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to g&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.14669v2-abstract-full').style.display = 'inline'; document.getElementById('2410.14669v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.14669v2-abstract-full" style="display: none;"> Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a $\textbf{vision-centric}$ design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. 
To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.14669v2-abstract-full').style.display = 'none'; document.getElementById('2410.14669v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 18 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to NeurIPS 24; We open-source our dataset at: ; Project page at:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.13790</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xu%2C+L">Liang Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Hua%2C+S">Shaoyang Hua</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zili Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yifan Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+F">Feipeng Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+Y">Yichao Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Jin%2C+X">Xin Jin</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+X">Xiaokang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Zeng%2C+W">Wenjun Zeng</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.13790v1-abstract-short" style="display: inline;"> In this paper, we tackle the problem of how to build and benchmark a large motion model (LMM). The ultimate goal of LMM is to serve as a foundation model for versatile motion-related tasks, e.g., human motion generation, with interpretability and generalizability. 
Though advanced, recent LMM-related works are still limited by small-scale motion data and costly text descriptions. Besides, previous&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.13790v1-abstract-full').style.display = 'inline'; document.getElementById('2410.13790v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.13790v1-abstract-full" style="display: none;"> In this paper, we tackle the problem of how to build and benchmark a large motion model (LMM). The ultimate goal of LMM is to serve as a foundation model for versatile motion-related tasks, e.g., human motion generation, with interpretability and generalizability. Though advanced, recent LMM-related works are still limited by small-scale motion data and costly text descriptions. Besides, previous motion benchmarks primarily focus on pure body movements, neglecting the ubiquitous motions in context, i.e., humans interacting with humans, objects, and scenes. To address these limitations, we consolidate large-scale video action datasets as knowledge banks to build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions. Different from laboratory-captured motions, in-the-wild human-centric videos contain abundant motions in context. To facilitate better motion text alignment, we also meticulously devise a motion caption generation algorithm to automatically produce rule-based, unbiased, and disentangled text descriptions via the kinematic characteristics for each motion. Extensive experiments show that our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding. Video motions together with the rule-based text annotations could serve as an efficient alternative for larger LMMs. 
Our dataset, codes, and benchmark will be publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.13790v1-abstract-full').style.display = 'none'; document.getElementById('2410.13790v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.13613</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> </div> </div> <p class="title is-5 mathjax"> MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xinjie Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zhening Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yifan Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Ge%2C+X">Xingtong Ge</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+D">Dailan He</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+T">Tongda Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zehong Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+S">Shuicheng Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jun Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.13613v1-abstract-short" style="display: inline;"> 4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leadi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.13613v1-abstract-full').style.display = 'inline'; document.getElementById('2410.13613v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.13613v1-abstract-full" style="display: none;"> 4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. 
Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190$\times$ and 125$\times$ on the Technicolor and Neural 3D Video datasets, respectively, compared to the original 4DGS. Meanwhile, it maintains comparable rendering speeds and scene representation quality, setting a new standard in the field. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.13613v1-abstract-full').style.display = 'none'; document.getElementById('2410.13613v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.12564</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ruan%2C+J">Jiacheng Ruan</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yebin Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zehao Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+Y">Yuchen Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+F">Feiyu Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Z">Zeyun Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhiyu Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.12564v2-abstract-short" style="display: inline;"> Benefiting from the revolutionary advances in large language models (LLMs) and foundational vision models, large vision-language models (LVLMs) have also made significant progress. However, current benchmarks focus on tasks that evaluate only a single aspect of LVLM capabilities (e.g., recognition, detection, understanding). 
These tasks fail to fully demonstrate LVLMs&#39; potential in complex appli&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.12564v2-abstract-full').style.display = 'inline'; document.getElementById('2410.12564v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.12564v2-abstract-full" style="display: none;"> Benefiting from the revolutionary advances in large language models (LLMs) and foundational vision models, large vision-language models (LVLMs) have also made significant progress. However, current benchmarks focus on tasks that evaluate only a single aspect of LVLM capabilities (e.g., recognition, detection, understanding). These tasks fail to fully demonstrate LVLMs&#39; potential in complex application scenarios. To comprehensively assess the performance of existing LVLMs, we propose a more challenging task called the Flow Text with Image Insertion task (FTII). This task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation. Specifically, given several text paragraphs and a set of candidate images, as the text paragraphs accumulate, the LVLMs are required to select the most suitable image from the candidates to insert after the corresponding paragraph. Constructing a benchmark for such a task is highly challenging, particularly in determining the sequence of flowing text and images. To address this challenge, we turn to professional news reports, which naturally contain a gold standard for image-text sequences. Based on this, we introduce the Flow Text with Image Insertion Benchmark (FTII-Bench), which includes 318 high-quality Chinese image-text news articles and 307 high-quality English image-text news articles, covering 10 different news domains. 
Using these 625 high-quality articles, we construct problems of two different types with multiple levels of difficulty. Furthermore, we establish two different evaluation pipelines based on the CLIP model and existing LVLMs. We evaluate 9 open-source and 2 closed-source LVLMs as well as 2 CLIP-based models. Results indicate that even the most advanced models (e.g., GPT-4o) face significant challenges when tackling the FTII task. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.12564v2-abstract-full').style.display = 'none'; document.getElementById('2410.12564v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Work in progress. 
9 pages, 3 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.11228</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> TEOcc: Radar-camera Multi-modal Occupancy Prediction via Temporal Enhancement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhiwei Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Jin%2C+H">Hongbo Jin</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yongtao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wei%2C+Y">Yufei Wei</a>, <a href="/search/cs?searchtype=author&amp;query=Dong%2C+N">Nan Dong</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.11228v1-abstract-short" style="display: inline;"> As a novel 3D scene representation, semantic occupancy has gained much attention in autonomous driving. However, existing occupancy prediction methods mainly focus on designing better occupancy representations, such as tri-perspective view or neural radiance fields, while ignoring the advantages of using long-temporal information. 
In this paper, we propose a radar-camera multi-modal temporal enhan&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11228v1-abstract-full').style.display = 'inline'; document.getElementById('2410.11228v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.11228v1-abstract-full" style="display: none;"> As a novel 3D scene representation, semantic occupancy has gained much attention in autonomous driving. However, existing occupancy prediction methods mainly focus on designing better occupancy representations, such as tri-perspective view or neural radiance fields, while ignoring the advantages of using long-temporal information. In this paper, we propose a radar-camera multi-modal temporal enhanced occupancy prediction network, dubbed TEOcc. Our method is inspired by the success of utilizing temporal information in 3D object detection. Specifically, we introduce a temporal enhancement branch to learn temporal occupancy prediction. In this branch, we randomly discard the t-k input frame of the multi-view camera and predict its 3D occupancy by long-term and short-term temporal decoders separately with the information from other adjacent frames and multi-modal inputs. Besides, to reduce computational costs and incorporate multi-modal inputs, we specially design 3D convolutional layers for long-term and short-term temporal decoders. Furthermore, since the lightweight occupancy prediction head is a dense classification head, we propose to use a shared occupancy prediction head for the temporal enhancement and main branches. It is worth noting that the temporal enhancement branch is only performed during training and is discarded during inference. Experimental results demonstrate that TEOcc achieves state-of-the-art occupancy prediction on nuScenes benchmarks. 
In addition, the proposed temporal enhancement branch is a plug-and-play module that can be easily integrated into existing occupancy prediction methods to improve the performance of occupancy prediction. The code and models will be released at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11228v1-abstract-full').style.display = 'none'; document.getElementById('2410.11228v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by ECAI2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.10857</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Mirror-Consistency: Harnessing Inconsistency in Majority Voting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Huang%2C+S">Siyuan Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+Z">Zhiyuan Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+J">Jintao Du</a>, <a href="/search/cs?searchtype=author&amp;query=Meng%2C+C">Changhua Meng</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Weiqiang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhouhan Lin</a> </p> <p 
class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.10857v1-abstract-short" style="display: inline;"> Self-Consistency, a widely-used decoding strategy, significantly boosts the reasoning capabilities of Large Language Models (LLMs). However, it depends on the plurality voting rule, which focuses on the most frequent answer while overlooking all other minority responses. These inconsistent minority views often illuminate areas of uncertainty within the model&#39;s generation process. To address this l&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10857v1-abstract-full').style.display = 'inline'; document.getElementById('2410.10857v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.10857v1-abstract-full" style="display: none;"> Self-Consistency, a widely-used decoding strategy, significantly boosts the reasoning capabilities of Large Language Models (LLMs). However, it depends on the plurality voting rule, which focuses on the most frequent answer while overlooking all other minority responses. These inconsistent minority views often illuminate areas of uncertainty within the model&#39;s generation process. To address this limitation, we present Mirror-Consistency, an enhancement of the standard Self-Consistency approach. Our method incorporates a &#39;reflective mirror&#39; into the self-ensemble decoding process and enables LLMs to critically examine inconsistencies among multiple generations. Additionally, just as humans use the mirror to better understand themselves, we propose using Mirror-Consistency to enhance the sample-based confidence calibration methods, which helps to mitigate issues of overconfidence. 
Our experimental results demonstrate that Mirror-Consistency yields superior performance in both reasoning accuracy and confidence calibration compared to Self-Consistency. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10857v1-abstract-full').style.display = 'none'; document.getElementById('2410.10857v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EMNLP 2024 Short Findings</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.10816</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> LVD-2M: A Long-take Video Dataset with Temporally Dense Captions </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+T">Tianwei Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuqing Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+D">Daquan Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zhijie Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+J">Jiashi Feng</a>, <a 
href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xihui Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.10816v1-abstract-short" style="display: inline;"> The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10816v1-abstract-full').style.display = 'inline'; document.getElementById('2410.10816v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.10816v1-abstract-full" style="display: none;"> The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. 
Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10816v1-abstract-full').style.display = 'none'; document.getElementById('2410.10816v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2024 Dataset and Benchmark Track. Project page: . 
Code:</span> </p> </li> </ol> <nav class="pagination is-small is-centered breathe-horizontal" role="navigation" aria-label="pagination"> <a href="" class="pagination-previous is-invisible">Previous </a> <a href="/search/?searchtype=author&amp;query=Lin%2C+Z&amp;start=50" class="pagination-next" >Next </a> <ul class="pagination-list"> <li> <a href="/search/?searchtype=author&amp;query=Lin%2C+Z&amp;start=0" class="pagination-link is-current" aria-label="Goto page 1">1 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Lin%2C+Z&amp;start=50" class="pagination-link " aria-label="Page 2" aria-current="page">2 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Lin%2C+Z&amp;start=100" class="pagination-link " aria-label="Page 3" aria-current="page">3 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Lin%2C+Z&amp;start=150" class="pagination-link " aria-label="Page 4" aria-current="page">4 </a> </li> <li> <a 