class="pagination-ellipsis">&hellip;</span></li> </ul> </nav> <ol class="breathe-horizontal" start="1"> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12313</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> C$^{2}$INet: Realizing Incremental Trajectory Prediction with Prior-Aware Continual Causal Intervention </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+X">Xiaohe Li</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+F">Feilong Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zide Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Mou%2C+F">Fangli Mou</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+L">Leilei Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Hou%2C+Y">Yingyan Hou</a>, <a href="/search/cs?searchtype=author&amp;query=Wen%2C+L">Lijie Wen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12313v1-abstract-short" style="display: inline;"> Trajectory prediction for multi-agents in complex scenarios is crucial for applications like autonomous driving. However, existing methods often overlook environmental biases, which leads to poor generalization. Additionally, hardware constraints limit the use of large-scale data across environments, and continual learning settings exacerbate the challenge of catastrophic forgetting. 
To address th&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12313v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12313v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12313v1-abstract-full" style="display: none;"> Trajectory prediction for multi-agents in complex scenarios is crucial for applications like autonomous driving. However, existing methods often overlook environmental biases, which leads to poor generalization. Additionally, hardware constraints limit the use of large-scale data across environments, and continual learning settings exacerbate the challenge of catastrophic forgetting. To address these issues, we propose the Continual Causal Intervention (C$^{2}$INet) method for generalizable multi-agent trajectory prediction within a continual learning framework. Using variational inference, we align environment-related prior with posterior estimator of confounding factors in the latent space, thereby intervening in causal correlations that affect trajectory representation. Furthermore, we store optimal variational priors across various scenarios using a memory queue, ensuring continuous debiasing during incremental task training. The proposed C$^{2}$INet enhances adaptability to diverse tasks while preserving previous task information to prevent catastrophic forgetting. It also incorporates pruning strategies to mitigate overfitting. Comparative evaluations on three real and synthetic complex datasets against state-of-the-art methods demonstrate that our proposed method consistently achieves reliable prediction performance, effectively mitigating confounding factors unique to different scenarios. This highlights the practical value of our method for real-world applications. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12313v1-abstract-full').style.display = 'none'; document.getElementById('2411.12313v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.11364</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Continual Task Learning through Adaptive Policy Self-Composition </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Hu%2C+S">Shengchao Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yuhang Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Ziqing Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+J">Jifeng Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+L">Li Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Ya Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Tao%2C+D">Dacheng Tao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.11364v1-abstract-short" style="display: inline;"> Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a 
significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously l&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11364v1-abstract-full').style.display = 'inline'; document.getElementById('2411.11364v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.11364v1-abstract-full" style="display: none;"> Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network. Upon encountering a new task, CompoFormer leverages semantic correlations to selectively integrate relevant prior policies alongside newly trained parameters, thereby enhancing knowledge sharing and accelerating the learning process. 
Our experiments reveal that CompoFormer outperforms conventional CL methods, particularly in longer task sequences, showcasing a promising balance between plasticity and stability. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11364v1-abstract-full').style.display = 'none'; document.getElementById('2411.11364v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">21 pages, 8 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10791</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Science and Game Theory">cs.GT</span> </div> </div> <p class="title is-5 mathjax"> Optimal Fixed-Price Mechanism with Signaling </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhikang Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+W">Weiran Shen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10791v1-abstract-short" style="display: inline;"> Consider a trade market with one seller and multiple buyers. The seller aims to sell an indivisible item and maximize their revenue. This paper focuses on a simple and popular mechanism--the fixed-price mechanism. 
Unlike the standard setting, we assume there is information asymmetry between buyers and the seller. Specifically, we allow the seller to design information before setting the fixed pric&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10791v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10791v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10791v1-abstract-full" style="display: none;"> Consider a trade market with one seller and multiple buyers. The seller aims to sell an indivisible item and maximize their revenue. This paper focuses on a simple and popular mechanism--the fixed-price mechanism. Unlike the standard setting, we assume there is information asymmetry between buyers and the seller. Specifically, we allow the seller to design information before setting the fixed price, which implies that we study the mechanism design problem in a broader space. We call this mechanism space the fixed-price signaling mechanism. We assume that buyers&#39; valuation of the item depends on the quality of the item. The seller can privately observe the item&#39;s quality, whereas buyers only see its distribution. In this case, the seller can influence buyers&#39; valuations by strategically disclosing information about the item&#39;s quality, thereby adjusting the fixed price. We consider two types of buyers with different levels of rationality: ex-post individual rational (IR) and ex-interim individual rational. We show that when the market has only one buyer, the optimal revenue generated by the fixed-price signaling mechanism is identical to that of the fixed-price mechanism, regardless of the level of rationality. Furthermore, when there are multiple buyers in the market and all of them are ex-post IR, we show that there is no fixed-price mechanism that is obedient for all buyers. 
However, if all buyers are ex-interim IR, we show that the optimal fixed-price signaling mechanism will generate more revenue for the seller than the fixed-price mechanism. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10791v1-abstract-full').style.display = 'none'; document.getElementById('2411.10791v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07965</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kong%2C+C">Chuyi Kong</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+Z">Ziyang Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+H">Hongzhan Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiyuan Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Y">Yaxin Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+Y">Yuxi Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+J">Jing Ma</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07965v1-abstract-short" style="display: inline;"> The advanced 
role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess charac&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07965v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07965v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07965v1-abstract-full" style="display: none;"> The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess character preferences, face limitations like poor generalizability, implicit and inaccurate judgments, and excessive context length. To address the above issues, we propose an automatic, scalable, and generalizable paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph and leverage RPA&#39;s inherent hallucination properties to prompt it to interact across roles, employing ChatGPT for stance detection and defining relationship hallucination along with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore factors influencing these metrics and discuss the trade-off between relationship hallucination and factuality. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07965v1-abstract-full').style.display = 'none'; document.getElementById('2411.07965v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07488</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Science and Game Theory">cs.GT</span> </div> </div> <p class="title is-5 mathjax"> Selling an Item through Persuasion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhikang Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+W">Weiran Shen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07488v1-abstract-short" style="display: inline;"> A monopolistic seller aims to sell an indivisible item to multiple potential buyers. Each buyer&#39;s valuation depends on their private type and the item&#39;s quality. The seller can observe the quality but it is unknown to buyers. 
This quality information is valuable to buyers, so it is beneficial for the seller to strategically design experiments that reveal information about the quality before decidi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07488v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07488v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07488v1-abstract-full" style="display: none;"> A monopolistic seller aims to sell an indivisible item to multiple potential buyers. Each buyer&#39;s valuation depends on their private type and the item&#39;s quality. The seller can observe the quality but it is unknown to buyers. This quality information is valuable to buyers, so it is beneficial for the seller to strategically design experiments that reveal information about the quality before deciding to sell the item to whom and at what price. We study the problem of designing a revenue-maximizing mechanism that allows the seller to disclose information and sell the item. First, we recast the revelation principle to our setting, showing that the seller can focus on one-round mechanisms without loss of generality. We then formulate the mechanism design problem as an optimization problem and derive the optimal solution in closed form. The optimal mechanism includes a set of experiments and payment functions. After eliciting buyers&#39; types, the optimal mechanism asks a buyer to buy and sets a price accordingly. The optimal information structure involves partitioning the quality space. Additionally, we show that our results can be extended to a broader class of distributions and valuation functions. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07488v1-abstract-full').style.display = 'none'; document.getElementById('2411.07488v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.01146</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Task-Aware Harmony Multi-Task Decision Transformer for Offline Reinforcement Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Ziqing Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+S">Shengchao Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yuhang Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+L">Li Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Ya Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yanfeng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Tao%2C+D">Dacheng Tao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.01146v1-abstract-short" style="display: inline;"> The purpose of offline multi-task reinforcement learning (MTRL) is to develop a unified policy applicable to diverse tasks without the 
need for online environmental interaction. Recent advancements approach this through sequence modeling, leveraging the Transformer architecture&#39;s scalability and the benefits of parameter sharing to exploit task similarities. However, variations in task content and&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.01146v1-abstract-full').style.display = 'inline'; document.getElementById('2411.01146v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.01146v1-abstract-full" style="display: none;"> The purpose of offline multi-task reinforcement learning (MTRL) is to develop a unified policy applicable to diverse tasks without the need for online environmental interaction. Recent advancements approach this through sequence modeling, leveraging the Transformer architecture&#39;s scalability and the benefits of parameter sharing to exploit task similarities. However, variations in task content and complexity pose significant challenges in policy formulation, necessitating judicious parameter sharing and management of conflicting gradients for optimal policy performance. Furthermore, identifying the optimal parameter subspace for each task often necessitates prior knowledge of the task identifier during inference, limiting applicability in real-world scenarios with variable task content and unknown current tasks. In this work, we introduce the Harmony Multi-Task Decision Transformer (HarmoDT), a novel solution designed to identify an optimal harmony subspace of parameters for each task. We formulate this as a bi-level optimization problem within a meta-learning framework, where the upper level learns masks to define the harmony subspace, while the inner level focuses on updating parameters to improve the overall performance of the unified policy. 
To eliminate the need for task identifiers, we further design a group-wise variant (G-HarmoDT) that clusters tasks into coherent groups based on gradient information, and utilizes a gating network to determine task identifiers during inference. Empirical evaluations across various benchmarks highlight the superiority of our approach, demonstrating its effectiveness in the multi-task context with specific improvements of 8% gain in task-provided settings, 5% in task-agnostic settings, and 10% in unseen settings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.01146v1-abstract-full').style.display = 'none'; document.getElementById('2411.01146v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Extension of corresponding ICML edition arXiv:2405.18080. 
arXiv admin note: substantial text overlap with arXiv:2405.18080</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.00734</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Hardware Architecture">cs.AR</span> </div> </div> <p class="title is-5 mathjax"> Multilayer Dataflow based Butterfly Sparsity Orchestration to Accelerate Attention Workloads </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wu%2C+H">Haibin Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Wenming Li</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+K">Kai Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhihua Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+T">Tianyu Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yuqun Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yanhuan Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Qiang%2C+Z">Ziqing Qiang</a>, <a href="/search/cs?searchtype=author&amp;query=Ye%2C+X">Xiaochun Ye</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+D">Dongrui Fan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.00734v1-abstract-short" style="display: inline;"> Recent neural networks (NNs) with self-attention exhibit competitiveness across different AI domains, but the essential attention mechanism brings massive computation and memory demands. 
To this end, various sparsity patterns are introduced to reduce the quadratic computation complexity, among which the structured butterfly sparsity has been proven efficient in computation reduction while maintain&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00734v1-abstract-full').style.display = 'inline'; document.getElementById('2411.00734v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.00734v1-abstract-full" style="display: none;"> Recent neural networks (NNs) with self-attention exhibit competitiveness across different AI domains, but the essential attention mechanism brings massive computation and memory demands. To this end, various sparsity patterns are introduced to reduce the quadratic computation complexity, among which the structured butterfly sparsity has been proven efficient in computation reduction while maintaining model accuracy. However, its complicated data accessing pattern brings utilization degradation and makes parallelism hard to exploit in general block-oriented architecture like GPU. Since the reconfigurable dataflow architecture is known to have better data reusability and architectural flexibility in general NN-based acceleration, we want to apply it to the butterfly sparsity for acquiring better computational efficiency for attention workloads. We first propose a hybrid butterfly-sparsity network to obtain better trade-offs between attention accuracy and performance. Next, we propose a scalable multilayer dataflow method supported by coarse-grained streaming parallelism designs, to orchestrate the butterfly sparsity computation on the dataflow array. The experiments show that compared with Jetson Xavier NX, our design has a speedup of up to $14.34\times$ ($9.29\times$ on average) as well as $11.14\times$ energy efficiency advancement in attention workloads. 
In comparison with SOTA attention accelerators of the same peak performance, our dataflow architecture acquires $2.38\times$-$4.7\times$ efficiency improvement as well as $6.60\times$-$15.37\times$ energy reduction with butterfly sparsity optimization. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00734v1-abstract-full').style.display = 'none'; document.getElementById('2411.00734v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">9 pages, 17 figures, ICCAD 2024, 2024/07/05, Butterfly Sparsity Optimization Using Dataflow</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.00418</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Self-Evolved Reward Learning for LLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Huang%2C+C">Chenghua Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhizhen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+L">Lu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+F">Fangkai Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+P">Pu Zhao</a>, <a 
href="/search/cs?searchtype=author&amp;query=Lin%2C+Z">Zeqi Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Q">Qingwei Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+D">Dongmei Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Rajmohan%2C+S">Saravan Rajmohan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Q">Qi Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.00418v1-abstract-short" style="display: inline;"> Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00418v1-abstract-full').style.display = 'inline'; document.getElementById('2411.00418v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.00418v1-abstract-full" style="display: none;"> Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems. These methods can be costly and may introduce biases that affect the language model&#39;s responses. As language models improve, human input may become less effective in further enhancing their performance. 
In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conduct extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compare SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs). <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00418v1-abstract-full').style.display = 'none'; document.getElementById('2411.00418v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">19 pages, 6 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.00073</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Databases">cs.DB</span> </div> </div> <p class="title is-5 mathjax"> RSL-SQL: Robust Schema Linking in Text-to-SQL Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Cao%2C+Z">Zhenbiao Cao</a>, <a 
href="/search/cs?searchtype=author&amp;query=Zheng%2C+Y">Yuanlei Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhihao Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xiaojin Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+W">Wei Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.00073v1-abstract-short" style="display: inline;"> Text-to-SQL generation aims to translate natural language questions into SQL statements. In Text-to-SQL based on large language models (LLMs), schema linking is a widely adopted strategy to streamline the input for LLMs by selecting only relevant schema elements, thereby reducing noise and computational overhead. However, schema linking faces risks that require caution, including the potential omi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00073v1-abstract-full').style.display = 'inline'; document.getElementById('2411.00073v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.00073v1-abstract-full" style="display: none;"> Text-to-SQL generation aims to translate natural language questions into SQL statements. In Text-to-SQL based on large language models (LLMs), schema linking is a widely adopted strategy to streamline the input for LLMs by selecting only relevant schema elements, thereby reducing noise and computational overhead. However, schema linking faces risks that require caution, including the potential omission of necessary elements and disruption of database structural integrity. To address these challenges, we propose a novel framework called RSL-SQL that combines bidirectional schema linking, contextual information augmentation, a binary selection strategy, and multi-turn self-correction. 
Our approach improves the recall of schema linking through forward and backward pruning and hedges the risk by voting between the full schema and a simplified schema augmented with contextual information. Experiments on the BIRD and Spider benchmarks demonstrate that our approach achieves state-of-the-art execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on Spider using GPT-4o. Furthermore, our approach outperforms a series of GPT-4 based Text-to-SQL systems when adopting the much cheaper DeepSeek with the same unmodified prompts. Extensive analysis and ablation studies confirm the effectiveness of each component in our framework. The code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00073v1-abstract-full').style.display = 'none'; document.getElementById('2411.00073v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 31 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.23398</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Science and Game Theory">cs.GT</span> </div> </div> <p class="title is-5 mathjax"> On the Optimality of Dilated Entropy and Lower Bounds for Online Learning in Extensive-Form Games </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiyuan Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Kroer%2C+C">Christian Kroer</a>, <a href="/search/cs?searchtype=author&amp;query=Farina%2C+G">Gabriele Farina</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.23398v1-abstract-short" style="display: inline;"> First-order methods (FOMs) are arguably the most scalable algorithms for equilibrium computation in large extensive-form games. To operationalize these methods, a distance-generating function, acting as a regularizer for the strategy space, must be chosen. The ratio between the strong convexity modulus and the diameter of the regularizer is a key parameter in the analysis of FOMs. 
A natural questi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23398v1-abstract-full').style.display = 'inline'; document.getElementById('2410.23398v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.23398v1-abstract-full" style="display: none;"> First-order methods (FOMs) are arguably the most scalable algorithms for equilibrium computation in large extensive-form games. To operationalize these methods, a distance-generating function, acting as a regularizer for the strategy space, must be chosen. The ratio between the strong convexity modulus and the diameter of the regularizer is a key parameter in the analysis of FOMs. A natural question is then: what is the optimal distance-generating function for extensive-form decision spaces? In this paper, we make a number of contributions, ultimately establishing that the weight-one dilated entropy (DilEnt) distance-generating function is optimal up to logarithmic factors. The DilEnt regularizer is notable due to its iterate-equivalence with Kernelized OMWU (KOMWU) -- the algorithm with state-of-the-art dependence on the game tree size in extensive-form games -- when used in conjunction with the online mirror descent (OMD) algorithm. However, the standard analysis for OMD is unable to establish such a result; the only current analysis is by appealing to the iterate equivalence to KOMWU. We close this gap by introducing a pair of primal-dual treeplex norms, which we contend form the natural analytic viewpoint for studying the strong convexity of DilEnt. Using these norm pairs, we recover the diameter-to-strong-convexity ratio that predicts the same performance as KOMWU. Along with a new regret lower bound for online learning in sequence-form strategy spaces, we show that this ratio is nearly optimal. 
Finally, we showcase our analytic techniques by refining the analysis of Clairvoyant OMD when paired with DilEnt, establishing an $\mathcal{O}(n \log |\mathcal{V}| \log T/T)$ approximation rate to coarse correlated equilibrium in $n$-player games, where $|\mathcal{V}|$ is the number of reduced normal-form strategies of the players, establishing the new state of the art. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23398v1-abstract-full').style.display = 'none'; document.getElementById('2410.23398v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.20891</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Science and Game Theory">cs.GT</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.24963/ijcai.2023/300 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Revenue Maximization Mechanisms for an Uninformed Mediator with Communication Abilities </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhikang Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+W">Weiran Shen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark 
mathjax" id="2410.20891v1-abstract-short" style="display: inline;"> Consider a market where a seller owns an item for sale and a buyer wants to purchase it. Each player has private information, known as their type. It can be costly and difficult for the players to reach an agreement through direct communication. However, with a mediator as a trusted third party, both players can communicate privately with the mediator without worrying about leaking too much or too&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20891v1-abstract-full').style.display = 'inline'; document.getElementById('2410.20891v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.20891v1-abstract-full" style="display: none;"> Consider a market where a seller owns an item for sale and a buyer wants to purchase it. Each player has private information, known as their type. It can be costly and difficult for the players to reach an agreement through direct communication. However, with a mediator as a trusted third party, both players can communicate privately with the mediator without worrying about leaking too much or too little information. The mediator can design and commit to a multi-round communication protocol for both players, in which they update their beliefs about the other player&#39;s type. The mediator cannot force the players to trade but can influence their behaviors by sending messages to them. We study the problem of designing revenue-maximizing mechanisms for the mediator. We show that the mediator can, without loss of generality, focus on a set of direct and incentive-compatible mechanisms. We then formulate this problem as a mathematical program and provide an optimal solution in closed form under a regularity condition. Our mechanism is simple and has a threshold structure. 
Additionally, we extend our results to general cases by using a variant of the ironing technique. Finally, we discuss some interesting properties of the optimal mechanism; for example, the mediator may even lose money in some cases. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20891v1-abstract-full').style.display = 'none'; document.getElementById('2410.20891v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> ijcai 10 (2023), 2693 - 2700 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.20815</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Grid4D: 4D Decomposed Hash Encoding for High-fidelity Dynamic Gaussian Splatting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xu%2C+J">Jiawei Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zexin Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+J">Jian Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+J">Jin Xie</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.20815v1-abstract-short" style="display: inline;"> Recently, Gaussian splatting has 
received more and more attention in the field of static scene rendering. Due to the low computational overhead and inherent flexibility of explicit representations, plane-based explicit methods are popular ways to predict deformations for Gaussian-based dynamic scene rendering models. However, plane-based methods rely on the inappropriate low-rank assumption and ex&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20815v1-abstract-full').style.display = 'inline'; document.getElementById('2410.20815v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.20815v1-abstract-full" style="display: none;"> Recently, Gaussian splatting has received more and more attention in the field of static scene rendering. Due to the low computational overhead and inherent flexibility of explicit representations, plane-based explicit methods are popular ways to predict deformations for Gaussian-based dynamic scene rendering models. However, plane-based methods rely on the inappropriate low-rank assumption and excessively decompose the space-time 4D encoding, resulting in overmuch feature overlap and unsatisfactory rendering quality. To tackle these problems, we propose Grid4D, a dynamic scene rendering model based on Gaussian splatting and employing a novel explicit encoding method for the 4D input through the hash encoding. Different from plane-based explicit representations, we decompose the 4D encoding into one spatial and three temporal 3D hash encodings without the low-rank assumption. Additionally, we design a novel attention module that generates the attention scores in a directional range to aggregate the spatial and temporal features. The directional attention enables Grid4D to more accurately fit the diverse deformations across distinct scene components based on the spatial encoded features. 
Moreover, to mitigate the inherent lack of smoothness in explicit representation methods, we introduce a smooth regularization term that keeps our model from the chaos of deformation prediction. Our experiments demonstrate that Grid4D significantly outperforms the state-of-the-art models in visual quality and rendering speed. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20815v1-abstract-full').style.display = 'none'; document.getElementById('2410.20815v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by NeurIPS 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.19317</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiting Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+R">Ruizhe Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+T">Tianxiang Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zuozhu Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2410.19317v1-abstract-short" style="display: inline;"> The growing use of large language model (LLM)-based chatbots has raised concerns about fairness. Fairness issues in LLMs can lead to severe consequences, such as bias amplification, discrimination, and harm to marginalized communities. While existing fairness benchmarks mainly focus on single-turn dialogues, multi-turn scenarios, which in fact better reflect real-world conversations, present great&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.19317v1-abstract-full').style.display = 'inline'; document.getElementById('2410.19317v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.19317v1-abstract-full" style="display: none;"> The growing use of large language model (LLM)-based chatbots has raised concerns about fairness. Fairness issues in LLMs can lead to severe consequences, such as bias amplification, discrimination, and harm to marginalized communities. While existing fairness benchmarks mainly focus on single-turn dialogues, multi-turn scenarios, which in fact better reflect real-world conversations, present greater challenges due to conversational complexity and potential bias accumulation. In this paper, we propose a comprehensive fairness benchmark for LLMs in multi-turn dialogue scenarios, FairMT-Bench. Specifically, we formulate a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. To ensure coverage of diverse bias types and attributes, we draw from existing fairness datasets and employ our template to construct a multi-turn dialogue dataset, FairMT-10K. For evaluation, GPT-4 is applied, alongside bias classifiers including Llama-Guard-3 and human validation to ensure robustness. 
Experiments and analyses on FairMT-10K reveal that in multi-turn dialogue scenarios, current LLMs are more likely to generate biased responses, and there is significant variation in performance across different tasks and models. Based on this, we curate a challenging dataset, FairMT-1K, and test 15 current state-of-the-art (SOTA) LLMs on this dataset. The results show the current state of fairness in LLMs and showcase the utility of this novel approach for assessing fairness in more realistic multi-turn dialogue contexts, calling for future work to focus on LLM fairness improvement and the adoption of FairMT-1K in such efforts. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.19317v1-abstract-full').style.display = 'none'; document.getElementById('2410.19317v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.18956</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Large Spatial Model: End-to-end Unposed Images to Semantic 3D </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiwen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jian Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Cong%2C+W">Wenyan Cong</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+P">Peihao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+R">Renjie Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wen%2C+K">Kairun Wen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+S">Shijie Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Kadambi%2C+A">Achuta Kadambi</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhangyang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+D">Danfei Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Ivanovic%2C+B">Boris Ivanovic</a>, <a href="/search/cs?searchtype=author&amp;query=Pavone%2C+M">Marco Pavone</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yue Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.18956v2-abstract-short" style="display: inline;"> Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. 
Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizin&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18956v2-abstract-full').style.display = 'inline'; document.getElementById('2410.18956v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.18956v2-abstract-full" style="display: none;"> Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. 
To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18956v2-abstract-full').style.display = 'none'; document.getElementById('2410.18956v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project Website:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.17577</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Hardware Architecture">cs.AR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Operating Systems">cs.OS</span> </div> </div> <p class="title is-5 mathjax"> Arcus: SLO Management for Accelerators in the Cloud with Traffic Shaping </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+J">Jiechen Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Shu%2C+R">Ran Shu</a>, <a href="/search/cs?searchtype=author&amp;query=Lim%2C+K">Katie Lim</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zewen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Anderson%2C+T">Thomas Anderson</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+M">Mingyu Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Jerger%2C+N+E">Natalie Enright Jerger</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.17577v1-abstract-short" style="display: inline;"> Cloud servers use accelerators for common tasks (e.g., encryption, compression, hashing) to improve CPU/GPU efficiency and overall performance. However, users&#39; Service-level Objectives (SLOs) can be violated due to accelerator-related contention. 
The root cause is that existing solutions for accelerators only focus on isolation or fair allocation of compute and memory resources; they overlook the&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17577v1-abstract-full').style.display = 'inline'; document.getElementById('2410.17577v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.17577v1-abstract-full" style="display: none;"> Cloud servers use accelerators for common tasks (e.g., encryption, compression, hashing) to improve CPU/GPU efficiency and overall performance. However, users&#39; Service-level Objectives (SLOs) can be violated due to accelerator-related contention. The root cause is that existing solutions for accelerators only focus on isolation or fair allocation of compute and memory resources; they overlook the contention for communication-related resources. Specifically, three communication-induced challenges drive us to re-think the problem: (1) accelerator traffic patterns are diverse, hard to predict, and mixed across users, (2) communication-related components lack effective, configurable low-level isolation mechanisms, and (3) the computational heterogeneity of accelerators leads to unique relationships between the traffic mixture and the corresponding accelerator performance. The focus of this work is meeting SLOs in accelerator-rich systems. We present Arcus, which treats accelerator SLO management as traffic management with proactive traffic shaping. We develop an SLO-aware protocol coupled with an offloaded interface on an architecture that supports precise and scalable traffic shaping. We guarantee accelerator SLOs under various circumstances, with up to 45% tail latency reduction and less than 1% throughput variance. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17577v1-abstract-full').style.display = 'none'; document.getElementById('2410.17577v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.17073</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> </div> </div> <p class="title is-5 mathjax"> Personalized Playback Technology: How Short Video Services Create Excellent User Experience </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Deng%2C+W">Weihui Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiwei Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Fu%2C+D">Deliang Fu</a>, <a href="/search/cs?searchtype=author&amp;query=Gong%2C+Y">Yun Gong</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+S">Shenglan Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+X">Xiaocheng Li</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zheng Li</a>, <a href="/search/cs?searchtype=author&amp;query=Liao%2C+Y">Yiting Liao</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">He Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Qiao%2C+C">Chunyu Qiao</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+B">Bin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhen Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+Z">Zhengyu Xiong</a> </p> <p class="abstract 
mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.17073v2-abstract-short" style="display: inline;"> Short-form video content has become increasingly popular and influential in recent years. Its concise yet engaging format aligns well with today&#39;s fast-paced and on-the-go lifestyles, making it a dominant trend in the digital world. As one of the front runners in the short video platform space, ByteDance has been highly successful in delivering a one-of-a-kind short video experience and attracti&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17073v2-abstract-full').style.display = 'inline'; document.getElementById('2410.17073v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.17073v2-abstract-full" style="display: none;"> Short-form video content has become increasingly popular and influential in recent years. Its concise yet engaging format aligns well with today&#39;s fast-paced and on-the-go lifestyles, making it a dominant trend in the digital world. As one of the front runners in the short video platform space, ByteDance has been highly successful in delivering a one-of-a-kind short video experience and attracting billions of users worldwide. One key contributing factor is its advanced end-to-end personalized short video playback technology, a new technical field that we pioneered and developed over the past five years to optimize user experience. This paper introduces the major concepts and methodologies of this personalized video playback technology that distinguish it from traditional multimedia technologies. More details, including goal setting, iterative process, modeling, experimental methods and required supporting systems, are also provided to encourage deeper research in this area. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17073v2-abstract-full').style.display = 'none'; document.getElementById('2410.17073v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 22 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.14970</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Social and Information Networks">cs.SI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computers and Society">cs.CY</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Taming the Long Tail in Human Mobility Prediction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xu%2C+X">Xiaohang Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+R">Renhe Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+C">Chuang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zipei Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Sezaki%2C+K">Kaoru Sezaki</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.14970v3-abstract-short" style="display: 
inline;"> With the popularity of location-based services, human mobility prediction plays a key role in enhancing personalized navigation, optimizing recommendation systems, and facilitating urban mobility and planning. This involves predicting a user&#39;s next POI (point-of-interest) visit using their past visit history. However, the uneven distribution of visitations over time and space, namely the long-tail&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.14970v3-abstract-full').style.display = 'inline'; document.getElementById('2410.14970v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.14970v3-abstract-full" style="display: none;"> With the popularity of location-based services, human mobility prediction plays a key role in enhancing personalized navigation, optimizing recommendation systems, and facilitating urban mobility and planning. This involves predicting a user&#39;s next POI (point-of-interest) visit using their past visit history. However, the uneven distribution of visitations over time and space, namely the long-tail problem in spatial distribution, makes it difficult for AI models to predict those POIs that are less visited by humans. In light of this issue, we propose the Long-Tail Adjusted Next POI Prediction (LoTNext) framework for mobility prediction, combining a Long-Tailed Graph Adjustment module to reduce the impact of the long-tailed nodes in the user-POI interaction graph and a novel Long-Tailed Loss Adjustment module to adjust loss by logit score and sample weight adjustment strategy. Also, we employ the auxiliary prediction task to enhance generalization and accuracy. Our experiments with two real-world trajectory datasets demonstrate that LoTNext significantly surpasses existing state-of-the-art works. 
Our code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.14970v3-abstract-full').style.display = 'none'; document.getElementById('2410.14970v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 19 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by NeurIPS 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.12228</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ma%2C+L">Luyi Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+X">Xiaohan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zezhong Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+J">Jianpeng Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Cho%2C+J">Jason Cho</a>, <a 
href="/search/cs?searchtype=author&amp;query=Kanumala%2C+P">Praveen Kanumala</a>, <a href="/search/cs?searchtype=author&amp;query=Nag%2C+K">Kaushiki Nag</a>, <a href="/search/cs?searchtype=author&amp;query=Kumar%2C+S">Sushant Kumar</a>, <a href="/search/cs?searchtype=author&amp;query=Achan%2C+K">Kannan Achan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.12228v1-abstract-short" style="display: inline;"> Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of triple-modalit&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.12228v1-abstract-full').style.display = 'inline'; document.getElementById('2410.12228v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.12228v1-abstract-full" style="display: none;"> Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of triple-modality, which is visual, textual, and graph data through alignment with large language models (LLMs). 
By incorporating visual information, we capture contextual and aesthetic item characteristics; textual data provides insights into user interests and item features in detail; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model called Triple Modality Fusion (TMF) utilizes the power of LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user&#39;s interactions including behaviors and item features in natural languages. Initially, the LLM is warmed up using only natural language-based prompts. We then devise the modality fusion module based on cross-attention and self-attention mechanisms to integrate different modalities from other models into the same embedding space and incorporate them into an LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate the effectiveness of our model design and benefits of the TMF. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.12228v1-abstract-full').style.display = 'none'; document.getElementById('2410.12228v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.11683</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Science and Game Theory">cs.GT</span> </div> </div> <p class="title is-5 mathjax"> Optimal Mediation Mechanisms in Bilateral Trade </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhikang Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+W">Weiran Shen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.11683v1-abstract-short" style="display: inline;"> Consider a bilateral trade scenario where a seller seeks to sell an item to a buyer through a trusted mediator. The item&#39;s quality is the seller&#39;s private information, and the buyer&#39;s valuation of the item depends on both the quality and the buyer&#39;s type. The mediator, who is uninformed about the private information of both the seller and buyer, aims to design a mechanism that elicits and reveals&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11683v1-abstract-full').style.display = 'inline'; document.getElementById('2410.11683v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.11683v1-abstract-full" style="display: none;"> Consider a bilateral trade scenario where a seller seeks to sell an item to a buyer through a trusted mediator. The item&#39;s quality is the seller&#39;s private information, and the buyer&#39;s valuation of the item depends on both the quality and the buyer&#39;s type. 
The mediator, who is uninformed about the private information of both the seller and buyer, aims to design a mechanism that elicits and reveals information to facilitate communication between the two agents. The mediator can also charge a fee for providing such services. In this work, we study the problem of designing mechanisms that maximize revenue for the mediator. We formulate this mechanism design problem as an optimization problem that involves non-linear constraints. Interestingly, under the monotone hazard rate assumption, we can bypass this issue by considering a relaxed problem and showing that the solution to the relaxed problem remains optimal for the original one. In optimal mechanisms, the mediator directly recommends whether to trade after eliciting the agents&#39; types. The mediator privately offers a price to each agent if a trade is recommended. The optimal mechanism adopts a threshold information structure, i.e., it only reveals to the agent whether the other agent&#39;s type exceeds a certain threshold. The optimal payment function of the buyer is monotonically decreasing in their type, which differs from most existing works. Finally, we discuss some interesting observations revealed by the optimal mechanism. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11683v1-abstract-full').style.display = 'none'; document.getElementById('2410.11683v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.04521</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wei%2C+L">Lai Wei</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wenkai Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+X">Xiaoyu Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+Y">Yu Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhihao Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xiaojin Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Wei%2C+Z">Zhongyu Wei</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+W">Wei Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.04521v1-abstract-short" style="display: inline;"> In recent advancements, multimodal large language models (MLLMs) have been fine-tuned on specific medical image datasets to address medical visual question answering (Med-VQA) tasks. However, this common approach of task-specific fine-tuning is costly and necessitates separate models for each downstream task, limiting the exploration of zero-shot capabilities. 
In this paper, we introduce MC-CoT, a&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.04521v1-abstract-full').style.display = 'inline'; document.getElementById('2410.04521v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.04521v1-abstract-full" style="display: none;"> In recent advancements, multimodal large language models (MLLMs) have been fine-tuned on specific medical image datasets to address medical visual question answering (Med-VQA) tasks. However, this common approach of task-specific fine-tuning is costly and necessitates separate models for each downstream task, limiting the exploration of zero-shot capabilities. In this paper, we introduce MC-CoT, a modular cross-modal collaboration Chain-of-Thought (CoT) framework designed to enhance the zero-shot performance of MLLMs in Med-VQA by leveraging large language models (LLMs). MC-CoT improves reasoning and information extraction by integrating medical knowledge and task-specific guidance, where LLM provides various complex medical reasoning chains and MLLM provides various observations of medical images based on instructions of the LLM. Our experiments on datasets such as SLAKE, VQA-RAD, and PATH-VQA show that MC-CoT surpasses standalone MLLMs and various multimodality CoT frameworks in recall rate and accuracy. These findings highlight the importance of incorporating background information and detailed guidance in addressing complex zero-shot Med-VQA tasks. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.04521v1-abstract-full').style.display = 'none'; document.getElementById('2410.04521v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">21 pages, 14 figures, 6 tables</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.00990</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+J">Jian Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+X">Xukun Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wentao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+G">Guoming Li</a>, <a href="/search/cs?searchtype=author&amp;query=Fang%2C+Q">Qihang Fang</a>, <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+R">Ruihong Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+T">Tianyang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+J+Z">Jason Zhaoxin Fan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2410.00990v1-abstract-short" style="display: inline;"> Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.00990v1-abstract-full').style.display = 'inline'; document.getElementById('2410.00990v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.00990v1-abstract-full" style="display: none;"> Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz Continuity, we have theoretically established the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be temporally consistently recovered by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) we introduced, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both the conventional dataset and the High-Frequency TalKing head (HFTK) dataset that we curated. 
The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.00990v1-abstract-full').style.display = 'none'; document.getElementById('2410.00990v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.00629</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> An Illumination-Robust Feature Extractor Augmented by Relightable 3D Reconstruction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+S">Shunyi Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+Z">Zehuan Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zuxin Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Z">Zhihao Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Ruan%2C+L">Lecheng Ruan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q">Qining Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.00629v1-abstract-short" style="display: inline;"> Visual features, whose description often relies on the local intensity and gradient direction, have found wide applications in robot navigation and 
localization in recent years. However, the extraction of visual features is usually disturbed by the variation of illumination conditions, making it challenging for real-world applications. Previous works have addressed this issue by establishing datas&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.00629v1-abstract-full').style.display = 'inline'; document.getElementById('2410.00629v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.00629v1-abstract-full" style="display: none;"> Visual features, whose description often relies on the local intensity and gradient direction, have found wide applications in robot navigation and localization in recent years. However, the extraction of visual features is usually disturbed by the variation of illumination conditions, making it challenging for real-world applications. Previous works have addressed this issue by establishing datasets with variations in illumination conditions, but this can be costly and time-consuming. This paper proposes a design procedure for an illumination-robust feature extractor, where the recently developed relightable 3D reconstruction techniques are adopted for rapid and direct data generation with varying illumination conditions. A self-supervised framework is proposed for extracting features with advantages in repeatability for key points and similarity for descriptors across good and bad illumination conditions. Experiments are conducted to demonstrate the effectiveness of the proposed method for robust feature extraction. Ablation studies also indicate the effectiveness of the self-supervised framework design. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.00629v1-abstract-full').style.display = 'none'; document.getElementById('2410.00629v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.19872</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Pan%2C+K">Kaihang Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhaoyu Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+J">Juncheng Li</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+Q">Qifan Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Fei%2C+H">Hao Fei</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+S">Siliang Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Hong%2C+R">Richang Hong</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+H">Hanwang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+Q">Qianru Sun</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.19872v3-abstract-short" style="display: inline;"> The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for 
effective knowledge editing. Current methods, including intrinsic knowledge editing and external knowledge resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel mul&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.19872v3-abstract-full').style.display = 'inline'; document.getElementById('2409.19872v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.19872v3-abstract-full" style="display: none;"> The swift advancement in Multimodal LLMs (MLLMs) also presents significant challenges for effective knowledge editing. Current methods, including intrinsic knowledge editing and external knowledge resorting, each possess strengths and weaknesses, struggling to balance the desired properties of reliability, generality, and locality when applied to MLLMs. In this paper, we propose UniKE, a novel multimodal editing method that establishes a unified perspective and paradigm for intrinsic knowledge editing and external knowledge resorting. Both types of knowledge are conceptualized as vectorized key-value memories, with the corresponding editing processes resembling the assimilation and accommodation phases of human cognition, conducted at the same semantic levels. Within such a unified framework, we further promote knowledge collaboration by disentangling the knowledge representations into the semantic and truthfulness spaces. Extensive experiments validate the effectiveness of our method, which ensures that the post-edit MLLM simultaneously maintains excellent reliability, generality, and locality. The code for UniKE is available at \url{}. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.19872v3-abstract-full').style.display = 'none'; document.getElementById('2409.19872v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 29 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by NeurIPS 2024 (Spotlight)</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.15386</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Coverage and Bias of Street View Imagery in Mapping the Urban Environment </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zicheng Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+C">Chen-Chieh Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Biljecki%2C+F">Filip Biljecki</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.15386v1-abstract-short" style="display: inline;"> Street View Imagery (SVI) has emerged as a valuable data form in urban studies, enabling new ways to map and sense urban environments. 
However, fundamental concerns regarding the representativeness, quality, and reliability of SVI remain underexplored, e.g., to what extent can cities be captured by such data and do data gaps result in bias. This research, positioned at the intersection of spatial&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.15386v1-abstract-full').style.display = 'inline'; document.getElementById('2409.15386v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.15386v1-abstract-full" style="display: none;"> Street View Imagery (SVI) has emerged as a valuable data form in urban studies, enabling new ways to map and sense urban environments. However, fundamental concerns regarding the representativeness, quality, and reliability of SVI remain underexplored, e.g., to what extent can cities be captured by such data and do data gaps result in bias. This research, positioned at the intersection of spatial data quality and urban analytics, addresses these concerns by proposing a novel workflow to estimate SVI&#39;s feature-level coverage of the urban environment. The workflow integrates the positional relationships between SVI and target features, as well as the impact of environmental obstructions. Expanding the domain of data quality to SVI, we introduce an indicator system that evaluates the extent of coverage, focusing on the completeness and frequency dimensions. Using London as a case study, three experiments are conducted to identify potential biases in SVI&#39;s ability to cover and represent urban features, with a focus on building facades. The research highlights the limitations of traditional spatial data quality metrics in assessing SVI, and the variability of SVI coverage under different data acquisition practices. Tailored approaches that consider the unique metadata and horizontal perspective of SVI are also underscored. 
The findings suggest that while SVI offers valuable insights, it is no panacea -- its application in urban research requires careful consideration of data coverage and feature-level representativeness to ensure reliable results. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.15386v1-abstract-full').style.display = 'none'; document.getElementById('2409.15386v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.14682</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Robust Training Objectives Improve Embedding-based Retrieval in Industrial Recommendation Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kolodner%2C+M">Matthew Kolodner</a>, <a href="/search/cs?searchtype=author&amp;query=Ju%2C+M">Mingxuan Ju</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zihao Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+T">Tong Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Ghazizadeh%2C+E">Elham Ghazizadeh</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+Y">Yan Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Shah%2C+N">Neil Shah</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yozen Liu</a> </p> <p class="abstract 
mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.14682v1-abstract-short" style="display: inline;"> Improving recommendation systems (RS) can greatly enhance the user experience across many domains, such as social media. Many RS utilize embedding-based retrieval (EBR) approaches to retrieve candidates for recommendation. In an EBR system, the embedding quality is key. According to recent literature, self-supervised multitask learning (SSMTL) has shown strong performance on academic benchmarks i&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.14682v1-abstract-full').style.display = 'inline'; document.getElementById('2409.14682v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.14682v1-abstract-full" style="display: none;"> Improving recommendation systems (RS) can greatly enhance the user experience across many domains, such as social media. Many RS utilize embedding-based retrieval (EBR) approaches to retrieve candidates for recommendation. In an EBR system, the embedding quality is key. According to recent literature, self-supervised multitask learning (SSMTL) has shown strong performance on academic benchmarks in embedding learning and resulted in an overall improvement in multiple downstream tasks, demonstrating a larger resilience to the adverse conditions between each downstream task and thereby increased robustness and task generalization ability through the training objective. However, whether or not the success of SSMTL in academia as a robust training objective translates to large-scale (i.e., over hundreds of millions of users and interactions in-between) industrial RS still requires verification. Simply adopting academic setups in industrial RS might entail two issues. 
Firstly, many self-supervised objectives require data augmentations (e.g., embedding masking/corruption) over a large portion of users and items, which is prohibitively expensive in industrial RS. Furthermore, some self-supervised objectives might not align with the recommendation task, which might lead to redundant computational overheads or negative transfer. In light of these two challenges, we evaluate using a robust training objective, specifically SSMTL, through a large-scale friend recommendation system on a social media platform in the tech sector, identifying whether this increase in robustness can work at scale in enhancing retrieval in the production setting. Through online A/B testing with SSMTL-based EBR, we observe statistically significant increases in key metrics in the friend recommendations, with up to 5.45% improvements in new friends made and 1.91% improvements in new friends made with cold-start users. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.14682v1-abstract-full').style.display = 'none'; document.getElementById('2409.14682v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">RobustRecSys workshop @ RecSys 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.12191</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Qwen2-VL: Enhancing Vision-Language Model&#39;s Perception of the World at Any Resolution </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+P">Peng Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Bai%2C+S">Shuai Bai</a>, <a href="/search/cs?searchtype=author&amp;query=Tan%2C+S">Sinan Tan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Shijie Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhihao Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Bai%2C+J">Jinze Bai</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+K">Keqin Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xuejing Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Jialin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Ge%2C+W">Wenbin Ge</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Y">Yang Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Dang%2C+K">Kai Dang</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+M">Mengfei Du</a>, <a 
href="/search/cs?searchtype=author&amp;query=Ren%2C+X">Xuancheng Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Men%2C+R">Rui Men</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+D">Dayiheng Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+C">Chang Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+J">Jingren Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+J">Junyang Lin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.12191v2-abstract-short" style="display: inline;"> We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more eff&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.12191v2-abstract-full').style.display = 'inline'; document.getElementById('2409.12191v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.12191v2-abstract-full" style="display: none;"> We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. 
The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model&#39;s visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at . <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.12191v2-abstract-full').style.display = 'none'; document.getElementById('2409.12191v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 18 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Code is available at arXiv admin note: text overlap with arXiv:2408.15262 by other authors</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.09740</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> VGG-Tex: A Vivid Geometry-Guided Facial Texture Estimation Model for High Fidelity Monocular 3D Face Reconstruction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wu%2C+H">Haoyu Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Peng%2C+Z">Ziqiao Peng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+X">Xukun Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+Y">Yunfei Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+J">Jun He</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Hongyan Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhaoxin Fan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.09740v2-abstract-short" style="display: inline;"> 3D face reconstruction from monocular images has promoted the development of various applications such as augmented reality. Though existing methods have made remarkable progress, most of them emphasize geometric reconstruction, while overlooking the importance of texture prediction. 
To address this issue, we propose VGG-Tex, a novel Vivid Geometry-Guided Facial Texture Estimation model designed f&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09740v2-abstract-full').style.display = 'inline'; document.getElementById('2409.09740v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.09740v2-abstract-full" style="display: none;"> 3D face reconstruction from monocular images has promoted the development of various applications such as augmented reality. Though existing methods have made remarkable progress, most of them emphasize geometric reconstruction, while overlooking the importance of texture prediction. To address this issue, we propose VGG-Tex, a novel Vivid Geometry-Guided Facial Texture Estimation model designed for High Fidelity Monocular 3D Face Reconstruction. The core of this approach is leveraging 3D parametric priors to enhance the outcomes of 2D UV texture estimation. Specifically, VGG-Tex includes a Facial Attributes Encoding Module, a Geometry-Guided Texture Generator, and a Visibility-Enhanced Texture Completion Module. These components are responsible for extracting parametric priors, generating initial textures, and refining texture details, respectively. Based on the geometry-texture complementarity principle, VGG-Tex also introduces a Texture-guided Geometry Refinement Module to further balance the overall fidelity of the reconstructed 3D faces, along with corresponding losses. Comprehensive experiments demonstrate that our method significantly improves texture reconstruction performance compared to existing state-of-the-art methods. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09740v2-abstract-full').style.display = 'none'; document.getElementById('2409.09740v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.06559</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Optimization and Control">math.OC</span> </div> </div> <p class="title is-5 mathjax"> Learn2Aggregate: Supervised Generation of Chvátal-Gomory Cuts Using Graph Neural Networks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Deza%2C+A">Arnaud Deza</a>, <a href="/search/cs?searchtype=author&amp;query=Khalil%2C+E+B">Elias B. 
Khalil</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhenan Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Z">Zirui Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yong Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.06559v1-abstract-short" style="display: inline;"> We present $\textit{Learn2Aggregate}$, a machine learning (ML) framework for optimizing the generation of Chvátal-Gomory (CG) cuts in mixed integer linear programming (MILP). The framework trains a graph neural network to classify useful constraints for aggregation in CG cut generation. The ML-driven CG separator selectively focuses on a small set of impactful constraints, improving runtimes witho&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.06559v1-abstract-full').style.display = 'inline'; document.getElementById('2409.06559v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.06559v1-abstract-full" style="display: none;"> We present $\textit{Learn2Aggregate}$, a machine learning (ML) framework for optimizing the generation of Chvátal-Gomory (CG) cuts in mixed integer linear programming (MILP). The framework trains a graph neural network to classify useful constraints for aggregation in CG cut generation. The ML-driven CG separator selectively focuses on a small set of impactful constraints, improving runtimes without compromising the strength of the generated cuts. Key to our approach is the formulation of a constraint classification task which favours sparse aggregation of constraints, consistent with empirical findings. 
This, in conjunction with a careful constraint labeling scheme and a hybrid of deep learning and feature engineering, results in enhanced CG cut generation across five diverse MILP benchmarks. On the largest test sets, our method closes roughly $\textit{twice}$ as much of the integrality gap as the standard CG method while running 40% faster. This performance improvement is due to our method eliminating 75% of the constraints prior to aggregation. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.06559v1-abstract-full').style.display = 'none'; document.getElementById('2409.06559v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">12 pages, 8 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.04968</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1109/TIFS.2024.3421893 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Natias: Neuron Attribution based Transferable Image Adversarial 
Steganography </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zexin Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+K">Kejiang Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Zeng%2C+K">Kai Zeng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jiansong Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+W">Weiming Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+N">Nenghai Yu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.04968v1-abstract-short" style="display: inline;"> Image steganography is a technique to conceal secret messages within digital images. Steganalysis, on the contrary, aims to detect the presence of secret messages within images. Recently, deep-learning-based steganalysis methods have achieved excellent detection performance. As a countermeasure, adversarial steganography has garnered considerable attention due to its ability to effectively deceive&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04968v1-abstract-full').style.display = 'inline'; document.getElementById('2409.04968v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.04968v1-abstract-full" style="display: none;"> Image steganography is a technique to conceal secret messages within digital images. Steganalysis, on the contrary, aims to detect the presence of secret messages within images. Recently, deep-learning-based steganalysis methods have achieved excellent detection performance. As a countermeasure, adversarial steganography has garnered considerable attention due to its ability to effectively deceive deep-learning-based steganalysis. 
However, steganalysts often employ unknown steganalytic models for detection. Therefore, the ability of adversarial steganography to deceive non-target steganalytic models, known as transferability, becomes especially important. Nevertheless, existing adversarial steganographic methods do not consider how to enhance transferability. To address this issue, we propose a novel adversarial steganographic scheme named Natias. Specifically, we first attribute the output of a steganalytic model to each neuron in the target middle layer to identify critical features. Next, we corrupt these critical features that may be adopted by diverse steganalytic models. Consequently, it can promote the transferability of adversarial steganography. Our proposed method can be seamlessly integrated with existing adversarial steganography frameworks. Thorough experimental analyses affirm that our proposed technique possesses improved transferability when contrasted with former approaches, and it attains heightened security in retraining scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04968v1-abstract-full').style.display = 'none'; document.getElementById('2409.04968v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by IEEE TIFS</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.17168</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Dai%2C+P">Peng Dai</a>, <a href="/search/cs?searchtype=author&amp;query=Su%2C+Z">Zhuo Su</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+X">Xu Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Lv%2C+Z">Zheng Lv</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jiarui Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+T">Tianyuan Du</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+G">Guidong Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yang Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.17168v1-abstract-short" style="display: inline;"> Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. 
Most importantly, the lack of real-world datasets containing both modalities is a major&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.17168v1-abstract-full').style.display = 'inline'; document.getElementById('2408.17168v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.17168v1-abstract-full" style="display: none;"> Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome the barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected under the real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. 
We believe the release of EMHI and the method could advance the research of egocentric HPE and expedite the practical implementation of this technology in VR/AR products. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.17168v1-abstract-full').style.display = 'none'; document.getElementById('2408.17168v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.09357</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+X">Xukun Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+F">Fengxin Li</a>, <a href="/search/cs?searchtype=author&amp;query=Peng%2C+Z">Ziqiao Peng</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+K">Kejian Wu</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+J">Jun He</a>, <a href="/search/cs?searchtype=author&amp;query=Qin%2C+B">Biao Qin</a>, <a 
href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhaoxin Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Hongyan Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.09357v1-abstract-short" style="display: inline;"> Audio-driven 3D face animation is increasingly vital in live streaming and augmented reality applications. While remarkable progress has been observed, most existing approaches are designed for specific individuals with predefined speaking styles, thus neglecting the adaptability to varied speaking styles. To address this limitation, this paper introduces MetaFace, a novel methodology meticulously&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.09357v1-abstract-full').style.display = 'inline'; document.getElementById('2408.09357v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.09357v1-abstract-full" style="display: none;"> Audio-driven 3D face animation is increasingly vital in live streaming and augmented reality applications. While remarkable progress has been observed, most existing approaches are designed for specific individuals with predefined speaking styles, thus neglecting the adaptability to varied speaking styles. To address this limitation, this paper introduces MetaFace, a novel methodology meticulously crafted for speaking style adaptation. 
Grounded in the novel concept of meta-learning, MetaFace is composed of several key components: the Robust Meta Initialization Stage (RMIS) for fundamental speaking style adaptation, the Dynamic Relation Mining Neural Process (DRMN) for forging connections between observed and unobserved speaking styles, and the Low-rank Matrix Memory Reduction Approach to enhance the efficiency of model optimization as well as learning style details. Leveraging these novel designs, MetaFace not only significantly outperforms robust existing baselines but also establishes a new state-of-the-art, as substantiated by our experimental results. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.09357v1-abstract-full').style.display = 'none'; document.getElementById('2408.09357v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.05233</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Large Language Model based Agent Framework for Electric Vehicle Charging Behavior Simulation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Feng%2C+J">Junkang Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Cui%2C+C">Chenggang Cui</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+C">Chuanlin Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zizhu Fan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.05233v1-abstract-short" style="display: inline;"> This paper introduces a new LLM based agent framework for simulating electric vehicle (EV) charging behavior, integrating user preferences, psychological characteristics, and environmental factors to optimize the charging process. The framework comprises several modules, enabling sophisticated, adaptive simulations. 
Dynamic decision making is supported by continuous reflection and memory updates,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.05233v1-abstract-full').style.display = 'inline'; document.getElementById('2408.05233v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.05233v1-abstract-full" style="display: none;"> This paper introduces a new LLM based agent framework for simulating electric vehicle (EV) charging behavior, integrating user preferences, psychological characteristics, and environmental factors to optimize the charging process. The framework comprises several modules, enabling sophisticated, adaptive simulations. Dynamic decision making is supported by continuous reflection and memory updates, ensuring alignment with user expectations and enhanced efficiency. The framework&#39;s ability to generate personalized user profiles and real-time decisions offers significant advancements for urban EV charging management. Future work could focus on incorporating more intricate scenarios and expanding data sources to enhance predictive accuracy and practical utility. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.05233v1-abstract-full').style.display = 'none'; document.getElementById('2408.05233v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">7 pages,3 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.01826</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lin%2C+Y">Yihong Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhaoxin Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Xiong%2C+L">Lingyu Xiong</a>, <a href="/search/cs?searchtype=author&amp;query=Peng%2C+L">Liang Peng</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+X">Xiandong Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+W">Wenxiong Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+X">Xianjia Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Lei%2C+S">Songju Lei</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+H">Huang Xu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.01826v2-abstract-short" style="display: inline;"> Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. 
However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.01826v2-abstract-full').style.display = 'inline'; document.getElementById('2408.01826v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.01826v2-abstract-full" style="display: none;"> Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in motion diversity and lip-sync accuracy. To address this issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer. The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a latent quantilized spatial-temporal space. To achieve this, GLDiTalker builds upon a quantilized space-time diffusion training pipeline, which consists of a Graph Enhanced Quantilized Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first stage ensures lip-sync accuracy, while the second stage enhances motion diversity. Together, these stages enable GLDiTalker to generate temporally and spatially stable, realistic models. Extensive evaluations on several widely used benchmarks demonstrate that our method achieves superior performance compared to existing methods. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.01826v2-abstract-full').style.display = 'none'; document.getElementById('2408.01826v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 3 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">9 pages, 5 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.01323</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+H">He Zhu</a>, <a href="/search/cs?searchtype=author&amp;query=Su%2C+J">Junyou Su</a>, <a href="/search/cs?searchtype=author&amp;query=Lun%2C+T">Tianle Lun</a>, <a href="/search/cs?searchtype=author&amp;query=Tao%2C+Y">Yicheng Tao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+W">Wenjia Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zipei Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+G">Guanhua Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.01323v1-abstract-short" 
style="display: inline;"> Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls of proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.01323v1-abstract-full').style.display = 'inline'; document.getElementById('2408.01323v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.01323v1-abstract-full" style="display: none;"> Instruction fine-tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly API calls of proprietary LLMs. To address these challenges, we introduce FANNO, a fully autonomous, open-sourced framework that revolutionizes the annotation process without the need for pre-existing annotated data. Utilizing a Mistral-7b-instruct model, FANNO efficiently produces diverse and high-quality datasets through a structured process involving document pre-screening, instruction generation, and response generation. Experiments on the Open LLM Leaderboard and the AlpacaEval benchmark show that FANNO can generate high-quality data with diversity and complexity for free, comparable to human-annotated or cleaned datasets like Alpaca-GPT4-Cleaned. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.01323v1-abstract-full').style.display = 'none'; document.getElementById('2408.01323v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.21581</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> InScope: A New Real-world 3D Infrastructure-side Collaborative Perception Dataset for Open Traffic Scenarios </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xiaofei Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yining Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Jinping Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Qin%2C+X">Xiangyi Qin</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+Y">Ying Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhengping Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Tan%2C+X">Xiaojun Tan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.21581v1-abstract-short" style="display: inline;"> Perception systems of autonomous vehicles are susceptible to occlusion, especially when examined from a vehicle-centric perspective. 
Such occlusion can lead to overlooked object detections, e.g., larger vehicles such as trucks or buses may create blind spots where cyclists or pedestrians could be obscured, accentuating the safety concerns associated with such perception system limitations. To miti&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.21581v1-abstract-full').style.display = 'inline'; document.getElementById('2407.21581v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.21581v1-abstract-full" style="display: none;"> Perception systems of autonomous vehicles are susceptible to occlusion, especially when examined from a vehicle-centric perspective. Such occlusion can lead to overlooked object detections, e.g., larger vehicles such as trucks or buses may create blind spots where cyclists or pedestrians could be obscured, accentuating the safety concerns associated with such perception system limitations. To mitigate these challenges, the vehicle-to-everything (V2X) paradigm suggests employing an infrastructure-side perception system (IPS) to complement autonomous vehicles with a broader perceptual scope. Nevertheless, the scarcity of real-world 3D infrastructure-side datasets constrains the advancement of V2X technologies. To bridge these gaps, this paper introduces a new 3D infrastructure-side collaborative perception dataset, abbreviated as InScope. Notably, InScope is the first dataset dedicated to addressing occlusion challenges by strategically deploying multiple-position Light Detection and Ranging (LiDAR) systems on the infrastructure side. Specifically, InScope encapsulates a 20-day capture duration with 303 tracking trajectories and 187,787 3D bounding boxes annotated by experts. 
Four benchmarks are presented for open traffic scenarios, including collaborative 3D object detection, multisource data fusion, data domain transfer, and 3D multiobject tracking tasks. Additionally, a new metric is designed to quantify the impact of occlusion, facilitating the evaluation of detection degradation ratios among various algorithms. The experimental findings showcase the enhanced performance of leveraging InScope to assist in detecting and tracking 3D multiobjects in real-world scenarios, particularly in tracking obscured, small, and distant objects. The dataset and benchmarks are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.21581v1-abstract-full').style.display = 'none'; document.getElementById('2407.21581v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 31 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.19365</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> </div> <p class="title is-5 mathjax"> Seamless Website Fingerprinting in Multiple Environments </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Song%2C+C">Chuxu Song</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zining Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Hao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Martin%2C+R">Richard Martin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.19365v1-abstract-short" style="display: inline;"> Website fingerprinting (WF) attacks identify the websites visited over anonymized connections by analyzing patterns in network traffic flows, such as packet sizes, directions, or interval times using a machine learning classifier. Previous studies showed WF attacks achieve high classification accuracy. 
However, several issues call into question whether existing WF approaches are realizable in prac&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.19365v1-abstract-full').style.display = 'inline'; document.getElementById('2407.19365v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.19365v1-abstract-full" style="display: none;"> Website fingerprinting (WF) attacks identify the websites visited over anonymized connections by analyzing patterns in network traffic flows, such as packet sizes, directions, or interval times using a machine learning classifier. Previous studies showed WF attacks achieve high classification accuracy. However, several issues call into question whether existing WF approaches are realizable in practice and thus motivate a re-exploration. Due to Tor&#39;s performance issues and resulting poor browsing experience, the vast majority of users opt for Virtual Private Networking (VPN) despite VPNs&#39; weaker privacy protections. Many other past assumptions are increasingly unrealistic as web technology advances. Our work addresses several key limitations of prior art. First, we introduce a new approach that classifies entire websites rather than individual web pages. Site-level classification uses traffic from all site components, including advertisements, multimedia, and single-page applications. Second, our Convolutional Neural Network (CNN) uses only the jitter and size of 500 contiguous packets from any point in a TCP stream, in contrast to prior work requiring heuristics to find page boundaries. Our seamless approach makes eavesdropper attack models realistic. Using traces from a controlled browser, we show our CNN matches observed traffic to a website with over 90% accuracy. 
We found the training traffic quality is critical as classification accuracy is significantly reduced when the training data lacks variability in network location, performance, and clients&#39; computational capability. We enhanced the base CNN&#39;s efficacy using domain adaptation, allowing it to discount irrelevant features, such as network location. Lastly, we evaluate several defensive strategies against seamless WF attacks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.19365v1-abstract-full').style.display = 'none'; document.getElementById('2407.19365v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">16 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.13331</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Reconstruct the Pruned Model without Any Retraining </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+P">Pingjie Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Ziqing Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+S">Shengchao Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhe Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yanfeng Wang</a>, <a 
href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yu Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.13331v1-abstract-short" style="display: inline;"> Structured pruning is a promising hardware-friendly compression technique for large language models (LLMs), which is expected to be retraining-free to avoid the enormous retraining cost. This retraining-free paradigm involves (1) pruning criteria to define the architecture and (2) distortion reconstruction to restore performance. However, existing methods often emphasize pruning criteria while usi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.13331v1-abstract-full').style.display = 'inline'; document.getElementById('2407.13331v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.13331v1-abstract-full" style="display: none;"> Structured pruning is a promising hardware-friendly compression technique for large language models (LLMs), which is expected to be retraining-free to avoid the enormous retraining cost. This retraining-free paradigm involves (1) pruning criteria to define the architecture and (2) distortion reconstruction to restore performance. However, existing methods often emphasize pruning criteria while using reconstruction techniques that are specific to certain modules or criteria, resulting in limited generalizability. To address this, we introduce the Linear Interpolation-based Adaptive Reconstruction (LIAR) framework, which is both efficient and effective. LIAR does not require back-propagation or retraining and is compatible with various pruning criteria and modules. By applying linear interpolation to the preserved weights, LIAR minimizes reconstruction error and effectively reconstructs the pruned output. 
Our evaluations on benchmarks such as GLUE, SQuAD, WikiText, and common sense reasoning show that LIAR enables a BERT model to maintain 98% accuracy even after removing 50% of its parameters and achieves top performance for LLaMA in just a few minutes. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.13331v1-abstract-full').style.display = 'none'; document.getElementById('2407.13331v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">18 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.10671</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Qwen2 Technical Report </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+A">An Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+B">Baosong Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Hui%2C+B">Binyuan Hui</a>, <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+B">Bo Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+B">Bowen Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+C">Chang Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chengpeng Li</a>, <a 
href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chengyuan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+D">Dayiheng Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+F">Fei Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Dong%2C+G">Guanting Dong</a>, <a href="/search/cs?searchtype=author&amp;query=Wei%2C+H">Haoran Wei</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+H">Huan Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+J">Jialong Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Jialin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+J">Jian Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Tu%2C+J">Jianhong Tu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jianwei Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+J">Jianxin Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+J">Jianxin Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+J">Jin Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+J">Jingren Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Bai%2C+J">Jinze Bai</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+J">Jinzheng He</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+J">Junyang Lin</a> , et al. (37 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.10671v4-abstract-short" style="display: inline;"> This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. 
Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, a&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.10671v4-abstract-full').style.display = 'inline'; document.getElementById('2407.10671v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.10671v4-abstract-full" style="display: none;"> This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.10671v4-abstract-full').style.display = 'none'; document.getElementById('2407.10671v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">26 pages, 1 figure</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.10241</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiting Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+R">Ruizhe Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+R">Ruiling Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zuozhu Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.10241v2-abstract-short" style="display: inline;"> Evaluating the bias in Large Language Models (LLMs) becomes increasingly crucial with their rapid development. 
However, existing evaluation methods rely on fixed-form outputs and cannot adapt to the flexible open-text generation scenarios of LLMs (e.g., sentence completion and question answering). To address this, we introduce BiasAlert, a plug-and-play tool designed to detect social bias in open-&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.10241v2-abstract-full').style.display = 'inline'; document.getElementById('2407.10241v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.10241v2-abstract-full" style="display: none;"> Evaluating the bias in Large Language Models (LLMs) becomes increasingly crucial with their rapid development. However, existing evaluation methods rely on fixed-form outputs and cannot adapt to the flexible open-text generation scenarios of LLMs (e.g., sentence completion and question answering). To address this, we introduce BiasAlert, a plug-and-play tool designed to detect social bias in open-text generations of LLMs. BiasAlert integrates external human knowledge with inherent reasoning capabilities to detect bias reliably. Extensive experiments demonstrate that BiasAlert significantly outperforms existing state-of-the-art methods like GPT4-as-A-Judge in detecting bias. Furthermore, through application studies, we demonstrate the utility of BiasAlert in reliable LLM bias evaluation and bias mitigation across various scenarios. Model and code will be publicly released. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.10241v2-abstract-full').style.display = 'none'; document.getElementById('2407.10241v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 14 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.10098</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Operating Systems">cs.OS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Hardware Architecture">cs.AR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Networking and Internet Architecture">cs.NI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Performance">cs.PF</span> </div> </div> <p class="title is-5 mathjax"> Accelerator-as-a-Service in Public Clouds: An Intra-Host Traffic Management View for Performance Isolation in the Wild </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+J">Jiechen Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Shu%2C+R">Ran Shu</a>, <a href="/search/cs?searchtype=author&amp;query=Lim%2C+K">Katie Lim</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zewen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Anderson%2C+T">Thomas Anderson</a>, <a 
href="/search/cs?searchtype=author&amp;query=Gao%2C+M">Mingyu Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Jerger%2C+N+E">Natalie Enright Jerger</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.10098v1-abstract-short" style="display: inline;"> I/O devices in public clouds have integrated increasing numbers of hardware accelerators, e.g., AWS Nitro, Azure FPGA and Nvidia BlueField. However, such specialized compute (1) is not explicitly accessible to cloud users with performance guarantee, (2) cannot be leveraged simultaneously by both providers and users, unlike general-purpose compute (e.g., CPUs). Through ten observations, we present&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.10098v1-abstract-full').style.display = 'inline'; document.getElementById('2407.10098v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.10098v1-abstract-full" style="display: none;"> I/O devices in public clouds have integrated increasing numbers of hardware accelerators, e.g., AWS Nitro, Azure FPGA and Nvidia BlueField. However, such specialized compute (1) is not explicitly accessible to cloud users with performance guarantee, (2) cannot be leveraged simultaneously by both providers and users, unlike general-purpose compute (e.g., CPUs). Through ten observations, we present that the fundamental difficulty of democratizing accelerators is insufficient performance isolation support. The key obstacles to enforcing accelerator isolation are (1) too many unknown traffic patterns in public clouds and (2) too many possible contention sources in the datapath. 
In this work, instead of scheduling such complex traffic on-the-fly and augmenting isolation support on each system component, we propose to model traffic as network flows and proactively re-shape the traffic to avoid unpredictable contention. We discuss the implications of our findings on the design of future I/O management stacks and device interfaces. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.10098v1-abstract-full').style.display = 'none'; document.getElementById('2407.10098v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.04064</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> Collision Avoidance for Multiple UAVs in Unknown Scenarios with Causal Representation Disentanglement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhuang%2C+J">Jiafan Zhuang</a>, <a href="/search/cs?searchtype=author&amp;query=Xia%2C+Z">Zihao Xia</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+G">Gaofei Han</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+B">Boxi Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Wenji Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+D">Dongliang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Hao%2C+Z">Zhifeng Hao</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+R">Ruichu Cai</a>, <a 
href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhun Fan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.04064v2-abstract-short" style="display: inline;"> Deep reinforcement learning (DRL) has achieved remarkable progress in online path planning tasks for multi-UAV systems. However, existing DRL-based methods often suffer from performance degradation when tackling unseen scenarios, since the non-causal factors in visual representations adversely affect policy learning. To address this issue, we propose a novel representation learning approach, i.e.,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.04064v2-abstract-full').style.display = 'inline'; document.getElementById('2407.04064v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.04064v2-abstract-full" style="display: none;"> Deep reinforcement learning (DRL) has achieved remarkable progress in online path planning tasks for multi-UAV systems. However, existing DRL-based methods often suffer from performance degradation when tackling unseen scenarios, since the non-causal factors in visual representations adversely affect policy learning. To address this issue, we propose a novel representation learning approach, i.e., causal representation disentanglement, which can identify the causal and non-causal factors in representations. After that, we only pass causal factors for subsequent policy learning and thus explicitly eliminate the influence of non-causal factors, which effectively improves the generalization ability of DRL models. Experimental results show that our proposed method can achieve robust navigation performance and effective collision avoidance especially in unseen scenarios, which significantly outperforms existing SOTA algorithms. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.04064v2-abstract-full').style.display = 'none'; document.getElementById('2407.04064v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 4 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.04056</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> Robust Policy Learning for Multi-UAV Collision Avoidance with Causal Feature Selection </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhuang%2C+J">Jiafan Zhuang</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+G">Gaofei Han</a>, <a href="/search/cs?searchtype=author&amp;query=Xia%2C+Z">Zihao Xia</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+B">Boxi Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Wenji Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+D">Dongliang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Hao%2C+Z">Zhifeng Hao</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+R">Ruichu Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhun Fan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.04056v2-abstract-short" style="display: inline;"> In unseen and 
complex outdoor environments, collision avoidance navigation for unmanned aerial vehicle (UAV) swarms presents a challenging problem. It requires UAVs to navigate through various obstacles and complex backgrounds. Existing collision avoidance navigation methods based on deep reinforcement learning show promising performance but suffer from poor generalization abilities, resulting in&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.04056v2-abstract-full').style.display = 'inline'; document.getElementById('2407.04056v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.04056v2-abstract-full" style="display: none;"> In unseen and complex outdoor environments, collision avoidance navigation for unmanned aerial vehicle (UAV) swarms presents a challenging problem. It requires UAVs to navigate through various obstacles and complex backgrounds. Existing collision avoidance navigation methods based on deep reinforcement learning show promising performance but suffer from poor generalization abilities, resulting in performance degradation in unseen environments. To address this issue, we investigate the cause of weak generalization ability in DRL and propose a novel causal feature selection module. This module can be integrated into the policy network and effectively filters out non-causal factors in representations, thereby reducing the influence of spurious correlations between non-causal factors and action predictions. Experimental results demonstrate that our proposed method can achieve robust navigation performance and effective collision avoidance especially in scenarios with unseen backgrounds and obstacles, which significantly outperforms existing state-of-the-art algorithms. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.04056v2-abstract-full').style.display = 'none'; document.getElementById('2407.04056v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 4 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.03204</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Expressive Gaussian Human Avatars from Monocular RGB Video </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Hu%2C+H">Hezhen Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiwen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+T">Tianhao Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Xi%2C+Y">Yihan Xi</a>, <a href="/search/cs?searchtype=author&amp;query=Lee%2C+S">Seoyoung Lee</a>, <a href="/search/cs?searchtype=author&amp;query=Pavlakos%2C+G">Georgios Pavlakos</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhangyang Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.03204v1-abstract-short" style="display: inline;"> Nuanced expressiveness, particularly through fine-grained hand and facial expressions, is pivotal for enhancing the realism and vitality of digital human 
representations. In this work, we focus on investigating the expressiveness of human avatars when learned from monocular RGB video; a setting that introduces new challenges in capturing and animating fine-grained details. To this end, we introduc&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.03204v1-abstract-full').style.display = 'inline'; document.getElementById('2407.03204v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.03204v1-abstract-full" style="display: none;"> Nuanced expressiveness, particularly through fine-grained hand and facial expressions, is pivotal for enhancing the realism and vitality of digital human representations. In this work, we focus on investigating the expressiveness of human avatars when learned from monocular RGB video; a setting that introduces new challenges in capturing and animating fine-grained details. To this end, we introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X, an expressive parametric human model. Focused on enhancing expressiveness, our work makes three key contributions. First, we highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning. Recognizing the limitations of current SMPL-X prediction methods for in-the-wild videos, we introduce a plug-and-play module that significantly ameliorates misalignment issues. Second, we propose a context-aware adaptive density control strategy, which is adaptively adjusting the gradient thresholds to accommodate the varied granularity across body parts. Last but not least, we develop a feedback mechanism that predicts per-pixel confidence to better guide the learning of 3D Gaussians. 
Extensive experiments on two benchmarks demonstrate the superiority of our framework both quantitatively and qualitatively, especially on the fine-grained hand and facial details. See the project website at \url{} <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.03204v1-abstract-full').style.display = 'none'; document.getElementById('2407.03204v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.01607</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhongxiang Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zhaocheng Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liang%2C+J">Jian Liang</a>, <a href="/search/cs?searchtype=author&amp;query=Kong%2C+D">Dongying Kong</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+H">Han Li</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+P">Peng Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+S">Shuang Li</a>, <a 
href="/search/cs?searchtype=author&amp;query=Gai%2C+K">Kun Gai</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.01607v1-abstract-short" style="display: inline;"> This paper investigates the one-epoch overfitting phenomenon in Click-Through Rate (CTR) models, where performance notably declines at the start of the second epoch. Despite extensive research, the efficacy of multi-epoch training over the conventional one-epoch approach remains unclear. We identify the overfitting of the embedding layer, caused by high-dimensional data sparsity, as the primary is&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.01607v1-abstract-full').style.display = 'inline'; document.getElementById('2407.01607v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.01607v1-abstract-full" style="display: none;"> This paper investigates the one-epoch overfitting phenomenon in Click-Through Rate (CTR) models, where performance notably declines at the start of the second epoch. Despite extensive research, the efficacy of multi-epoch training over the conventional one-epoch approach remains unclear. We identify the overfitting of the embedding layer, caused by high-dimensional data sparsity, as the primary issue. To address this, we introduce a novel and simple Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios, which can be seamlessly integrated into existing deep CTR models and may have potential applications to handle the &#34;forgetting or overfitting&#34; dilemma in the retraining and the well-known catastrophic forgetting problems. 
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data or the Multi-Layer Perceptron (MLP) layers, and achieves data augmentation through training the MLP with varied embedding spaces. Our findings confirm that pre-trained MLP layers can adapt to new embedding spaces, enhancing performance without overfitting. This adaptability underscores the MLP layers&#39; role in learning a matching function focused on the relative relationships among embeddings rather than their absolute positions. To our knowledge, MEDA represents the first multi-epoch training strategy tailored for deep CTR prediction models. We conduct extensive experiments on several public and business datasets, and the effectiveness of data augmentation and superiority over conventional single-epoch training are fully demonstrated. Besides, MEDA has exhibited significant benefits in a real-world online advertising system. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.01607v1-abstract-full').style.display = 'none'; document.getElementById('2407.01607v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.01301</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chenxin Li</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Hengyu Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiwen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Wuyang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yifan Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Pan%2C+P">Panwang Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+Y">Yixuan Yuan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.01301v1-abstract-short" style="display: inline;"> Recent advancements in large generative models and real-time neural rendering using point-based techniques pave the way for a future of widespread visual data distribution through sharing synthesized 3D assets. 
However, while standardized methods for embedding proprietary or copyright information, either overtly or subtly, exist for conventional visual content such as images and videos, this issue&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.01301v1-abstract-full').style.display = 'inline'; document.getElementById('2407.01301v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.01301v1-abstract-full" style="display: none;"> Recent advancements in large generative models and real-time neural rendering using point-based techniques pave the way for a future of widespread visual data distribution through sharing synthesized 3D assets. However, while standardized methods for embedding proprietary or copyright information, either overtly or subtly, exist for conventional visual content such as images and videos, this issue remains unexplored for emerging generative 3D formats like Gaussian Splatting. We present GaussianStego, a method for embedding steganographic information in the rendering of generated 3D assets. Our approach employs an optimization framework that enables the accurate extraction of hidden information from images rendered using Gaussian assets derived from large models, while maintaining their original visual quality. We conduct preliminary evaluations of our method across several potential deployment scenarios and discuss issues identified through analysis. GaussianStego represents an initial exploration into the novel challenge of embedding customizable, imperceptible, and recoverable information within the renders produced by current 3D generative models, while ensuring minimal impact on the rendered content&#39;s quality. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.01301v1-abstract-full').style.display = 'none'; document.getElementById('2407.01301v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project website:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.16137</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> MLPHand: Real Time Multi-View 3D Hand Mesh Reconstruction via MLP Modeling </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+J">Jian Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+J">Jiakun Li</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+G">Guoming Li</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+Z">Zhen Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+H">Huai-Yu Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhaoxin Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+H">Heng Huang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.16137v1-abstract-short" style="display: inline;"> Multi-view hand mesh reconstruction is a critical task for applications 
in virtual reality and human-computer interaction, but it remains a formidable challenge. Although existing multi-view hand reconstruction methods achieve remarkable accuracy, they typically come with an intensive computational burden that hinders real-time inference. To this end, we propose MLPHand, a novel method designed fo&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.16137v1-abstract-full').style.display = 'inline'; document.getElementById('2406.16137v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.16137v1-abstract-full" style="display: none;"> Multi-view hand mesh reconstruction is a critical task for applications in virtual reality and human-computer interaction, but it remains a formidable challenge. Although existing multi-view hand reconstruction methods achieve remarkable accuracy, they typically come with an intensive computational burden that hinders real-time inference. To this end, we propose MLPHand, a novel method designed for real-time multi-view single hand reconstruction. MLPHand consists of two primary modules: (1) a lightweight MLP-based Skeleton2Mesh model that efficiently recovers hand meshes from hand skeletons, and (2) a multi-view geometry feature fusion prediction module that enhances the Skeleton2Mesh model with detailed geometric information from multiple views. Experiments on three widely used datasets demonstrate that MLPHand can reduce computational complexity by 90% while achieving comparable reconstruction accuracy to existing state-of-the-art baselines. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.16137v1-abstract-full').style.display = 'none'; document.getElementById('2406.16137v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.14977</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> </div> </div> <p class="title is-5 mathjax"> Trustworthy Enhanced Multi-view Multi-modal Alzheimer&#39;s Disease Prediction with Brain-wide Imaging Transcriptomics Data </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Cong%2C+S">Shan Cong</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhoujie Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Hongwei Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yinghan Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+X">Xin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+H">Haoran Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+X">Xiaohui Yao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.14977v1-abstract-short" style="display: inline;"> Brain transcriptomics provides insights into the molecular mechanisms by which the brain 
coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer&#39;s disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of the brain. Furthermore, while striving to integrate complementary information between modalities,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14977v1-abstract-full').style.display = 'inline'; document.getElementById('2406.14977v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.14977v1-abstract-full" style="display: none;"> Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer&#39;s disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of the brain. Furthermore, while striving to integrate complementary information between modalities, most studies overlook the informativeness disparities between modalities. Here, we propose TMM, a trusted multiview multimodal graph attention framework for AD diagnosis, using extensive brain-wide transcriptomics and imaging data. First, we construct view-specific brain regional co-function networks (RRIs) from transcriptomics and multimodal radiomics data to incorporate interaction information from both biomolecular and imaging perspectives. Next, we apply graph attention (GAT) processing to each RRI network to produce graph embeddings and employ cross-modal attention to fuse transcriptomics-derived embedding with each imaging-derived embedding. Finally, a novel true-false-harmonized class probability (TFCP) strategy is designed to assess and adaptively adjust the prediction confidence of each modality for AD diagnosis. 
We evaluate TMM using the AHBA database with brain-wide transcriptomics data and the ADNI database with three imaging modalities (AV45-PET, FDG-PET, and VBM-MRI). The results demonstrate the superiority of our method in identifying AD, EMCI, and LMCI compared to state-of-the-arts. Code and data are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14977v1-abstract-full').style.display = 'none'; document.getElementById('2406.14977v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.14859</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Siyuan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Long%2C+Z">Zhuohan Long</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhihao Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Wei%2C+Z">Zhongyu Wei</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.14859v1-abstract-short" style="display: inline;"> The rapid development of Large Language Models (LLMs) and 
Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14859v1-abstract-full').style.display = 'inline'; document.getElementById('2406.14859v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.14859v1-abstract-full" style="display: none;"> The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14859v1-abstract-full').style.display = 'none'; document.getElementById('2406.14859v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.13527</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> 4K4DGen: Panoramic 4D Generation at 4K Resolution </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+R">Renjie Li</a>, <a href="/search/cs?searchtype=author&amp;query=Pan%2C+P">Panwang Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+B">Bangbang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+D">Dejia Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+S">Shijie Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xuanyang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zeming Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kadambi%2C+A">Achuta Kadambi</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhangyang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Tu%2C+Z">Zhengzhong Tu</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhiwen Fan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.13527v3-abstract-short" style="display: inline;"> The blooming of virtual reality and augmented reality (VR/AR) technologies has driven an increasing demand for the creation of high-quality, immersive, and dynamic environments. 
However, existing generative techniques either focus solely on dynamic objects or perform outpainting from a single perspective image, failing to meet the requirements of VR/AR applications that need free-viewpoint, 360&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.13527v3-abstract-full').style.display = 'inline'; document.getElementById('2406.13527v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.13527v3-abstract-full" style="display: none;"> The blooming of virtual reality and augmented reality (VR/AR) technologies has driven an increasing demand for the creation of high-quality, immersive, and dynamic environments. However, existing generative techniques either focus solely on dynamic objects or perform outpainting from a single perspective image, failing to meet the requirements of VR/AR applications that need free-viewpoint, 360$^{\circ}$ virtual views where users can move in all directions. In this work, we tackle the challenging task of elevating a single panorama to an immersive 4D experience. For the first time, we demonstrate the capability to generate omnidirectional dynamic scenes with 360$^{\circ}$ views at 4K (4096 $\times$ 2048) resolution, thereby providing an immersive user experience. Our method introduces a pipeline that facilitates natural scene animations and optimizes a set of dynamic Gaussians using efficient splatting techniques for real-time exploration. To overcome the lack of scene-scale annotated 4D data and models, especially in panoramic formats, we propose a novel \textbf{Panoramic Denoiser} that adapts generic 2D diffusion priors to animate consistently in 360$^{\circ}$ images, transforming them into panoramic videos with dynamic scenes at targeted regions. 
Subsequently, we propose \textbf{Dynamic Panoramic Lifting} to elevate the panoramic video into a 4D immersive environment while preserving spatial and temporal consistency. By transferring prior knowledge from 2D models in the perspective domain to the panoramic domain and the 4D lifting with spatial appearance and geometry regularization, we achieve high-quality Panorama-to-4D generation at a resolution of 4K for the first time. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.13527v3-abstract-full').style.display = 'none'; document.getElementById('2406.13527v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 19 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.12459</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Pan%2C+P">Panwang Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Su%2C+Z">Zhuo Su</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+C">Chenguo Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Fan%2C+Z">Zhen Fan</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yongjie Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zeming Li</a>, <a 
href="/search/cs?searchtype=author&amp;query=Shen%2C+T">Tingting Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Mu%2C+Y">Yadong Mu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yebin Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.12459v2-abstract-short" style="display: inline;"> Despite recent advancements in high-fidelity human reconstruction techniques, the requirements for densely captured images or time-consuming per-instance optimization significantly hinder their applications in broader scenarios. To tackle these issues, we present HumanSplat which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner. In part&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.12459v2-abstract-full').style.display = 'inline'; document.getElementById('2406.12459v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.12459v2-abstract-full" style="display: none;"> Despite recent advancements in high-fidelity human reconstruction techniques, the requirements for densely captured images or time-consuming per-instance optimization significantly hinder their applications in broader scenarios. To tackle these issues, we present HumanSplat which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner. In particular, HumanSplat comprises a 2D multi-view diffusion model and a latent reconstruction transformer with human structure priors that adeptly integrate geometric priors and semantic features within a unified framework. 
A hierarchical loss that incorporates human semantic information is further designed to achieve high-fidelity texture modeling and better constrain the estimated multiple views. Comprehensive experiments on standard benchmarks and in-the-wild images demonstrate that HumanSplat surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.12459v2-abstract-full').style.display = 'none'; document.getElementById('2406.12459v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 18 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> </ol> <nav class="pagination is-small is-centered breathe-horizontal" role="navigation" aria-label="pagination"> <a href="" class="pagination-previous is-invisible">Previous </a> <a 