tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> </div> </div> <p class="title is-5 mathjax"> Interdisciplinary Translations: Sensory Perception as a Universal Language </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xindi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+X">Xuanyang Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Song%2C+M">Mingdong Song</a>, <a href="/search/cs?searchtype=author&amp;query=Guljajeva%2C+V">Varvara Guljajeva</a>, <a href="/search/cs?searchtype=author&amp;query=Kuchera-Morin%2C+J">JoAnn Kuchera-Morin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05374v1-abstract-short" style="display: inline;"> This paper investigates sensory perception&#39;s pivotal role as a universal communicative bridge across varied cultures and disciplines, and how it manifests its value in the study of media art, human computer interaction and artificial intelligence. By analyzing its function in non-verbal communication through interactive systems, and drawing on the interpretive model in translation studies where &#34;s&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05374v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05374v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05374v1-abstract-full" style="display: none;"> This paper investigates sensory perception&#39;s pivotal role as a universal communicative bridge across varied cultures and disciplines, and how it manifests its value in the study of media art, human computer interaction and artificial intelligence. 
By analyzing its function in non-verbal communication through interactive systems, and drawing on the interpretive model in translation studies where &#34;sense&#34; acts as a mediation between two languages, this paper illustrates how interdisciplinary communication in media art and human-computer interaction is afforded by the abstract language of human sensory perception. Specific examples from traditional art, interactive media art, HCI, communication, and translation studies demonstrate how sensory feedback translates and conveys meaning across diverse modalities of expression and how it fosters connections between humans, art, and technology. Pertaining to this topic, this paper analyzes the impact of sensory feedback systems in designing interactive experiences, and reveals the guiding role of sensory perception in the design philosophy of AI systems. Overall, the study aims to broaden the understanding of sensory perception&#39;s role in communication, highlighting its significance in the evolution of interactive experiences and its capacity to unify art, science, and the human experience. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05374v1-abstract-full').style.display = 'none'; document.getElementById('2411.05374v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">This paper has been accepted to the International Symposium of Electronic Arts 2024, and the proceedings version will be available at with DOI to be added once published</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.12707</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Z">Zhenheng Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xueze Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Yin%2C+Y">Yiming Yin</a>, <a href="/search/cs?searchtype=author&amp;query=Pan%2C+X">Xinglin Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuxin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+X">Xin He</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q">Qiang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zeng%2C+R">Rongfei Zeng</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+K">Kaiyong Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+S">Shaohuai Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+A+C">Amelie Chi Zhou</a>, <a 
href="/search/cs?searchtype=author&amp;query=Li%2C+B">Bo Li</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+B">Bingsheng He</a>, <a href="/search/cs?searchtype=author&amp;query=Chu%2C+X">Xiaowen Chu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.12707v1-abstract-short" style="display: inline;"> To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, incl&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.12707v1-abstract-full').style.display = 'inline'; document.getElementById('2410.12707v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.12707v1-abstract-full" style="display: none;"> To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). 
Each node in the DAG represents an operator in the DNN, while each edge represents a data dependency between operators. Based on this design, 1) users can customize any DNN without worrying about low-level operator implementations; 2) we enable task scheduling with finer-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps to 10 Gbps networks. Experimental results demonstrate that our system and method can achieve a 1.45x to 9.39x speedup compared to baseline methods while ensuring convergence. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.12707v1-abstract-full').style.display = 'none'; document.getElementById('2410.12707v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.11013</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> Incorporating Task Progress Knowledge for Subgoal Generation in Robotic Manipulation through Image Edits </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xuhui Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Kuo%2C+Y">Yen-Ling Kuo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.11013v1-abstract-short" style="display: inline;"> Understanding the progress of a task allows humans to not only track what has been done but also to better plan for future goals. We demonstrate TaKSIE, a novel framework that incorporates task progress knowledge into visual subgoal generation for robotic manipulation tasks. We jointly train a recurrent network with a latent diffusion model to generate the next visual subgoal based on the robot&#39;s&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11013v1-abstract-full').style.display = 'inline'; document.getElementById('2410.11013v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.11013v1-abstract-full" style="display: none;"> Understanding the progress of a task allows humans to not only track what has been done but also to better plan for future goals. We demonstrate TaKSIE, a novel framework that incorporates task progress knowledge into visual subgoal generation for robotic manipulation tasks. 
We jointly train a recurrent network with a latent diffusion model to generate the next visual subgoal based on the robot&#39;s current observation and the input language command. At execution time, the robot leverages a visual progress representation to monitor the task progress and adaptively samples the next visual subgoal from the model to guide the manipulation policy. We train and validate our model in simulated and real-world robotic tasks, achieving state-of-the-art performance on the CALVIN manipulation benchmark. We find that the inclusion of task progress knowledge can improve the robustness of trained policy for different initial robot poses or various movement speeds during demonstrations. The project website can be found at . <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11013v1-abstract-full').style.display = 'none'; document.getElementById('2410.11013v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">11 pages, 9 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.05729</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1007/978-3-031-73235-5_9 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xueyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Luan%2C+Z">Zhaoliang Luan</a>, <a href="/search/cs?searchtype=author&amp;query=Khoshelham%2C+K">Kourosh Khoshelham</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+B">Bing Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.05729v1-abstract-short" style="display: inline;"> Point cloud registration is a foundational task for 3D alignment and reconstruction applications. 
While both traditional and learning-based registration approaches have succeeded, leveraging the intrinsic symmetry of point cloud data, including rotation equivariance, has received insufficient attention. This prohibits the model from learning effectively, resulting in a requirement for more trainin&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.05729v1-abstract-full').style.display = 'inline'; document.getElementById('2410.05729v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.05729v1-abstract-full" style="display: none;"> Point cloud registration is a foundational task for 3D alignment and reconstruction applications. While both traditional and learning-based registration approaches have succeeded, leveraging the intrinsic symmetry of point cloud data, including rotation equivariance, has received insufficient attention. This prohibits the model from learning effectively, resulting in a requirement for more training data and increased model complexity. To address these challenges, we propose a graph neural network model embedded with a local Special Euclidean 3D equivariance property through SE(3) message passing based propagation. Our model is composed mainly of a descriptor module, equivariant graph layers, match similarity, and the final regression layers. This modular design enables us to utilize sparsely sampled input points and initialize the descriptor by self-trained or pre-trained geometric feature descriptors easily. Experiments conducted on the 3DMatch and KITTI datasets exhibit the compelling and robust performance of our model compared to state-of-the-art approaches, while the model complexity remains relatively low at the same time. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.05729v1-abstract-full').style.display = 'none'; document.getElementById('2410.05729v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">18 main body pages, and 9 pages for supplementary part</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.19638</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> From Pre-training Corpora to Large Language Models: What Factors Influence LLM Performance in Causal Discovery Tasks? 
</p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Feng%2C+T">Tao Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+L">Lizhen Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Tandon%2C+N">Niket Tandon</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhuang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoxi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Haffari%2C+G">Gholamreza Haffari</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.19638v1-abstract-short" style="display: inline;"> Recent advances in artificial intelligence have seen Large Language Models (LLMs) demonstrate notable proficiency in causal discovery tasks. This study explores the factors influencing the performance of LLMs in causal discovery tasks. Utilizing open-source LLMs, we examine how the frequency of causal relations within their pre-training corpora affects their ability to accurately respond to causal&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.19638v1-abstract-full').style.display = 'inline'; document.getElementById('2407.19638v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.19638v1-abstract-full" style="display: none;"> Recent advances in artificial intelligence have seen Large Language Models (LLMs) demonstrate notable proficiency in causal discovery tasks. This study explores the factors influencing the performance of LLMs in causal discovery tasks. Utilizing open-source LLMs, we examine how the frequency of causal relations within their pre-training corpora affects their ability to accurately respond to causal discovery queries. 
Our findings reveal that a higher frequency of causal mentions correlates with better model performance, suggesting that extensive exposure to causal information during training enhances the models&#39; causal discovery capabilities. Additionally, we investigate the impact of context on the validity of causal relations. Our results indicate that LLMs might exhibit divergent predictions for identical causal relations when presented in different contexts. This paper provides the first comprehensive analysis of how different factors contribute to LLM performance in causal discovery tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.19638v1-abstract-full').style.display = 'none'; document.getElementById('2407.19638v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.04416</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+Y">Yi Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Jia%2C+D">Dongya Jia</a>, <a href="/search/cs?searchtype=author&amp;query=Zhuang%2C+X">Xiaobin Zhuang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Y">Yuanzhe Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Zhengxi Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Z">Zhuo Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuping Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuxuan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xubo Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiyuan Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Plumbley%2C+M+D">Mark D. Plumbley</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wenwu Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.04416v3-abstract-short" style="display: inline;"> Generative models have shown significant achievements in audio generation tasks. 
However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.04416v3-abstract-full').style.display = 'inline'; document.getElementById('2407.04416v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.04416v3-abstract-full" style="display: none;"> Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details including audio event orders, occurred places and environment information. We then demonstrate that training the text-to-audio generation models with Sound-VECaps significantly improves the performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks, showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.04416v3-abstract-full').style.display = 'none'; document.getElementById('2407.04416v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 5 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">5 pages with 1 appendix</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.03115</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> $L_p$-norm Distortion-Efficient Adversarial Attack </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+C">Chao Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuan-Gen Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zi-jia Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiangui Kang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.03115v1-abstract-short" style="display: inline;"> Adversarial examples have shown a powerful ability to make a well-trained model misclassified. 
Current mainstream adversarial attack methods only consider one of the distortions among $L_0$-norm, $L_2$-norm, and $L_\infty$-norm. $L_0$-norm based methods cause large modification on a single pixel, resulting in naked-eye visible detection, while $L_2$-norm and $L_\infty$-norm based methods suffer fr&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.03115v1-abstract-full').style.display = 'inline'; document.getElementById('2407.03115v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.03115v1-abstract-full" style="display: none;"> Adversarial examples have shown a powerful ability to make a well-trained model misclassified. Current mainstream adversarial attack methods only consider one of the distortions among $L_0$-norm, $L_2$-norm, and $L_\infty$-norm. $L_0$-norm based methods cause large modification on a single pixel, resulting in naked-eye visible detection, while $L_2$-norm and $L_\infty$-norm based methods suffer from weak robustness against adversarial defense since they always diffuse tiny perturbations to all pixels. A more realistic adversarial perturbation should be sparse and imperceptible. In this paper, we propose a novel $L_p$-norm distortion-efficient adversarial attack, which not only owns the least $L_2$-norm loss but also significantly reduces the $L_0$-norm distortion. To this aim, we design a new optimization scheme, which first optimizes an initial adversarial perturbation under $L_2$-norm constraint, and then constructs a dimension unimportance matrix for the initial perturbation. Such a dimension unimportance matrix can indicate the adversarial unimportance of each dimension of the initial perturbation. Furthermore, we introduce a new concept of adversarial threshold for the dimension unimportance matrix. 
All dimensions of the initial perturbation whose unimportance exceeds the threshold are set to zero, greatly decreasing the $L_0$-norm distortion. Experimental results on three benchmark datasets show that under the same query budget, the adversarial examples generated by our method have lower $L_0$-norm and $L_2$-norm distortion than state-of-the-art methods. In particular, on the MNIST dataset, our attack reduces the $L_2$-norm distortion by 8.1$\%$ while leaving 47$\%$ of pixels unattacked. This demonstrates the superiority of the proposed method over its competitors in terms of adversarial robustness and visual imperceptibility. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.03115v1-abstract-full').style.display = 'none'; document.getElementById('2407.03115v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.20078</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> GM-DF: Generalized Multi-Scenario Deepfake Detection </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lai%2C+Y">Yingxin Lai</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+Z">Zitong Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+J">Jing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+B">Bin Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiangui Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+L">Linlin Shen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.20078v1-abstract-short" style="display: inline;"> Existing face forgery detection usually follows the paradigm of training models in a single domain, which leads to limited generalization capacity when unseen scenarios and unknown attacks occur. In this paper, we elaborately investigate the generalization capacity of deepfake detection models when jointly trained on multiple face forgery detection datasets. 
We first find a rapid degradation of de&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.20078v1-abstract-full').style.display = 'inline'; document.getElementById('2406.20078v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.20078v1-abstract-full" style="display: none;"> Existing face forgery detection usually follows the paradigm of training models in a single domain, which leads to limited generalization capacity when unseen scenarios and unknown attacks occur. In this paper, we elaborately investigate the generalization capacity of deepfake detection models when jointly trained on multiple face forgery detection datasets. We first find a rapid degradation of detection accuracy when models are directly trained on combined datasets due to the discrepancy across collection scenarios and generation methods. To address the above issue, a Generalized Multi-Scenario Deepfake Detection framework (GM-DF) is proposed to serve multiple real-world scenarios by a unified model. First, we propose a hybrid expert modeling approach for domain-specific real/forgery feature extraction. Besides, as for the commonality representation, we use CLIP to extract the common features for better aligning visual and textual features across domains. Meanwhile, we introduce a masked image reconstruction mechanism to force models to capture rich forged details. Finally, we supervise the models via a domain-aware meta-learning strategy to further enhance their generalization capacities. Specifically, we design a novel domain alignment loss to strongly align the distributions of the meta-test domains and meta-train domains. Thus, the updated models are able to represent both specific and common real/forgery features across multiple datasets. 
In consideration of the lack of study of multi-dataset training, we establish a new benchmark leveraging multi-source data to fairly evaluate the models&#39; generalization capacity on unseen scenarios. Both qualitative and quantitative experiments on five datasets conducted on traditional protocols as well as the proposed benchmark demonstrate the effectiveness of our approach. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.20078v1-abstract-full').style.display = 'none'; document.getElementById('2406.20078v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.17300</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Feng%2C+T">Tao Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+L">Lizhen Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoxi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Haffari%2C+G">Gholamreza Haffari</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.17300v1-abstract-short" style="display: inline;"> Automatically evaluating the quality of responses in 
open-domain dialogue systems is a challenging but crucial task. Current evaluation metrics often fail to align with human judgments, especially when assessing responses that are grammatically correct. To address this issue, we propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength b&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.17300v1-abstract-full').style.display = 'inline'; document.getElementById('2406.17300v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.17300v1-abstract-full" style="display: none;"> Automatically evaluating the quality of responses in open-domain dialogue systems is a challenging but crucial task. Current evaluation metrics often fail to align with human judgments, especially when assessing responses that are grammatically correct. To address this issue, we propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses. The causal strength is estimated by utilizing both unconditional dependence and conditional dependencies from the dialogue history to responses. We compare our metric with the existing competitive metrics in terms of their alignment with human judgements. Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements. Additionally, we collect a new dialogue dataset CGDIALOG+ with human-annotated causal relations and a set of pairwise human judgements to facilitate the development of future automatic metrics. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.17300v1-abstract-full').style.display = 'none'; document.getElementById('2406.17300v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.13217</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Bridging Law and Data: Augmenting Reasoning via a Semi-Structured Dataset with IRAC methodology </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoxi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+L">Lizhen Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Soon%2C+L">Lay-Ki Soon</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhuang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Trakic%2C+A">Adnan Trakic</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.13217v1-abstract-short" style="display: inline;"> The effectiveness of Large Language Models (LLMs) in legal reasoning is often limited due to the unique legal terminologies and the necessity for highly specialized knowledge. These limitations highlight the need for high-quality data tailored for complex legal reasoning tasks. This paper introduces LEGALSEMI, a benchmark specifically curated for legal scenario analysis. 
LEGALSEMI comprises 54 leg&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.13217v1-abstract-full').style.display = 'inline'; document.getElementById('2406.13217v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.13217v1-abstract-full" style="display: none;"> The effectiveness of Large Language Models (LLMs) in legal reasoning is often limited due to the unique legal terminologies and the necessity for highly specialized knowledge. These limitations highlight the need for high-quality data tailored for complex legal reasoning tasks. This paper introduces LEGALSEMI, a benchmark specifically curated for legal scenario analysis. LEGALSEMI comprises 54 legal scenarios, each rigorously annotated by legal experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion) framework. In addition, LEGALSEMI is accompanied by a structured knowledge graph (SKG). A series of experiments were conducted to assess the usefulness of LEGALSEMI for IRAC analysis. The experimental results demonstrate the effectiveness of incorporating the SKG for issue identification, rule retrieval, application and conclusion generation using four different LLMs. LEGALSEMI will be publicly available upon acceptance of this paper. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.13217v1-abstract-full').style.display = 'none'; document.getElementById('2406.13217v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.12271</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Agriculture-Vision Challenge 2024 -- The Runner-Up Solution for Agricultural Pattern Recognition via Class Balancing and Model Ensemble </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+W">Wang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhiyu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Duan%2C+P">Puhong Duan</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xudong Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+S">Shutao Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.12271v1-abstract-short" style="display: inline;"> The Agriculture-Vision Challenge at CVPR 2024 aims at leveraging semantic segmentation models to produce pixel level semantic segmentation labels within regions of interest for multi-modality satellite images. It is one of the most famous and competitive challenges for global researchers to break the boundary between computer vision and agriculture sectors. 
However, there is a serious class imbala&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.12271v1-abstract-full').style.display = 'inline'; document.getElementById('2406.12271v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.12271v1-abstract-full" style="display: none;"> The Agriculture-Vision Challenge at CVPR 2024 aims at leveraging semantic segmentation models to produce pixel level semantic segmentation labels within regions of interest for multi-modality satellite images. It is one of the most famous and competitive challenges for global researchers to break the boundary between computer vision and agriculture sectors. However, there is a serious class imbalance problem in the agriculture-vision dataset, which hinders the semantic segmentation performance. To solve this problem, firstly, we propose a mosaic data augmentation with a rare class sampling strategy to enrich long-tail class samples. Secondly, we employ an adaptive class weight scheme to suppress the contribution of the common classes while increasing the ones of rare classes. Thirdly, we propose a probability post-process to increase the predicted value of the rare classes. Our methodology achieved a mean Intersection over Union (mIoU) score of 0.547 on the test set, securing second place in this challenge. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.12271v1-abstract-full').style.display = 'none'; document.getElementById('2406.12271v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.03749</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Huang%2C+S">Shuo Huang</a>, <a href="/search/cs?searchtype=author&amp;query=MacLean%2C+W">William MacLean</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoxi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+A">Anqi Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+L">Lizhen Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+Q">Qiongkai Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhuang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+X">Xingliang Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Haffari%2C+G">Gholamreza Haffari</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.03749v1-abstract-short" style="display: inline;"> Increasing concerns about privacy leakage issues in academia and industry arise when employing NLP models from third-party providers to process sensitive texts. 
To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.03749v1-abstract-full').style.display = 'inline'; document.getElementById('2406.03749v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.03749v1-abstract-full" style="display: none;"> Increasing concerns about privacy leakage issues in academia and industry arise when employing NLP models from third-party providers to process sensitive texts. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to the prior works based on differential privacy, which lead to a sharp drop in information utility and unnatural texts, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.03749v1-abstract-full').style.display = 'none'; document.getElementById('2406.03749v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.02237</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Self-Modifying State Modeling for Simultaneous Machine Translation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yu%2C+D">Donglei Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaomian Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yuchen Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yu Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Zong%2C+C">Chengqing Zong</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.02237v1-abstract-short" style="display: inline;"> Simultaneous Machine Translation (SiMT) generates target outputs while receiving stream source inputs and requires a read/write policy to decide whether to wait for the next source token or generate a new target token, whose decisions form a \textit{decision path}. Existing SiMT methods, which learn the policy by exploring various decision paths in training, face inherent limitations. 
These method&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.02237v1-abstract-full').style.display = 'inline'; document.getElementById('2406.02237v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.02237v1-abstract-full" style="display: none;"> Simultaneous Machine Translation (SiMT) generates target outputs while receiving stream source inputs and requires a read/write policy to decide whether to wait for the next source token or generate a new target token, whose decisions form a \textit{decision path}. Existing SiMT methods, which learn the policy by exploring various decision paths in training, face inherent limitations. These methods not only fail to precisely optimize the policy due to the inability to accurately assess the individual impact of each decision on SiMT performance, but also cannot sufficiently explore all potential paths because of their vast number. Besides, building decision paths requires unidirectional encoders to simulate streaming source inputs, which impairs the translation quality of SiMT models. To solve these issues, we propose \textbf{S}elf-\textbf{M}odifying \textbf{S}tate \textbf{M}odeling (SM$^2$), a novel training paradigm for the SiMT task. Without building decision paths, SM$^2$ individually optimizes decisions at each state during training. To precisely optimize the policy, SM$^2$ introduces a Self-Modifying process to independently assess and adjust decisions at each state. For sufficient exploration, SM$^2$ proposes Prefix Sampling to efficiently traverse all potential states. Moreover, SM$^2$ ensures compatibility with bidirectional encoders, thus achieving higher translation quality. Experiments show that SM$^2$ outperforms strong baselines. Furthermore, SM$^2$ allows offline machine translation models to acquire SiMT ability with fine-tuning. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.02237v1-abstract-full').style.display = 'none'; document.getElementById('2406.02237v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to ACL 2024 main conference. 15 pages, 13 figures, 9 tables</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.14905</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Kang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+Z">Zhuoqi Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaolu Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhong%2C+Z">Zhusi Zhong</a>, <a href="/search/cs?searchtype=author&amp;query=Jiao%2C+Z">Zhicheng Jiao</a>, <a href="/search/cs?searchtype=author&amp;query=Baird%2C+G">Grayson Baird</a>, <a href="/search/cs?searchtype=author&amp;query=Bai%2C+H">Harrison 
Bai</a>, <a href="/search/cs?searchtype=author&amp;query=Miao%2C+Q">Qiguang Miao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.14905v1-abstract-short" style="display: inline;"> The automated generation of imaging reports proves invaluable in alleviating the workload of radiologists. A clinically applicable reports generation algorithm should demonstrate its effectiveness in producing reports that accurately describe radiology findings and attend to patient-specific indications. In this paper, we introduce a novel method, \textbf{S}tructural \textbf{E}ntities extraction a&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.14905v1-abstract-full').style.display = 'inline'; document.getElementById('2405.14905v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.14905v1-abstract-full" style="display: none;"> The automated generation of imaging reports proves invaluable in alleviating the workload of radiologists. A clinically applicable reports generation algorithm should demonstrate its effectiveness in producing reports that accurately describe radiology findings and attend to patient-specific indications. In this paper, we introduce a novel method, \textbf{S}tructural \textbf{E}ntities extraction and patient indications \textbf{I}ncorporation (SEI) for chest X-ray report generation. Specifically, we employ a structural entities extraction (SEE) approach to eliminate presentation-style vocabulary in reports and improve the quality of factual entity sequences. 
This reduces the noise in the following cross-modal alignment module by aligning X-ray images with factual entity sequences in reports, thereby enhancing the precision of cross-modal alignment and further aiding the model in gradient-free retrieval of similar historical cases. Subsequently, we propose a cross-modal fusion network to integrate information from X-ray images, similar historical cases, and patient-specific indications. This process allows the text decoder to attend to discriminative features of X-ray images, assimilate historical diagnostic information from similar cases, and understand the examination intention of patients. This, in turn, assists in triggering the text decoder to produce high-quality reports. Experiments conducted on MIMIC-CXR validate the superiority of SEI over state-of-the-art approaches on both natural language generation and clinical efficacy metrics. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.14905v1-abstract-full').style.display = 'none'; document.getElementById('2405.14905v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">The code is available at or</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.11151</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Multi-scale Information Sharing and Selection Network with Boundary Attention for Polyp Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaolu Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+Z">Zhuoqi Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Kang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yunan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Miao%2C+Q">Qiguang Miao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.11151v1-abstract-short" style="display: inline;"> Polyp segmentation for colonoscopy images is of vital importance in clinical practice. It can provide valuable information for colorectal cancer diagnosis and surgery. While existing methods have achieved relatively good performance, polyp segmentation still faces the following challenges: (1) Varying lighting conditions in colonoscopy and differences in polyp locations, sizes, and morphologies. 
(&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.11151v1-abstract-full').style.display = 'inline'; document.getElementById('2405.11151v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.11151v1-abstract-full" style="display: none;"> Polyp segmentation for colonoscopy images is of vital importance in clinical practice. It can provide valuable information for colorectal cancer diagnosis and surgery. While existing methods have achieved relatively good performance, polyp segmentation still faces the following challenges: (1) Varying lighting conditions in colonoscopy and differences in polyp locations, sizes, and morphologies. (2) The indistinct boundary between polyps and surrounding tissue. To address these challenges, we propose a Multi-scale information sharing and selection network (MISNet) for the polyp segmentation task. We design a Selectively Shared Fusion Module (SSFM) to enforce information sharing and active selection between low-level and high-level features, thereby enhancing the model&#39;s ability to capture comprehensive information. We then design a Parallel Attention Module (PAM) to enhance the model&#39;s attention to boundaries, and a Balancing Weight Module (BWM) to facilitate the continuous refinement of boundary segmentation in the bottom-up process. Experiments on five polyp segmentation datasets demonstrate that MISNet improves the accuracy and clarity of segmentation results, outperforming state-of-the-art methods. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.11151v1-abstract-full').style.display = 'none'; document.getElementById('2405.11151v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.09586</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Kang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+Z">Zhuoqi Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+M">Mengmeng Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Jiao%2C+Z">Zhicheng Jiao</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaolu Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Miao%2C+Q">Qiguang Miao</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+K">Kun Xie</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.09586v2-abstract-short" style="display: inline;"> A 
radiology report comprises presentation-style vocabulary, which ensures clarity and organization, and factual vocabulary, which provides accurate and objective descriptions based on observable findings. While manually writing these reports is time-consuming and labor-intensive, automatic report generation offers a promising alternative. A critical step in this process is to align radiographs wit&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.09586v2-abstract-full').style.display = 'inline'; document.getElementById('2405.09586v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.09586v2-abstract-full" style="display: none;"> A radiology report comprises presentation-style vocabulary, which ensures clarity and organization, and factual vocabulary, which provides accurate and objective descriptions based on observable findings. While manually writing these reports is time-consuming and labor-intensive, automatic report generation offers a promising alternative. A critical step in this process is to align radiographs with their corresponding reports. However, existing methods often rely on complete reports for alignment, overlooking the impact of presentation-style vocabulary. To address this issue, we propose FSE, a two-stage Factual Serialization Enhancement method. In Stage 1, we introduce factuality-guided contrastive learning for visual representation by maximizing the semantic correspondence between radiographs and corresponding factual descriptions. In Stage 2, we present evidence-driven report generation that enhances diagnostic accuracy by integrating insights from similar historical cases structured as factual serialization. 
Experiments on MIMIC-CXR and IU X-ray datasets across specific and general scenarios demonstrate that FSE outperforms state-of-the-art approaches in both natural language generation and clinical efficacy metrics. Ablation studies further emphasize the positive effects of factual serialization in Stage 1 and Stage 2. The code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.09586v2-abstract-full').style.display = 'none'; document.getElementById('2405.09586v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">code is available at</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.02957</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+J">Junkai Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Siyu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+M">Meng Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Weitao Li</a>, <a href="/search/cs?searchtype=author&amp;query=Lai%2C+Y">Yunghwei Lai</a>, <a 
href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xinhui Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+W">Weizhi Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yang Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.02957v1-abstract-short" style="display: inline;"> In this paper, we introduce a simulacrum of hospital called Agent Hospital that simulates the entire process of treating illness. All patients, nurses, and doctors are autonomous agents powered by large language models (LLMs). Our central goal is to enable a doctor agent to learn how to treat illness within the simulacrum. To do so, we propose a method called MedAgent-Zero. As the simulacrum can s&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.02957v1-abstract-full').style.display = 'inline'; document.getElementById('2405.02957v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.02957v1-abstract-full" style="display: none;"> In this paper, we introduce a simulacrum of hospital called Agent Hospital that simulates the entire process of treating illness. All patients, nurses, and doctors are autonomous agents powered by large language models (LLMs). Our central goal is to enable a doctor agent to learn how to treat illness within the simulacrum. To do so, we propose a method called MedAgent-Zero. As the simulacrum can simulate disease onset and progression based on knowledge bases and LLMs, doctor agents can keep accumulating experience from both successful and unsuccessful cases. Simulation experiments show that the treatment performance of doctor agents consistently improves on various tasks. 
More interestingly, the knowledge the doctor agents have acquired in Agent Hospital is applicable to real-world medical benchmarks. After treating around ten thousand patients (which may take real-world doctors over two years), the evolved doctor agent achieves a state-of-the-art accuracy of 93.06% on a subset of the MedQA dataset that covers major respiratory diseases. This work paves the way for advancing the applications of LLM-powered agent techniques in medical scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.02957v1-abstract-full').style.display = 'none'; document.getElementById('2405.02957v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2404.08433</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1109/ICASSP48485.2024.10446699 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+L">Linhuang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xin Kang</a>, <a 
href="/search/cs?searchtype=author&amp;query=Ding%2C+F">Fei Ding</a>, <a href="/search/cs?searchtype=author&amp;query=Nakagawa%2C+S">Satoshi Nakagawa</a>, <a href="/search/cs?searchtype=author&amp;query=Ren%2C+F">Fuji Ren</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2404.08433v1-abstract-short" style="display: inline;"> Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.08433v1-abstract-full').style.display = 'inline'; document.getElementById('2404.08433v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2404.08433v1-abstract-full" style="display: none;"> Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. 
This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach&#39;s proficiency in leveraging spatio-temporal information within DFER. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2404.08433v1-abstract-full').style.display = 'none'; document.getElementById('2404.08433v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 April, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 
IEEE, 2024: 3015-3019 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.06831</a> <span>&nbsp;&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> HDRTransDC: High Dynamic Range Image Reconstruction with Transformer Deformation Convolution </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Shang%2C+S">Shuaikang Shang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xuejing Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Ming%2C+A">Anlong Ming</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.06831v2-abstract-short" style="display: inline;"> High Dynamic Range (HDR) imaging aims to generate an artifact-free HDR image with realistic details by fusing multi-exposure Low Dynamic Range (LDR) images. Caused by large motion and severe under-/over-exposure among input LDR images, HDR imaging suffers from ghosting artifacts and fusion distortions. To address these critical issues, we propose an HDR Transformer Deformation Convolution (HDRTran&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.06831v2-abstract-full').style.display = 'inline'; document.getElementById('2403.06831v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.06831v2-abstract-full" style="display: none;"> High Dynamic Range (HDR) imaging aims to generate an artifact-free HDR image with realistic details by fusing multi-exposure Low Dynamic Range (LDR) images. 
Caused by large motion and severe under-/over-exposure among input LDR images, HDR imaging suffers from ghosting artifacts and fusion distortions. To address these critical issues, we propose an HDR Transformer Deformation Convolution (HDRTransDC) network to generate high-quality HDR images, which consists of the Transformer Deformable Convolution Alignment Module (TDCAM) and the Dynamic Weight Fusion Block (DWFB). To remove ghosting artifacts, the proposed TDCAM extracts long-distance content similar to the reference feature from across the entire non-reference features, which accurately removes misalignment and fills in content occluded by moving objects. To eliminate fusion distortions, we propose DWFB, which selects useful information across frames in a spatially adaptive manner to fuse multi-exposure features effectively. Extensive experiments show that our method quantitatively and qualitatively achieves state-of-the-art performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.06831v2-abstract-full').style.display = 'none'; document.getElementById('2403.06831v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 11 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">We request to withdraw our manuscript due to identified issues: inaccuracies in the description of a submodule&#39;s composition, principles, and functionality in Section 3.2, and potential problems in metric calculation in Sections 4.2 and 4.3. 
To prevent the spread of misleading information, we believe it is necessary to temporarily withdraw the manuscript for further research and verification</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.01800</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> AtomoVideo: High Fidelity Image-to-Video Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Gong%2C+L">Litong Gong</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+Y">Yiran Zhu</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+W">Weijie Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+B">Biao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Ge%2C+T">Tiezheng Ge</a>, <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+B">Bo Zheng</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.01800v2-abstract-short" style="display: inline;"> Recently, video generation has achieved significant rapid development based on superior text-to-image generation techniques. In this work, we propose a high fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. 
In addition, thanks to high quality datasets and training str&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.01800v2-abstract-full').style.display = 'inline'; document.getElementById('2403.01800v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.01800v2-abstract-full" style="display: none;"> Recently, video generation has achieved significant rapid development based on superior text-to-image generation techniques. In this work, we propose a high fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation. Furthermore, due to the design of adapter training, our approach can be well combined with existing personalized models and controllable modules. In quantitative and qualitative evaluations, AtomoVideo achieves superior results compared to popular methods. More examples can be found on our project website: <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.01800v2-abstract-full').style.display = 'none'; document.getElementById('2403.01800v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 4 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Technical report. Page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2402.11178</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> RENOVI: A Benchmark Towards Remediating Norm Violations in Socio-Cultural Conversations </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhan%2C+H">Haolan Zhan</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhuang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoxi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+T">Tao Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Hua%2C+Y">Yuncheng Hua</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+L">Lizhen Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Ying%2C+Y">Yi Ying</a>, <a href="/search/cs?searchtype=author&amp;query=Chandra%2C+M+R">Mei Rianto Chandra</a>, <a href="/search/cs?searchtype=author&amp;query=Rosalin%2C+K">Kelly Rosalin</a>, <a href="/search/cs?searchtype=author&amp;query=Jureynolds%2C+J">Jureynolds Jureynolds</a>, <a href="/search/cs?searchtype=author&amp;query=Sharma%2C+S">Suraj Sharma</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+S">Shilin Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+L">Linhao Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Soon%2C+L">Lay-Ki Soon</a>, <a href="/search/cs?searchtype=author&amp;query=Azad%2C+Z+S">Zhaleh Semnani Azad</a>, <a href="/search/cs?searchtype=author&amp;query=Zukerman%2C+I">Ingrid Zukerman</a>, <a 
href="/search/cs?searchtype=author&amp;query=Haffari%2C+G">Gholamreza Haffari</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2402.11178v1-abstract-short" style="display: inline;"> Norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. Remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. To equip interactive AI systems with a remediation ability, we offer ReNoVi - a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.11178v1-abstract-full').style.display = 'inline'; document.getElementById('2402.11178v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2402.11178v1-abstract-full" style="display: none;"> Norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. Remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. To equip interactive AI systems with a remediation ability, we offer ReNoVi - a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as define a sequence of tasks to help understand and remediate norm violations step by step. ReNoVi consists of two parts: 512 human-authored dialogues (real data), and 8,746 synthetic conversations generated by ChatGPT through prompt learning. While collecting sufficient human-authored data is costly, synthetic conversations provide suitable amounts of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between LLMs and humans in the awareness of social norms. 
We thus harness the power of ChatGPT to generate synthetic training data for our task. To ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. Our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.11178v1-abstract-full').style.display = 'none'; document.getElementById('2402.11178v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">work in progress. 
15 pages, 7 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2401.17644</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Performance">cs.PF</span> </div> </div> <p class="title is-5 mathjax"> BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuxin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Y">Yuhan Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zeyu Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xueze Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Z">Zhenheng Tang</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+X">Xin He</a>, <a href="/search/cs?searchtype=author&amp;query=Guo%2C+R">Rui Guo</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+X">Xin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q">Qiang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+A+C">Amelie Chi Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Chu%2C+X">Xiaowen Chu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2401.17644v3-abstract-short" style="display: inline;"> Serving systems for Large Language Models (LLMs) are often optimized to improve quality of service (QoS) and throughput. However, due to the lack of open-sourced LLM serving workloads, these systems are frequently evaluated under unrealistic workload assumptions. 
Consequently, performance may degrade when these systems are deployed in real-world scenarios. This work presents BurstGPT, an LLM servi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.17644v3-abstract-full').style.display = 'inline'; document.getElementById('2401.17644v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2401.17644v3-abstract-full" style="display: none;"> Serving systems for Large Language Models (LLMs) are often optimized to improve quality of service (QoS) and throughput. However, due to the lack of open-sourced LLM serving workloads, these systems are frequently evaluated under unrealistic workload assumptions. Consequently, performance may degrade when these systems are deployed in real-world scenarios. This work presents BurstGPT, an LLM serving workload with 5.29 million traces from regional Azure OpenAI GPT services over 121 days. BurstGPT captures realistic LLM serving characteristics through detailed tracing of: (1) Concurrency of requests: It traces burstiness variations of requests in Azure OpenAI GPT services, revealing diversified concurrency patterns in different services and model types. (2) Response Lengths of requests: It traces the auto-regressive serving processes of GPT models, showing statistical relations between requests and their responses. (3) Failures of requests: It traces failures of conversation and API services, showing intensive resource needs and limited resource availability of such services in Azure. Details of the characteristics can serve multiple purposes in LLM serving optimizations, such as system evaluation and trace provisioning. In our demo evaluation with BurstGPT, we observe that frequent variations in BurstGPT reveal declines in efficiency, stability, or reliability in realistic LLM serving. 
We identify that the generalization of KV cache management and request scheduling optimization is not guaranteed for different workloads, especially when systems are poorly optimized for unrealistic workloads. We have made the dataset publicly available to encourage further research at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.17644v3-abstract-full').style.display = 'none'; document.getElementById('2401.17644v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 31 January, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2401.01699</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> </div> </div> <p class="title is-5 mathjax"> WordArt Designer API: User-Driven Artistic Typography Synthesis with Large Language Models on ModelScope </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=He%2C+J">Jun-Yan He</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+Z">Zhi-Qi Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chenyang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+J">Jingdong Sun</a>, <a 
href="/search/cs?searchtype=author&amp;query=Xiang%2C+W">Wangmeng Xiang</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+Y">Yusen Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+X">Xianhui Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Jin%2C+Z">Zengke Jin</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+B">Bin Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Geng%2C+Y">Yifeng Geng</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+X">Xuansong Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+J">Jingren Zhou</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2401.01699v2-abstract-short" style="display: inline;"> This paper introduces the WordArt Designer API, a novel framework for user-driven artistic typography synthesis utilizing Large Language Models (LLMs) on ModelScope. We address the challenge of simplifying artistic typography for non-professionals by offering a dynamic, adaptive, and computationally efficient alternative to traditional rigid templates. Our approach leverages the power of LLMs to u&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.01699v2-abstract-full').style.display = 'inline'; document.getElementById('2401.01699v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2401.01699v2-abstract-full" style="display: none;"> This paper introduces the WordArt Designer API, a novel framework for user-driven artistic typography synthesis utilizing Large Language Models (LLMs) on ModelScope. 
We address the challenge of simplifying artistic typography for non-professionals by offering a dynamic, adaptive, and computationally efficient alternative to traditional rigid templates. Our approach leverages the power of LLMs to understand and interpret user input, facilitating a more intuitive design process. We demonstrate through various case studies how users can articulate their aesthetic preferences and functional requirements, which the system then translates into unique and creative typographic designs. Our evaluations indicate significant improvements in user satisfaction, design flexibility, and creative expression over existing systems. The WordArt Designer API not only democratizes the art of typography but also opens up new possibilities for personalized digital communication and design. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.01699v2-abstract-full').style.display = 'none'; document.getElementById('2401.01699v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 January, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 3 January, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Spotlight Paper at the Workshop on Machine Learning for Creativity and Design, 37th Conference on Neural Information Processing Systems (NeurIPS 2023). 
5 pages, 5 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2312.05107</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> DreaMoving: A Human Video Generation Framework based on Diffusion Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Feng%2C+M">Mengyang Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+J">Jinlin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+K">Kai Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+Y">Yuan Yao</a>, <a href="/search/cs?searchtype=author&amp;query=Hui%2C+Z">Zheng Hui</a>, <a href="/search/cs?searchtype=author&amp;query=Guo%2C+X">Xiefan Guo</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+X">Xianhui Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Xue%2C+H">Haolan Xue</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+C">Chen Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+X">Xiaowen Li</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+A">Aojie Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Lei%2C+B">Biwen Lei</a>, <a href="/search/cs?searchtype=author&amp;query=Cui%2C+M">Miaomiao Cui</a>, <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Peiran Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+X">Xuansong Xie</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2312.05107v2-abstract-short" style="display: inline;"> In this paper, 
we present DreaMoving, a diffusion-based controllable video generation framework to produce high-quality customized human videos. Specifically, given target identity and posture sequences, DreaMoving can generate a video of the target identity moving or dancing anywhere driven by the posture sequences. To this end, we propose a Video ControlNet for motion-controlling and a Content G&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.05107v2-abstract-full').style.display = 'inline'; document.getElementById('2312.05107v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2312.05107v2-abstract-full" style="display: none;"> In this paper, we present DreaMoving, a diffusion-based controllable video generation framework to produce high-quality customized human videos. Specifically, given target identity and posture sequences, DreaMoving can generate a video of the target identity moving or dancing anywhere driven by the posture sequences. To this end, we propose a Video ControlNet for motion-controlling and a Content Guider for identity preserving. The proposed model is easy to use and can be adapted to most stylized diffusion models to generate diverse results. The project page is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.05107v2-abstract-full').style.display = 'none'; document.getElementById('2312.05107v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 December, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 8 December, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">5 pages, 5 figures, Tech. Report</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.16207</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Quantitative Methods">q-bio.QM</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> ALNSynergy: a graph convolutional network with multi-representation alignment for drug synergy prediction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+X">Xinxing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+J">Jiachen Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiao Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Pei%2C+G">Guojin Pei</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Keyu Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Genke Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chu%2C+J">Jian Chu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.16207v2-abstract-short" style="display: inline;"> Drug combination refers to the use of two or more drugs to treat a specific disease at the same time. It is currently the mainstream way to treat complex diseases. Compared with single drugs, drug combinations have better efficacy and can better inhibit toxicity and drug resistance. 
The computational model based on deep learning concatenates the representation of multiple drugs and the correspondi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.16207v2-abstract-full').style.display = 'inline'; document.getElementById('2311.16207v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.16207v2-abstract-full" style="display: none;"> Drug combination refers to the use of two or more drugs to treat a specific disease at the same time. It is currently the mainstream way to treat complex diseases. Compared with single drugs, drug combinations have better efficacy and can better inhibit toxicity and drug resistance. The computational model based on deep learning concatenates the representation of multiple drugs and the corresponding cell line feature as input, and the output is whether the drug combination can have an inhibitory effect on the cell line. However, this strategy of concatenating multiple representations has the following defects: the alignment of drug representation and cell line representation is ignored, resulting in the synergistic relationship not being reflected positionally in the embedding space. Moreover, the alignment measurement function in deep learning cannot be suitable for drug synergy prediction tasks due to differences in input types. Therefore, in this work, we propose ALNSynergy, a graph convolutional network with multi-representation alignment for predicting drug synergy. In the ALNSynergy model, we designed a multi-representation alignment function suitable for the drug synergy prediction task so that the positional relationship between drug representations and cell line representation is reflected in the embedding space. 
In addition, the vector modulus of drug representations and cell line representation is considered to improve the accuracy of calculation results and accelerate model convergence. Finally, many relevant experiments were run on multiple drug synergy datasets to verify the effectiveness of the above innovative elements and the excellence of the ALNSynergy model. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.16207v2-abstract-full').style.display = 'none'; document.getElementById('2311.16207v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 27 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">9 pages;</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.02884</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1109/TWC.2023.3330744 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Deep 
Learning-Empowered Semantic Communication Systems with a Shared Knowledge Base </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yi%2C+P">Peng Yi</a>, <a href="/search/cs?searchtype=author&amp;query=Cao%2C+Y">Yang Cao</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xin Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Liang%2C+Y">Ying-Chang Liang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.02884v1-abstract-short" style="display: inline;"> Deep learning-empowered semantic communication is regarded as a promising candidate for future 6G networks. Although existing semantic communication systems have achieved superior performance compared to traditional methods, the end-to-end architecture adopted by most semantic communication systems is regarded as a black box, leading to the lack of explainability. To tackle this issue, in this pap&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.02884v1-abstract-full').style.display = 'inline'; document.getElementById('2311.02884v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.02884v1-abstract-full" style="display: none;"> Deep learning-empowered semantic communication is regarded as a promising candidate for future 6G networks. Although existing semantic communication systems have achieved superior performance compared to traditional methods, the end-to-end architecture adopted by most semantic communication systems is regarded as a black box, leading to the lack of explainability. To tackle this issue, in this paper, a novel semantic communication system with a shared knowledge base is proposed for text transmissions. 
Specifically, a textual knowledge base constructed by inherently readable sentences is introduced into our system. With the aid of the shared knowledge base, the proposed system integrates the message and corresponding knowledge from the shared knowledge base to obtain the residual information, which enables the system to transmit fewer symbols without semantic performance degradation. In order to make the proposed system more reliable, the semantic self-information and the source entropy are mathematically defined based on the knowledge base. Furthermore, the knowledge base construction algorithm is developed based on a similarity-comparison method, in which a pre-configured threshold can be leveraged to control the size of the knowledge base. Moreover, the simulation results have demonstrated that the proposed approach outperforms existing baseline methods in terms of transmitted data size and sentence similarity. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.02884v1-abstract-full').style.display = 'none'; document.getElementById('2311.02884v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">14 pages, Journal, accepted by IEEE TWC</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.18332</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> </div> </div> <p class="title is-5 mathjax"> WordArt Designer: User-Driven Artistic Typography Synthesis using Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=He%2C+J">Jun-Yan He</a>, <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+Z">Zhi-Qi Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+C">Chenyang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Sun%2C+J">Jingdong Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Xiang%2C+W">Wangmeng Xiang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+X">Xianhui Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Jin%2C+Z">Zengke Jin</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+Y">Yusen Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+B">Bin Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Geng%2C+Y">Yifeng Geng</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+X">Xuansong Xie</a>, <a 
href="/search/cs?searchtype=author&amp;query=Zhou%2C+J">Jingren Zhou</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.18332v2-abstract-short" style="display: inline;"> This paper introduces WordArt Designer, a user-driven framework for artistic typography synthesis, relying on the Large Language Model (LLM). The system incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets user inputs and generates actionable prompts for the other modules, thereby transforming abst&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.18332v2-abstract-full').style.display = 'inline'; document.getElementById('2310.18332v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.18332v2-abstract-full" style="display: none;"> This paper introduces WordArt Designer, a user-driven framework for artistic typography synthesis, relying on the Large Language Model (LLM). The system incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets user inputs and generates actionable prompts for the other modules, thereby transforming abstract concepts into tangible designs. 2) The SemTypo module optimizes font designs using semantic concepts, striking a balance between artistic transformation and readability. 3) Building on the semantic layout provided by the SemTypo module, the StyTypo module creates smooth, refined images. 4) The TexTypo module further enhances the design&#39;s aesthetics through texture rendering, enabling the generation of inventive textured fonts. Notably, WordArt Designer highlights the fusion of generative AI with artistic typography. 
Experience its capabilities on ModelScope: <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.18332v2-abstract-full').style.display = 'none'; document.getElementById('2310.18332v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 20 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by EMNLP 2023, 10 pages, 11 figures, 1 table, the system is at</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.15930</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> CDSD: Chinese Dysarthria Speech Database </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Sun%2C+M">Mengyi Sun</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+M">Ming Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xinchen Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Shiru Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+J">Jun Du</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+D">Dengfeng Yao</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Su-Jing Wang</a> </p> <p class="abstract mathjax"> <span 
class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.15930v1-abstract-short" style="display: inline;"> We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. This database comprises speech data from 24 participants with dysarthria. Each participant recorded one hour of speech, and one participant recorded an additional 10 hours, resulting in 34 hours of speech material. To accommodate participants with varying cognitive levels, our text poo&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.15930v1-abstract-full').style.display = 'inline'; document.getElementById('2310.15930v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.15930v1-abstract-full" style="display: none;"> We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. This database comprises speech data from 24 participants with dysarthria. Each participant recorded one hour of speech, and one participant recorded an additional 10 hours, resulting in 34 hours of speech material. To accommodate participants with varying cognitive levels, our text pool primarily consists of content from the AISHELL-1 dataset and speeches by primary and secondary school students. When participants read these texts, they must use a mobile device or the ZOOM F8n multi-track field recorder to record their speeches. In this paper, we elucidate the data collection and annotation processes and present an approach for establishing a baseline for dysarthric speech recognition. Furthermore, we conducted a speaker-dependent dysarthric speech recognition experiment using the additional 10 hours of speech data from one of our participants.
Our research findings indicate that, through extensive data-driven model training, fine-tuning limited quantities of specific individual data yields commendable results in speaker-dependent dysarthric speech recognition. However, we observe significant variations in recognition results among different dysarthric speakers. These insights provide valuable reference points for speaker-dependent dysarthric speech recognition. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.15930v1-abstract-full').style.display = 'none'; document.getElementById('2310.15930v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">9 pages, 3 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.14880</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.18653/v1/2023.findings-emnlp.929 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Can ChatGPT Perform Reasoning Using the IRAC Method in Analyzing Legal Scenarios Like a Lawyer? 
</p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoxi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+L">Lizhen Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Soon%2C+L">Lay-Ki Soon</a>, <a href="/search/cs?searchtype=author&amp;query=Trakic%2C+A">Adnan Trakic</a>, <a href="/search/cs?searchtype=author&amp;query=Zhuo%2C+T+Y">Terry Yue Zhuo</a>, <a href="/search/cs?searchtype=author&amp;query=Emerton%2C+P+C">Patrick Charles Emerton</a>, <a href="/search/cs?searchtype=author&amp;query=Grant%2C+G">Genevieve Grant</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.14880v2-abstract-short" style="display: inline;"> Large Language Models (LLMs), such as ChatGPT, have drawn a lot of attention recently in the legal domain due to their emergent ability to tackle a variety of legal tasks. However, it is still unknown whether LLMs are able to analyze a legal case and perform reasoning in the same manner as lawyers. Therefore, we constructed a novel corpus consisting of scenarios pertaining to Contract Acts Malaysia and Aus&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.14880v2-abstract-full').style.display = 'inline'; document.getElementById('2310.14880v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.14880v2-abstract-full" style="display: none;"> Large Language Models (LLMs), such as ChatGPT, have drawn a lot of attention recently in the legal domain due to their emergent ability to tackle a variety of legal tasks. However, it is still unknown whether LLMs are able to analyze a legal case and perform reasoning in the same manner as lawyers.
Therefore, we constructed a novel corpus consisting of scenarios pertaining to Contract Acts Malaysia and Australian Social Act for Dependent Child. ChatGPT is applied to perform analysis on the corpus using the IRAC method, which is a framework widely used by legal professionals for organizing legal analysis. Each scenario in the corpus is annotated with a complete IRAC analysis in a semi-structured format so that both machines and legal professionals are able to interpret and understand the annotations. In addition, we conducted the first empirical assessment of ChatGPT for IRAC analysis in order to understand how well it aligns with the analysis of legal professionals. Our experimental results shed light on possible future research directions to improve alignment between LLMs and legal experts in terms of legal reasoning. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.14880v2-abstract-full').style.display = 'none'; document.getElementById('2310.14880v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 23 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023.
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">EMNLP 2023 Findings</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Report number:</span> 2023.findings-emnlp.929 </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> 2023.findings-emnlp.929 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.12670</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Performance">cs.PF</span> </div> </div> <p class="title is-5 mathjax"> Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuxin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xueze Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+S">Shaohuai Shi</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+X">Xin He</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Z">Zhenheng Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Pan%2C+X">Xinglin Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Zheng%2C+Y">Yang Zheng</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+X">Xiaoyu Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+A+C">Amelie Chi Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=He%2C+B">Bingsheng He</a>, <a href="/search/cs?searchtype=author&amp;query=Chu%2C+X">Xiaowen Chu</a> </p> <p class="abstract 
mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.12670v4-abstract-short" style="display: inline;"> To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition betw&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.12670v4-abstract-full').style.display = 'inline'; document.getElementById('2310.12670v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.12670v4-abstract-full" style="display: none;"> To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which can work under DP but can hardly scale under resource-intensive HP. To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It strives from two aspects to mitigate the on-host resource competition caused by in-memory checkpointing: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. 
This approach uses three-level asynchronous on-device scheduling to enhance parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to enhance checkpoint completeness during hardware failures. Unlike methods that require inter-node communications, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart for failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficiencies in NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs). <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.12670v4-abstract-full').style.display = 'none'; document.getElementById('2310.12670v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 19 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Fault Tolerance, Checkpoint Optimization, Large Language Model, Foundation Model, Hybrid parallelism</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.11178</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> </div> </div> <p class="title is-5 mathjax"> FocDepthFormer: Transformer with LSTM for Depth Estimation from Focus </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xueyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+F">Fengze Han</a>, <a href="/search/cs?searchtype=author&amp;query=Fayjie%2C+A">Abdur Fayjie</a>, <a href="/search/cs?searchtype=author&amp;query=Gong%2C+D">Dong Gong</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.11178v1-abstract-short" style="display: inline;"> Depth estimation from focal stacks is a fundamental computer vision problem that aims to infer depth from focus/defocus cues in the image stacks. Most existing methods tackle this problem by applying convolutional neural networks (CNNs) with 2D or 3D convolutions over a set of fixed stack images to learn features across images and stacks. 
Their performance is restricted due to the local properties&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.11178v1-abstract-full').style.display = 'inline'; document.getElementById('2310.11178v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.11178v1-abstract-full" style="display: none;"> Depth estimation from focal stacks is a fundamental computer vision problem that aims to infer depth from focus/defocus cues in the image stacks. Most existing methods tackle this problem by applying convolutional neural networks (CNNs) with 2D or 3D convolutions over a set of fixed stack images to learn features across images and stacks. Their performance is restricted due to the local properties of the CNNs, and they are constrained to process a fixed number of stacks consistent between training and inference, limiting generalization to stacks of arbitrary length. To handle the above limitations, we develop a novel Transformer-based network, FocDepthFormer, composed mainly of a Transformer with an LSTM module and a CNN decoder. The self-attention in the Transformer enables learning more informative features via an implicit non-local cross reference. The LSTM module learns to integrate the representations across a stack with an arbitrary number of images. To directly capture the low-level features of various degrees of focus/defocus, we propose to use multi-scale convolutional kernels in an early-stage encoder. Benefiting from the design with LSTM, our FocDepthFormer can be pre-trained with abundant monocular RGB depth estimation data for visual pattern capturing, alleviating the demand for the hard-to-collect focal stack data. Extensive experiments on various focal stack benchmark datasets show that our model outperforms the state-of-the-art models on multiple metrics.
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.11178v1-abstract-full').style.display = 'none'; document.getElementById('2310.11178v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">20 pages, 18 figures, journal paper</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">ACM Class:</span> I.4.9; I.2.10 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2309.15526</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1145/3581783.3612356 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> P2I-NET: Mapping Camera Pose to Image via Adversarial Learning for New View Synthesis in Real Indoor Environments </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xujie Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Kanglin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Duan%2C+J">Jiang Duan</a>, <a href="/search/cs?searchtype=author&amp;query=Gong%2C+Y">Yuanhao Gong</a>, <a 
href="/search/cs?searchtype=author&amp;query=Qiu%2C+G">Guoping Qiu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2309.15526v1-abstract-short" style="display: inline;"> Given a new $6DoF$ camera pose in an indoor environment, we study the challenging problem of predicting the view from that pose based on a set of reference RGBD views. Existing explicit or implicit 3D geometry construction methods are computationally expensive while those based on learning have predominantly focused on isolated views of object categories with regular geometric structure. Differing&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.15526v1-abstract-full').style.display = 'inline'; document.getElementById('2309.15526v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2309.15526v1-abstract-full" style="display: none;"> Given a new $6DoF$ camera pose in an indoor environment, we study the challenging problem of predicting the view from that pose based on a set of reference RGBD views. Existing explicit or implicit 3D geometry construction methods are computationally expensive while those based on learning have predominantly focused on isolated views of object categories with regular geometric structure. Differing from the traditional \textit{render-inpaint} approach to new view synthesis in the real indoor environment, we propose a conditional generative adversarial neural network (P2I-NET) to directly predict the new view from the given pose. P2I-NET learns the conditional distribution of the images of the environment for establishing the correspondence between the camera pose and its view of the environment, and achieves this through a number of innovative designs in its architecture and training loss function. 
Two auxiliary discriminator constraints are introduced for enforcing the consistency between the pose of the generated image and that of the corresponding real world image in both the latent feature space and the real world pose space. Additionally a deep convolutional neural network (CNN) is introduced to further reinforce this consistency in the pixel space. We have performed extensive new view synthesis experiments on real indoor datasets. Results show that P2I-NET has superior performance against a number of NeRF based strong baseline models. In particular, we show that P2I-NET is 40 to 100 times faster than these competitor techniques while synthesising similar quality images. Furthermore, we contribute a new publicly available indoor environment dataset containing 22 high resolution RGBD videos where each frame also has accurate camera pose parameters. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.15526v1-abstract-full').style.display = 'none'; document.getElementById('2309.15526v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2023. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2309.11092</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> </div> </div> <p class="title is-5 mathjax"> Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Luo%2C+A">Anwei Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+R">Rizhao Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Kong%2C+C">Chenqi Kong</a>, <a href="/search/cs?searchtype=author&amp;query=Ju%2C+Y">Yakun Ju</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiangui Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+J">Jiwu Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Kot%2C+A+C">Alex C. Kot</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2309.11092v2-abstract-short" style="display: inline;"> With the rapid progress of generative models, the current challenge in face forgery detection is how to effectively detect realistic manipulated faces from different unseen domains. 
Though previous studies show that pre-trained Vision Transformer (ViT) based models can achieve some promising results after fully fine-tuning on the Deepfake dataset, their generalization performances are still unsati&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.11092v2-abstract-full').style.display = 'inline'; document.getElementById('2309.11092v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2309.11092v2-abstract-full" style="display: none;"> With the rapid progress of generative models, the current challenge in face forgery detection is how to effectively detect realistic manipulated faces from different unseen domains. Though previous studies show that pre-trained Vision Transformer (ViT) based models can achieve some promising results after fully fine-tuning on the Deepfake dataset, their generalization performances are still unsatisfactory. One possible reason is that fully fine-tuned ViT-based models may disrupt the pre-trained features [1, 2] and overfit to some data-specific patterns [3]. To alleviate this issue, we present a \textbf{F}orgery-aware \textbf{A}daptive \textbf{Vi}sion \textbf{T}ransformer (FA-ViT) under the adaptive learning paradigm, where the parameters in the pre-trained ViT are kept fixed while the designed adaptive modules are optimized to capture forgery features. Specifically, a global adaptive module is designed to model long-range interactions among input tokens, which takes advantage of self-attention mechanism to mine global forgery clues. To further explore essential local forgery clues, a local adaptive module is proposed to expose local inconsistencies by enhancing the local contextual association. 
In addition, we introduce a fine-grained adaptive learning module that emphasizes the common compact representation of genuine faces through relationship learning in fine-grained pairs, driving these proposed adaptive modules to be aware of fine-grained forgery-aware information. Extensive experiments demonstrate that our FA-ViT achieves state-of-the-arts results in the cross-dataset evaluation, and enhances the robustness against unseen perturbations. Particularly, FA-ViT achieves 93.83\% and 78.32\% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation. The code and trained model have been released at: <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2309.11092v2-abstract-full').style.display = 'none'; document.getElementById('2309.11092v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 20 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2023. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2308.06405</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> </div> <p class="title is-5 mathjax"> White-box Membership Inference Attacks against Diffusion Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Pang%2C+Y">Yan Pang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+T">Tianhao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xuhui Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Huai%2C+M">Mengdi Huai</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yang Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2308.06405v3-abstract-short" style="display: inline;"> Diffusion models have begun to overshadow GANs and other generative models in industrial applications due to their superior image generation performance. The complex architecture of these models furnishes an extensive array of attack features. In light of this, we aim to design membership inference attacks (MIAs) catered to diffusion models. 
We first conduct an exhaustive analysis of existing MIAs&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2308.06405v3-abstract-full').style.display = 'inline'; document.getElementById('2308.06405v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2308.06405v3-abstract-full" style="display: none;"> Diffusion models have begun to overshadow GANs and other generative models in industrial applications due to their superior image generation performance. The complex architecture of these models furnishes an extensive array of attack features. In light of this, we aim to design membership inference attacks (MIAs) catered to diffusion models. We first conduct an exhaustive analysis of existing MIAs on diffusion models, taking into account factors such as black-box/white-box models and the selection of attack features. We found that white-box attacks are highly applicable in real-world scenarios, and the most effective attacks presently are white-box. Departing from earlier research, which employs model loss as the attack feature for white-box MIAs, we employ model gradients in our attack, leveraging the fact that these gradients provide a more profound understanding of model responses to various samples. We subject these models to rigorous testing across a range of parameters, including training steps, sampling frequency, diffusion steps, and data variance. Across all experimental settings, our method consistently demonstrated near-flawless attack performance, with attack success rate approaching 100% and attack AUCROC near 1.0. We also evaluate our attack against common defense mechanisms, and observe our attacks continue to exhibit commendable performance. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2308.06405v3-abstract-full').style.display = 'none'; document.getElementById('2308.06405v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 11 August, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2307.04427</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="High Energy Astrophysical Phenomena">astro-ph.HE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Astrophysics of Galaxies">astro-ph.GA</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1126/science.adc9818 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Observation of high-energy neutrinos from the Galactic plane </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Abbasi%2C+R">R. Abbasi</a>, <a href="/search/cs?searchtype=author&amp;query=Ackermann%2C+M">M. Ackermann</a>, <a href="/search/cs?searchtype=author&amp;query=Adams%2C+J">J. Adams</a>, <a href="/search/cs?searchtype=author&amp;query=Aguilar%2C+J+A">J. A. Aguilar</a>, <a href="/search/cs?searchtype=author&amp;query=Ahlers%2C+M">M. 
Ahlers</a>, <a href="/search/cs?searchtype=author&amp;query=Ahrens%2C+M">M. Ahrens</a>, <a href="/search/cs?searchtype=author&amp;query=Alameddine%2C+J+M">J. M. Alameddine</a>, <a href="/search/cs?searchtype=author&amp;query=Alves%2C+A+A">A. A. Alves Jr.</a>, <a href="/search/cs?searchtype=author&amp;query=Amin%2C+N+M">N. M. Amin</a>, <a href="/search/cs?searchtype=author&amp;query=Andeen%2C+K">K. Andeen</a>, <a href="/search/cs?searchtype=author&amp;query=Anderson%2C+T">T. Anderson</a>, <a href="/search/cs?searchtype=author&amp;query=Anton%2C+G">G. Anton</a>, <a href="/search/cs?searchtype=author&amp;query=Arg%C3%BCelles%2C+C">C. Argüelles</a>, <a href="/search/cs?searchtype=author&amp;query=Ashida%2C+Y">Y. Ashida</a>, <a href="/search/cs?searchtype=author&amp;query=Athanasiadou%2C+S">S. Athanasiadou</a>, <a href="/search/cs?searchtype=author&amp;query=Axani%2C+S">S. Axani</a>, <a href="/search/cs?searchtype=author&amp;query=Bai%2C+X">X. Bai</a>, <a href="/search/cs?searchtype=author&amp;query=V.%2C+A+B">A. Balagopal V.</a>, <a href="/search/cs?searchtype=author&amp;query=Barwick%2C+S+W">S. W. Barwick</a>, <a href="/search/cs?searchtype=author&amp;query=Basu%2C+V">V. Basu</a>, <a href="/search/cs?searchtype=author&amp;query=Baur%2C+S">S. Baur</a>, <a href="/search/cs?searchtype=author&amp;query=Bay%2C+R">R. Bay</a>, <a href="/search/cs?searchtype=author&amp;query=Beatty%2C+J+J">J. J. Beatty</a>, <a href="/search/cs?searchtype=author&amp;query=Becker%2C+K+-">K. -H. Becker</a>, <a href="/search/cs?searchtype=author&amp;query=Tjus%2C+J+B">J. Becker Tjus</a> , et al. (364 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2307.04427v1-abstract-short" style="display: inline;"> The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth&#39;s atmosphere, has been a mystery for over a century. 
Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrin&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.04427v1-abstract-full').style.display = 'inline'; document.getElementById('2307.04427v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2307.04427v1-abstract-full" style="display: none;"> The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth&#39;s atmosphere, has been a mystery for over a century. Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrino emission using machine learning techniques applied to ten years of data from the IceCube Neutrino Observatory. We identify neutrino emission from the Galactic plane at the 4.5$σ$ level of significance, by comparing diffuse emission models to a background-only hypothesis. The signal is consistent with modeled diffuse emission from the Galactic plane, but could also arise from a population of unresolved point sources. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.04427v1-abstract-full').style.display = 'none'; document.getElementById('2307.04427v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 July, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Submitted on May 12th, 2022; Accepted on May 4th, 2023</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Science 380, 6652, 1338-1343 (2023) </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2306.10359</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Text-Driven Foley Sound Generation With Latent Diffusion Model </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+Y">Yi Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Haohe Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xubo Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiyuan Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+P">Peipei Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Plumbley%2C+M+D">Mark D. 
Plumbley</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wenwu Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2306.10359v5-abstract-short" style="display: inline;"> Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.10359v5-abstract-full').style.display = 'inline'; document.getElementById('2306.10359v5-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2306.10359v5-abstract-full" style="display: none;"> Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pretraining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. 
In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks ${1}^{st}$ among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The codes for implementing the proposed system are available online. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.10359v5-abstract-full').style.display = 'none'; document.getElementById('2306.10359v5-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 17 June, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Submit to DCASE-workshop 2023, an extension and supersedes the previous technical report arXiv:2305.15905</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.15905</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7 </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+Y">Yi Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+H">Haohe Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xubo Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiyuan Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Plumbley%2C+M+D">Mark D. Plumbley</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+W">Wenwu Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.15905v3-abstract-short" style="display: inline;"> Foley sound presents the background sound for multimedia content and the generation of Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we proposed a system for DCASE 2023 challenge task 7: Foley Sound Synthesis. 
The proposed system is based on AudioLDM, which is a diffusion-based text-to-audio generation model. To alleviate the data-hungry pr&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.15905v3-abstract-full').style.display = 'inline'; document.getElementById('2305.15905v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.15905v3-abstract-full" style="display: none;"> Foley sound presents the background sound for multimedia content and the generation of Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we proposed a system for DCASE 2023 challenge task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, which is a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system first trained with large-scale datasets and then downstreamed into this DCASE task via transfer learning. Through experiments, we found out that the feature extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by leveraging the input label with related text embedding features obtained by a significant language model, i.e., contrastive language-audio pretraining (CLAP). In addition, we utilize a filtering strategy to further refine the output, i.e. by selecting the best results from the candidate clips generated in terms of the similarity score between the sound and target labels. The overall system achieves a Frechet audio distance (FAD) score of 4.765 on average among all seven different classes, substantially outperforming the baseline system which performs a FAD score of 9.7. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.15905v3-abstract-full').style.display = 'none'; document.getElementById('2305.15905v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 25 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">DCASE 2023 task 7 technical report, ranked 1st in the challenge</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2304.12489</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> </div> <p class="title is-5 mathjax"> Beyond the Prior Forgery Knowledge: Mining Critical Clues for General Face Forgery Detection </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Luo%2C+A">Anwei Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Kong%2C+C">Chenqi Kong</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+J">Jiwu Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+Y">Yongjian Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiangui Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Kot%2C+A+C">Alex C. 
Kot</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2304.12489v1-abstract-short" style="display: inline;"> Face forgery detection is essential in combating malicious digital face attacks. Previous methods mainly rely on prior expert knowledge to capture specific forgery clues, such as noise patterns, blending boundaries, and frequency artifacts. However, these methods tend to get trapped in local optima, resulting in limited robustness and generalization capability. To address these issues, we propose&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.12489v1-abstract-full').style.display = 'inline'; document.getElementById('2304.12489v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2304.12489v1-abstract-full" style="display: none;"> Face forgery detection is essential in combating malicious digital face attacks. Previous methods mainly rely on prior expert knowledge to capture specific forgery clues, such as noise patterns, blending boundaries, and frequency artifacts. However, these methods tend to get trapped in local optima, resulting in limited robustness and generalization capability. To address these issues, we propose a novel Critical Forgery Mining (CFM) framework, which can be flexibly assembled with various backbones to boost their generalization and robustness performance. Specifically, we first build a fine-grained triplet and suppress specific forgery traces through prior knowledge-agnostic data augmentation. Subsequently, we propose a fine-grained relation learning prototype to mine critical information in forgeries through instance and local similarity-aware losses. 
Moreover, we design a novel progressive learning controller to guide the model to focus on principal feature components, enabling it to learn critical forgery features in a coarse-to-fine manner. The proposed method achieves state-of-the-art forgery detection performance under various challenging evaluation settings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.12489v1-abstract-full').style.display = 'none'; document.getElementById('2304.12489v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 April, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2304.12026</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> SocialDial: A Benchmark for Socially-Aware Dialogue Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhan%2C+H">Haolan Zhan</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zhuang Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yufei Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+L">Linhao Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+T">Tao Feng</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoxi Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Hua%2C+Y">Yuncheng Hua</a>, <a href="/search/cs?searchtype=author&amp;query=Qu%2C+L">Lizhen Qu</a>, <a href="/search/cs?searchtype=author&amp;query=Soon%2C+L">Lay-Ki Soon</a>, <a 
href="/search/cs?searchtype=author&amp;query=Sharma%2C+S">Suraj Sharma</a>, <a href="/search/cs?searchtype=author&amp;query=Zukerman%2C+I">Ingrid Zukerman</a>, <a href="/search/cs?searchtype=author&amp;query=Semnani-Azad%2C+Z">Zhaleh Semnani-Azad</a>, <a href="/search/cs?searchtype=author&amp;query=Haffari%2C+G">Gholamreza Haffari</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2304.12026v1-abstract-short" style="display: inline;"> Dialogue systems have been widely applied in many scenarios and are now more powerful and ubiquitous than ever before. With large neural models and massive available data, current dialogue systems have access to more knowledge than any people in their life. However, current dialogue systems still do not perform at a human level. One major gap between conversational agents and humans lies in their&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.12026v1-abstract-full').style.display = 'inline'; document.getElementById('2304.12026v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2304.12026v1-abstract-full" style="display: none;"> Dialogue systems have been widely applied in many scenarios and are now more powerful and ubiquitous than ever before. With large neural models and massive available data, current dialogue systems have access to more knowledge than any people in their life. However, current dialogue systems still do not perform at a human level. One major gap between conversational agents and humans lies in their abilities to be aware of social norms. The development of socially-aware dialogue systems is impeded due to the lack of resources. In this paper, we present the first socially-aware dialogue corpus - SocialDial, based on Chinese social culture. 
SocialDial consists of two parts: 1,563 multi-turn dialogues between two human speakers with fine-grained labels, and 4,870 synthetic conversations generated by ChatGPT. The human corpus covers five categories of social norms, which have 14 sub-categories in total. Specifically, it contains social factor annotations including social relation, context, social distance, and social norms. However, collecting sufficient socially-aware dialogues is costly. Thus, we harness the power of ChatGPT and devise an ontology-based synthetic data generation framework. This framework is able to generate synthetic data at scale. To ensure the quality of synthetic dialogues, we design several mechanisms for quality control during data collection. Finally, we evaluate our dataset using several pre-trained models, such as BERT and RoBERTa. Comprehensive empirical results based on state-of-the-art neural models demonstrate that modeling of social norms for dialogue systems is a promising research direction. To the best of our knowledge, SocialDial is the first socially-aware dialogue dataset that covers multiple social factors and has fine-grained labels. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.12026v1-abstract-full').style.display = 'none'; document.getElementById('2304.12026v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 April, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by SIGIR 2023</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2304.07486</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Region-Enhanced Feature Learning for Scene Semantic Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xin Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+C">Chaoqun Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+X">Xuejin Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2304.07486v3-abstract-short" style="display: inline;"> Semantic segmentation in complex scenes relies not only on object appearance but also on object location and the surrounding environment. Nonetheless, it is difficult to model long-range context in the format of pairwise point correlations due to the huge computational cost for large-scale point clouds. 
In this paper, we propose using regions as the intermediate representation of point clouds inst&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.07486v3-abstract-full').style.display = 'inline'; document.getElementById('2304.07486v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2304.07486v3-abstract-full" style="display: none;"> Semantic segmentation in complex scenes relies not only on object appearance but also on object location and the surrounding environment. Nonetheless, it is difficult to model long-range context in the format of pairwise point correlations due to the huge computational cost for large-scale point clouds. In this paper, we propose using regions as the intermediate representation of point clouds instead of fine-grained points or voxels to reduce the computational burden. We introduce a novel Region-Enhanced Feature Learning Network (REFL-Net) that leverages region correlations to enhance point feature learning. We design a region-based feature enhancement (RFE) module, which consists of a Semantic-Spatial Region Extraction stage and a Region Dependency Modeling stage. In the first stage, the input points are grouped into a set of regions based on their semantic and spatial proximity. In the second stage, we explore inter-region semantic and spatial relationships by employing a self-attention block on region features and then fuse point features with the region features to obtain more discriminative representations. Our proposed RFE module is plug-and-play and can be integrated with common semantic segmentation backbones. We conduct extensive experiments on ScanNetV2 and S3DIS datasets and evaluate our RFE module with different segmentation backbones. Our REFL-Net achieves 1.8% mIoU gain on ScanNetV2 and 1.7% mIoU gain on S3DIS with negligible computational cost compared with backbone models. 
Both quantitative and qualitative results show the powerful long-range context modeling ability and strong generalization ability of our REFL-Net. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.07486v3-abstract-full').style.display = 'none'; document.getElementById('2304.07486v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 January, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 April, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by IEEE Transactions on Multimedia 2023</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2304.06972</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Fluid Dynamics">physics.flu-dyn</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1063/5.0155555 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Multi-fidelity prediction of fluid flow and temperature field based on transfer learning using Fourier Neural Operator </p> <p class="authors"> <span class="search-hit">Authors:</span> <a 
href="/search/cs?searchtype=author&amp;query=Lyu%2C+Y">Yanfang Lyu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+X">Xiaoyu Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Gong%2C+Z">Zhiqiang Gong</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiao Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+W">Wen Yao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2304.06972v1-abstract-short" style="display: inline;"> Data-driven prediction of fluid flow and temperature distribution in marine and aerospace engineering has received extensive research and demonstrated its potential in real-time prediction recently. However, usually large amounts of high-fidelity data are required to describe and accurately predict the complex physical information, while in reality, only limited high-fidelity data is available due&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.06972v1-abstract-full').style.display = 'inline'; document.getElementById('2304.06972v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2304.06972v1-abstract-full" style="display: none;"> Data-driven prediction of fluid flow and temperature distribution in marine and aerospace engineering has received extensive research and demonstrated its potential in real-time prediction recently. However, usually large amounts of high-fidelity data are required to describe and accurately predict the complex physical information, while in reality, only limited high-fidelity data is available due to the high experiment/computational cost. 
Therefore, this work proposes a novel multi-fidelity learning method based on the Fourier Neural Operator that combines abundant low-fidelity data with limited high-fidelity data under a transfer learning paradigm. First, as a resolution-invariant operator, the Fourier Neural Operator is applied for the first time, and to good effect, to integrate multi-fidelity data directly, so that the scarce high-fidelity data and the abundant low-fidelity data can be used simultaneously. Then, a transfer learning framework is developed for the current task, extracting knowledge from the rich low-fidelity data to assist high-fidelity model training and further improve data-driven prediction accuracy. Finally, three typical fluid and temperature prediction problems are chosen to validate the accuracy of the proposed multi-fidelity model. The results demonstrate that the proposed method is highly effective compared with other high-fidelity models and achieves a modeling accuracy of 99% for all the selected physical field problems. Significantly, the proposed multi-fidelity learning method offers a simple structure with high precision, which can provide a reference for the construction of subsequent models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.06972v1-abstract-full').style.display = 'none'; document.getElementById('2304.06972v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 April, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2023. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2303.08682</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> RSFNet: A White-Box Image Retouching Approach using Region-Specific Color Filters </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ouyang%2C+W">Wenqi Ouyang</a>, <a href="/search/cs?searchtype=author&amp;query=Dong%2C+Y">Yi Dong</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Peiran Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+X">Xin Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+X">Xuansong Xie</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2303.08682v2-abstract-short" style="display: inline;"> Retouching images is an essential aspect of enhancing the visual appeal of photos. Although users often share common aesthetic preferences, their retouching methods may vary based on their individual preferences. Therefore, there is a need for white-box approaches that produce satisfying results and enable users to conveniently edit their images simultaneously. 
Recent white-box retouching methods&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2303.08682v2-abstract-full').style.display = 'inline'; document.getElementById('2303.08682v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2303.08682v2-abstract-full" style="display: none;"> Retouching images is an essential aspect of enhancing the visual appeal of photos. Although users often share common aesthetic preferences, their retouching methods may vary based on their individual preferences. Therefore, there is a need for white-box approaches that produce satisfying results and enable users to conveniently edit their images simultaneously. Recent white-box retouching methods rely on cascaded global filters that provide image-level filter arguments but cannot perform fine-grained retouching. In contrast, colorists typically employ a divide-and-conquer approach, performing a series of region-specific fine-grained enhancements when using traditional tools like Davinci Resolve. We draw on this insight to develop a white-box framework for photo retouching using parallel region-specific filters, called RSFNet. Our model generates filter arguments (e.g., saturation, contrast, hue) and attention maps of regions for each filter simultaneously. Instead of cascading filters, RSFNet employs linear summations of filters, allowing for a more diverse range of filter classes that can be trained more easily. Our experiments demonstrate that RSFNet achieves state-of-the-art results, offering satisfying aesthetic appeal and increased user convenience for editable white-box retouching. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2303.08682v2-abstract-full').style.display = 'none'; document.getElementById('2303.08682v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 August, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 March, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by ICCV 2023</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2301.11709</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.4018/IJCINI.309991 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Semantic Network Model for Sign Language Comprehension </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xinchen Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+D">Dengfeng Yao</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+M">Minghu Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+Y">Yunlong Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+F">Fanshu Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2301.11709v1-abstract-short" style="display: inline;"> In this study, the authors propose a computational cognitive model for sign language (SL) perception and comprehension with detailed algorithmic descriptions based on cognitive functionalities in human language processing. The semantic network model (SNM), which represents semantic relations between concepts, is used as a form of knowledge representation. The proposed model is applied in the comp&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2301.11709v1-abstract-full').style.display = 'inline'; document.getElementById('2301.11709v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2301.11709v1-abstract-full" style="display: none;"> In this study, the authors propose a computational cognitive model for sign language (SL) perception and comprehension with detailed algorithmic descriptions based on cognitive functionalities in human language processing. The semantic network model (SNM), which represents semantic relations between concepts, is used as a form of knowledge representation. The proposed model is applied in the comprehension of sign language for classifier predicates. The spreading activation search method is initiated by labeling a set of source nodes (e.g. concepts in the semantic network) with weights or &#34;activation&#34; and then iteratively propagating or &#34;spreading&#34; that activation out to other nodes linked to the source nodes. The results demonstrate that the proposed search method improves the performance of sign language comprehension in the SNM. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2301.11709v1-abstract-full').style.display = 'none'; document.getElementById('2301.11709v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 January, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">19 pages, 6 figures and 1 table</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Kang, X., Yao, D., Jiang, M., Huang, Y., &amp; Li, F. (2022). Semantic Network Model for Sign Language Comprehension. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 16(1), 1-19 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2212.11613</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiaoyang Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+T">Tao Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Ouyang%2C+W">Wenqi Ouyang</a>, <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Peiran Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+L">Lingzhi Li</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+X">Xuansong Xie</a> </p> <p 
class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2212.11613v5-abstract-short" style="display: inline;"> Image colorization is a challenging problem due to multi-modal uncertainty and high ill-posedness. Directly training a deep neural network usually leads to incorrect semantic colors and low color richness. While transformer-based methods can deliver better results, they often rely on manually designed priors, suffer from poor generalization ability, and introduce color bleeding effects. To address&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.11613v5-abstract-full').style.display = 'inline'; document.getElementById('2212.11613v5-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2212.11613v5-abstract-full" style="display: none;"> Image colorization is a challenging problem due to multi-modal uncertainty and high ill-posedness. Directly training a deep neural network usually leads to incorrect semantic colors and low color richness. While transformer-based methods can deliver better results, they often rely on manually designed priors, suffer from poor generalization ability, and introduce color bleeding effects. To address these issues, we propose DDColor, an end-to-end method with dual decoders for image colorization. Our approach includes a pixel decoder and a query-based color decoder. The former restores the spatial resolution of the image, while the latter utilizes rich visual features to refine color queries, thus avoiding hand-crafted priors. Our two decoders work together to establish correlations between color and multi-scale semantic representations via cross-attention, significantly alleviating the color bleeding effect. Additionally, a simple yet effective colorfulness loss is introduced to enhance the color richness. 
Extensive experiments demonstrate that DDColor achieves superior performance to existing state-of-the-art works both quantitatively and qualitatively. The codes and models are publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.11613v5-abstract-full').style.display = 'none'; document.getElementById('2212.11613v5-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 September, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 22 December, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2022. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ICCV 2023; Code:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2212.02198</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> </div> </div> <p class="title is-5 mathjax"> Rethinking Generative Methods for Image Restoration in Physics-based Vision: A Theoretical Analysis from the Perspective of Information </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xudong Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+H">Haoran Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Wong%2C+M">Man-Leung Wong</a>, <a href="/search/cs?searchtype=author&amp;query=Qin%2C+J">Jing Qin</a> </p> <p class="abstract 
mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2212.02198v2-abstract-short" style="display: inline;"> End-to-end generative methods are considered a more promising solution for image restoration in physics-based vision compared with the traditional deconstructive methods based on handcrafted composition models. However, existing generative methods still have plenty of room for improvement in quantitative performance. More crucially, these methods are considered black boxes due to weak interpretabi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.02198v2-abstract-full').style.display = 'inline'; document.getElementById('2212.02198v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2212.02198v2-abstract-full" style="display: none;"> End-to-end generative methods are considered a more promising solution for image restoration in physics-based vision compared with the traditional deconstructive methods based on handcrafted composition models. However, existing generative methods still have plenty of room for improvement in quantitative performance. More crucially, these methods are considered black boxes due to weak interpretability, and there is rarely a theory that tries to explain their mechanism and learning process. In this study, we try to re-interpret these generative methods for image restoration tasks using information theory. Departing from the conventional understanding, we analyzed the information flow of these methods and identified three sources of information (extracted high-level information, retained low-level information, and external information that is absent from the source inputs) that are involved and individually optimized in generating the restoration results. 
We further derived their learning behaviors, optimization objectives, and the corresponding information boundaries by extending the information bottleneck principle. Based on this theoretical framework, we found that many existing generative methods tend to be direct applications of the general models designed for conventional generation tasks, which may suffer from problems including over-invested abstraction processes, inherent detail loss, and vanishing gradients or imbalance in training. We analyzed these issues with both intuitive and theoretical explanations and proved each of them with empirical evidence. Ultimately, we proposed general solutions or ideas to address the above issues and validated these approaches with performance boosts on six datasets of three different image restoration tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.02198v2-abstract-full').style.display = 'none'; document.getElementById('2212.02198v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 December, 2022; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 5 December, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2022. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2212.01618</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Cryptography and Security">cs.CR</span> </div> </div> <p class="title is-5 mathjax"> An Overview of Trust Standards for Communication Networks and Future Digital World </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Huilin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xin Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+T">Tieyan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Lei%2C+Z">Zhongding Lei</a>, <a href="/search/cs?searchtype=author&amp;query=Chu%2C+C">Cheng-Kang Chu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Haiguang Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2212.01618v1-abstract-short" style="display: inline;"> With the development of Information and Communication Technologies, trust has been applied more and more in various scenarios. At the same time, different organizations have published a series of trust frameworks to support the implementation of trust. There are also academic papers discussing these trust standards; however, most of them focus only on a specific application. 
Unlike existing w&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.01618v1-abstract-full').style.display = 'inline'; document.getElementById('2212.01618v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2212.01618v1-abstract-full" style="display: none;"> With the development of Information and Communication Technologies, trust has been applied more and more in various scenarios. At the same time, different organizations have published a series of trust frameworks to support the implementation of trust. There are also academic papers discussing these trust standards; however, most of them focus only on a specific application. Unlike existing works, this paper provides an overview of all currently available trust standards related to communication networks and the future digital world from several main organizations. To be specific, this paper summarizes and organizes all these trust standards into three layers: trust foundation, trust elements, and trust applications. We then analyze these trust standards and discuss their contributions in a systematic way. We discuss the motivations behind each standard currently in force, analyze their frameworks and solutions, and present their role and impact on communication networks and the future digital world. Finally, we give our suggestions on the trust work that needs to be standardized in the future. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.01618v1-abstract-full').style.display = 'none'; document.getElementById('2212.01618v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 December, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2022. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">7 pages, 3 figures, Magazine paper under review</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2211.14439</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Science and Game Theory">cs.GT</span> </div> </div> <p class="title is-5 mathjax"> Incentive-boosted Federated Crowdsourcing </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiangping Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+G">Guoxian Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+J">Jun Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Guo%2C+W">Wei Guo</a>, <a href="/search/cs?searchtype=author&amp;query=Domeniconi%2C+C">Carlotta Domeniconi</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+J">Jinglin Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2211.14439v2-abstract-short" style="display: inline;"> Crowdsourcing is a favorable computing paradigm for processing computer-hard tasks by harnessing human intelligence. However, generic crowdsourcing systems may lead to privacy-leakage through the sharing of worker data. To tackle this problem, we propose a novel approach, called iFedCrowd (incentive-boosted Federated Crowdsourcing), to manage the privacy and quality of crowdsourcing projects. 
iFed&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2211.14439v2-abstract-full').style.display = 'inline'; document.getElementById('2211.14439v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2211.14439v2-abstract-full" style="display: none;"> Crowdsourcing is a favorable computing paradigm for processing computer-hard tasks by harnessing human intelligence. However, generic crowdsourcing systems may lead to privacy-leakage through the sharing of worker data. To tackle this problem, we propose a novel approach, called iFedCrowd (incentive-boosted Federated Crowdsourcing), to manage the privacy and quality of crowdsourcing projects. iFedCrowd allows participants to locally process sensitive data and only upload encrypted training models, and then aggregates the model parameters to build a shared server model to protect data privacy. To motivate workers to build a high-quality global model in an efficient way, we introduce an incentive mechanism that encourages workers to constantly collect fresh data to train accurate client models and boosts the global model training. We model the incentive-based interaction between the crowdsourcing platform and participating workers as a Stackelberg game, in which each side maximizes its own profit. We derive the Nash Equilibrium of the game to find the optimal solutions for the two sides. Experimental results confirm that iFedCrowd can complete secure crowdsourcing projects with high quality and efficiency. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2211.14439v2-abstract-full').style.display = 'none'; document.getElementById('2211.14439v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 February, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 28 November, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2022. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023). This new version of the paper includes all technical appendices and real experiments</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2211.13909</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Human-Computer Interaction">cs.HC</span> </div> </div> <p class="title is-5 mathjax"> The Magic of Slow-to-Fast and Constant: Evaluating Time Perception of Progress Bars by Bayesian Model </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Q">Qihan Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xinyue Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Rau%2C+P+P">Pei-Luen Patrick Rau</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2211.13909v1-abstract-short" style="display: inline;"> Objective: We aimed to use an adaptive psychophysics method, which is a Bayesian model, to measure 
users&#39; time perception of various progress bars quantitatively. Background: A progress bar informs users about the status of ongoing processes. Progress bars frequently display nonuniform speed patterns, such as acceleration and deceleration. However, which progress bar is perceived as faster remains unclear.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2211.13909v1-abstract-full').style.display = 'inline'; document.getElementById('2211.13909v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2211.13909v1-abstract-full" style="display: none;"> Objective: We aimed to use an adaptive psychophysics method, which is a Bayesian model, to measure users&#39; time perception of various progress bars quantitatively. Background: A progress bar informs users about the status of ongoing processes. Progress bars frequently display nonuniform speed patterns, such as acceleration and deceleration. However, which progress bar is perceived as faster remains unclear. Methods: We measured the point of subjective equality (PSE) of the constant progress bar toward four different 5-second progress bars with a non-constant speed. To measure the PSE, in each trial, a constant progress bar and a non-constant progress bar were presented to participants. Participants needed to judge which one was shorter. Based on their choice, the model generated the duration of the constant progress bar in the next trial. After 40 trials for each non-constant progress bar, the PSE was calculated by the model. Eye tracking was recorded during the experiment. Results: Our results show that the constant progress bar and speed-up progress bar are perceived to be faster. The anchoring effect fits the results of our study, indicating that the final part of the progress bar is more important for time perception. 
Moreover, the eye-tracking results indicate that perceiving a progress bar as slower is related to an overload of cognitive resources. Conclusion: The constant progress bar and speed-up progress bar are perceived as the quickest. Application: The results suggest that UX design can use constant or speed-up progress bars to improve the user experience of waiting. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2211.13909v1-abstract-full').style.display = 'none'; document.getElementById('2211.13909v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 November, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2022. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2210.17291</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> </div> </div> <p class="title is-5 mathjax"> SIX-Trust for 6G: Towards a Secure and Trustworthy 6G Network </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yiying Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xin Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+T">Tieyan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Haiguang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chu%2C+C">Cheng-Kang Chu</a>, <a href="/search/cs?searchtype=author&amp;query=Lei%2C+Z">Zhongding Lei</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2210.17291v1-abstract-short" 
style="display: inline;"> Recent years have witnessed a digital explosion with the deployment of 5G and proliferation of 5G-enabled innovations. Compared with 5G, 6G is envisioned to achieve much higher performance in terms of latency, data rate, connectivity, energy efficiency, coverage and mobility. To fulfil these expectations, 6G will experience a number of paradigm shifts, such as exploiting new spectrum, applying ubi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2210.17291v1-abstract-full').style.display = 'inline'; document.getElementById('2210.17291v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2210.17291v1-abstract-full" style="display: none;"> Recent years have witnessed a digital explosion with the deployment of 5G and proliferation of 5G-enabled innovations. Compared with 5G, 6G is envisioned to achieve much higher performance in terms of latency, data rate, connectivity, energy efficiency, coverage and mobility. To fulfil these expectations, 6G will experience a number of paradigm shifts, such as exploiting new spectrum, applying ubiquitous ML/AI technologies and building a space-air-ground-sea integrated network. However, these paradigm shifts may lead to numerous new security and privacy issues, which traditional security measures may not be able to deal with. To tackle these issues and build a trustworthy 6G network, we introduce a novel trust framework named SIX-Trust, which is composed of 3 layers: sustainable trust (S-Trust), infrastructure trust (I-Trust) and xenogenesis trust (X-Trust). Each layer plays a different role, and the importance of each layer varies for different application scenarios of 6G. For each layer, we briefly introduce its related enabling technologies, and demonstrate how these technologies can be applied to enhance trust and security of the 6G network. 
In general, SIX-Trust provides a holistic framework for defining and modeling trust of 6G, which can facilitate establishing a trustworthy 6G network. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2210.17291v1-abstract-full').style.display = 'none'; document.getElementById('2210.17291v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 31 October, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2022. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">7 pages, 3 figures, under review</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2210.16504</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> A pruning method based on the dissimilarity of angle among channels and filters </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yao%2C+J">Jiayi Yao</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+P">Ping Li</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+X">Xiatao Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuzhe Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2210.16504v1-abstract-short" style="display: inline;"> Convolutional 
Neural Network (CNN) is more and more widely used in various fields, and its computation and memory demands are also increasing significantly. To make it applicable under constrained conditions such as embedded applications, network compression has emerged. Among compression approaches, researchers pay particular attention to network pruning. In this paper, we encode the convolution network to obtain the simi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2210.16504v1-abstract-full').style.display = 'inline'; document.getElementById('2210.16504v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2210.16504v1-abstract-full" style="display: none;"> Convolutional Neural Network (CNN) is more and more widely used in various fields, and its computation and memory demands are also increasing significantly. To make it applicable under constrained conditions such as embedded applications, network compression has emerged. Among compression approaches, researchers pay particular attention to network pruning. In this paper, we encode the convolution network to obtain the similarity of different encoding nodes, and evaluate the connectivity-power among convolutional kernels on the basis of the similarity. We then impose different levels of penalty according to the connectivity-power. Meanwhile, we propose Channel Pruning based on the Dissimilarity of Angle (DACP). First, we train a sparse model by GL penalty, and impose an angle dissimilarity constraint on the channels and filters of the convolutional network to obtain a sparser structure. Finally, the effectiveness of our method is demonstrated in the experiments. On CIFAR-10, we reduce FLOPs by 66.86% on VGG-16 with 93.31% accuracy after pruning, where FLOPs denotes the number of floating-point operations of the model. Moreover, on ResNet-32, we reduce FLOPs by 58.46% with an accuracy of 91.76% after pruning. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2210.16504v1-abstract-full').style.display = 'none'; document.getElementById('2210.16504v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 October, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2022. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by ICTAI 2022</span> </p> </li> </ol> <nav class="pagination is-small is-centered breathe-horizontal" role="navigation" aria-label="pagination"> <a href="" class="pagination-previous is-invisible">Previous </a> <a 