start="1"> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.18266</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> </div> </div> <p class="title is-5 mathjax"> Wearable intelligent throat enables natural speech in stroke patients with dysarthria </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Tang%2C+C">Chenyu Tang</a>, <a href="/search/eess?searchtype=author&query=Gao%2C+S">Shuo Gao</a>, <a href="/search/eess?searchtype=author&query=Li%2C+C">Cong Li</a>, <a href="/search/eess?searchtype=author&query=Yi%2C+W">Wentian Yi</a>, <a href="/search/eess?searchtype=author&query=Jin%2C+Y">Yuxuan Jin</a>, <a href="/search/eess?searchtype=author&query=Zhai%2C+X">Xiaoxue Zhai</a>, <a href="/search/eess?searchtype=author&query=Lei%2C+S">Sixuan Lei</a>, <a href="/search/eess?searchtype=author&query=Meng%2C+H">Hongbei Meng</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+Z">Zibo Zhang</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+M">Muzi Xu</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+S">Shengbo Wang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xuhang Chen</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+C">Chenxi Wang</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+H">Hongyun Yang</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+N">Ningli Wang</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+W">Wenyu Wang</a>, 
<a href="/search/eess?searchtype=author&query=Cao%2C+J">Jin Cao</a>, <a href="/search/eess?searchtype=author&query=Feng%2C+X">Xiaodong Feng</a>, <a href="/search/eess?searchtype=author&query=Smielewski%2C+P">Peter Smielewski</a>, <a href="/search/eess?searchtype=author&query=Pan%2C+Y">Yu Pan</a>, <a href="/search/eess?searchtype=author&query=Song%2C+W">Wenhui Song</a>, <a href="/search/eess?searchtype=author&query=Birchall%2C+M">Martin Birchall</a>, <a href="/search/eess?searchtype=author&query=Occhipint%2C+L+G">Luigi G. Occhipint</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.18266v1-abstract-short" style="display: inline;"> Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI-driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to ena… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.18266v1-abstract-full').style.display = 'inline'; document.getElementById('2411.18266v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.18266v1-abstract-full" style="display: none;"> Wearable silent speech systems hold significant potential for restoring communication in patients with speech impairments. However, seamless, coherent speech remains elusive, and clinical efficacy is still unproven. Here, we present an AI-driven intelligent throat (IT) system that integrates throat muscle vibrations and carotid pulse signal sensors with large language model (LLM) processing to enable fluent, emotionally expressive communication. 
The system utilizes ultrasensitive textile strain sensors to capture high-quality signals from the neck area and supports token-level processing for real-time, continuous speech decoding, enabling seamless, delay-free communication. In tests with five stroke patients with dysarthria, IT's LLM agents intelligently corrected token errors and enriched sentence-level emotional and logical coherence, achieving low error rates (4.2% word error rate, 2.9% sentence error rate) and a 55% increase in user satisfaction. This work establishes a portable, intuitive communication platform for patients with dysarthria with the potential to be applied broadly across different neurological conditions and in multi-language support systems. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.18266v1-abstract-full').style.display = 'none'; document.getElementById('2411.18266v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">5 figures, 45 references</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.18138</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Yu%2C+W">Wenyi Yu</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+S">Siyin Wang</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+X">Xiaoyu Yang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xianzhao Chen</a>, <a href="/search/eess?searchtype=author&query=Tian%2C+X">Xiaohai Tian</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+J">Jun Zhang</a>, <a href="/search/eess?searchtype=author&query=Sun%2C+G">Guangzhi Sun</a>, <a href="/search/eess?searchtype=author&query=Lu%2C+L">Lu Lu</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+Y">Yuxuan Wang</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+C">Chao Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.18138v1-abstract-short" style="display: inline;"> 
Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.18138v1-abstract-full').style.display = 'inline'; document.getElementById('2411.18138v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.18138v1-abstract-full" style="display: none;"> Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a “thinking” mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. 
Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.18138v1-abstract-full').style.display = 'none'; document.getElementById('2411.18138v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Technical report</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.17100</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Yang%2C+Y">Yifan Yang</a>, <a href="/search/eess?searchtype=author&query=Zhuo%2C+J">Jianheng Zhuo</a>, <a href="/search/eess?searchtype=author&query=Jin%2C+Z">Zengrui Jin</a>, <a href="/search/eess?searchtype=author&query=Ma%2C+Z">Ziyang Ma</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+X">Xiaoyu Yang</a>, <a href="/search/eess?searchtype=author&query=Yao%2C+Z">Zengwei Yao</a>, <a 
href="/search/eess?searchtype=author&query=Guo%2C+L">Liyong Guo</a>, <a href="/search/eess?searchtype=author&query=Kang%2C+W">Wei Kang</a>, <a href="/search/eess?searchtype=author&query=Kuang%2C+F">Fangjun Kuang</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+L">Long Lin</a>, <a href="/search/eess?searchtype=author&query=Povey%2C+D">Daniel Povey</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.17100v1-abstract-short" style="display: inline;"> Self-supervised learning (SSL) has achieved great success in speech-related tasks, driven by advancements in speech encoder architectures and the expansion of datasets. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing withi… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.17100v1-abstract-full').style.display = 'inline'; document.getElementById('2411.17100v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.17100v1-abstract-full" style="display: none;"> Self-supervised learning (SSL) has achieved great success in speech-related tasks, driven by advancements in speech encoder architectures and the expansion of datasets. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR), remain unexplored in SSL. Concurrently, inefficiencies in data processing within existing SSL training frameworks, such as fairseq, pose challenges in managing the growing volumes of training data. 
To address these issues, we propose k2SSL, an open-source framework that offers faster, more memory-efficient, and better-performing self-supervised speech representation learning, with a focus on downstream ASR tasks. The optimized HuBERT and proposed Zipformer-based SSL systems exhibit substantial reductions in both training time and memory usage during SSL training. Experiments on LibriSpeech and Libri-Light demonstrate that Zipformer-based SSL systems significantly outperform comparable HuBERT and WavLM systems, achieving a relative WER reduction on dev-other/test-other of up to 34.8%/32.4% compared to HuBERT Base after supervised fine-tuning, along with a 3.5x pre-training speedup in total GPU hours. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.17100v1-abstract-full').style.display = 'none'; document.getElementById('2411.17100v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Submitted to ICASSP 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.15209</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Quantized symbolic time series approximation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Carson%2C+E">Erin Carson</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xinye Chen</a>, <a href="/search/eess?searchtype=author&query=Kang%2C+C">Cheng Kang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.15209v1-abstract-short" style="display: inline;"> Time series are ubiquitous in numerous science and engineering domains, e.g., signal processing, bioinformatics, and astronomy. Previous work has verified the efficacy of symbolic time series representation in a variety of engineering applications due to its storage efficiency and numerosity reduction. 
The most recent symbolic aggregate approximation technique, ABBA, has been shown to preserve ess… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15209v1-abstract-full').style.display = 'inline'; document.getElementById('2411.15209v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.15209v1-abstract-full" style="display: none;"> Time series are ubiquitous in numerous science and engineering domains, e.g., signal processing, bioinformatics, and astronomy. Previous work has verified the efficacy of symbolic time series representation in a variety of engineering applications due to its storage efficiency and numerosity reduction. The most recent symbolic aggregate approximation technique, ABBA, has been shown to preserve essential shape information of time series and improve downstream applications, e.g., neural network inference regarding prediction and anomaly detection in time series. Motivated by the emergence of high-performance hardware which enables efficient computation for low bit-width representations, we present a new quantization-based ABBA symbolic approximation technique, QABBA, which exhibits improved storage efficiency while retaining the original speed and accuracy of symbolic reconstruction. We prove an upper bound for the error arising from quantization and discuss how the number of bits should be chosen to balance this with other errors. An application of QABBA with large language models (LLMs) for time series regression is also presented, and its utility is investigated. By representing the symbolic chain of patterns on time series, QABBA not only avoids training embeddings from scratch, but also achieves a new state of the art on the Monash regression dataset. The symbolic approximation to the time series offers a more efficient way to fine-tune LLMs on the time series regression task, which spans various application domains. 
We further present a set of extensive experiments performed across various well-established datasets to demonstrate the advantages of the QABBA method for symbolic approximation. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.15209v1-abstract-full').style.display = 'none'; document.getElementById('2411.15209v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.14837</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> Fast High-Quality Enhanced Imaging Algorithm for Layered Dielectric Targets Based on MMW MIMO-SAR System </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xu Chen</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+G">Guangsheng Yu</a>, <a href="/search/eess?searchtype=author&query=Yuan%2C+Z">Zhian Yuan</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+H">Hao Wu</a>, <a href="/search/eess?searchtype=author&query=Jiang%2C+Y">Yilin Jiang</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+Y">Ying Wang</a>, <a href="/search/eess?searchtype=author&query=Deng%2C+B">Bin Deng</a>, <a href="/search/eess?searchtype=author&query=Guo%2C+L">Limin Guo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.14837v1-abstract-short" style="display: inline;"> 
The millimeter-wave (MMW) multiple-input multiple-output synthetic aperture radar (MIMO-SAR) system is a technology that can achieve high resolution, high frame rate, and all-weather imaging and has received extensive attention in the non-destructive testing and internal imaging applications of layered dielectric targets. However, the non-ideal scattering effect caused by dielectric materials can sign… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14837v1-abstract-full').style.display = 'inline'; document.getElementById('2411.14837v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.14837v1-abstract-full" style="display: none;"> The millimeter-wave (MMW) multiple-input multiple-output synthetic aperture radar (MIMO-SAR) system is a technology that can achieve high resolution, high frame rate, and all-weather imaging and has received extensive attention in the non-destructive testing and internal imaging applications of layered dielectric targets. However, the non-ideal scattering effect caused by dielectric materials can significantly deteriorate the imaging quality when using the existing MIMO-SAR fast algorithms. This paper proposes a rapid, high-quality dielectric target-enhanced imaging algorithm for a new universal non-uniform MIMO-SAR system. The algorithm builds on the existing non-uniform MIMO-SAR dielectric target frequency-domain algorithm (DT-FDA) by constructing a forward sensing operator and incorporating it into the alternating direction method of multipliers (ADMM) framework. This approach avoids large matrix operations while maintaining computational efficiency. By integrating an optimal regularization parameter search, the algorithm enhances the image reconstruction quality of dielectric internal structures or defects. 
Experimental results show the proposed algorithm outperforms IBP and DT-FDA, achieving better focusing, sidelobe suppression, and 3D imaging accuracy. It yields the lowest image entropy (8.864) and significantly improves efficiency (imaging time: 15.29 s vs. 23295.3 s for IBP). <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.14837v1-abstract-full').style.display = 'none'; document.getElementById('2411.14837v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">8 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.13288</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> EEG Signal Denoising Using pix2pix GAN: Enhancing Neurological Data Analysis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Wang%2C+H">Haoyi Wang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xufang Chen</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+Y">Yue Yang</a>, <a href="/search/eess?searchtype=author&query=Zhou%2C+K">Kewei Zhou</a>, <a href="/search/eess?searchtype=author&query=Lv%2C+M">Meining Lv</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+D">Dongrui Wang</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+W">Wenjie Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.13288v1-abstract-short" style="display: inline;"> Electroencephalography (EEG) is essential in neuroscience and clinical practice, yet it suffers from physiological artifacts, particularly electromyography (EMG), which distort signals. We propose a deep learning model using pix2pixGAN to remove such noise and generate reliable EEG signals. Leveraging the EEGdenoiseNet dataset, we created synthetic datasets with controlled EMG noise levels for mod… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13288v1-abstract-full').style.display = 'inline'; document.getElementById('2411.13288v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.13288v1-abstract-full" style="display: none;"> Electroencephalography (EEG) is essential in neuroscience and clinical practice, yet it suffers from physiological artifacts, particularly electromyography (EMG), which distort signals. We propose a deep learning model using pix2pixGAN to remove such noise and generate reliable EEG signals. Leveraging the EEGdenoiseNet dataset, we created synthetic datasets with controlled EMG noise levels for model training and testing across a signal-to-noise ratio (SNR) range of -7 to 2 dB. We evaluated the model with RRMSE and Pearson's CC in both the time and frequency domains and compared it with other models. The pix2pixGAN model excelled, especially under high noise conditions, achieving significantly lower RRMSE and higher CC values. This demonstrates the model's superior accuracy and stability in purifying EEG signals, offering a robust solution for EEG analysis challenges and advancing clinical and neuroscience applications. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.13288v1-abstract-full').style.display = 'none'; document.getElementById('2411.13288v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">17 pages, 6 figures</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">MSC Class:</span> I.4.9 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12682</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Optimization and Control">math.OC</span> </div> </div> <p class="title is-5 mathjax"> Distributed Coordination of Grid-Forming and Grid-Following Inverter-Based Resources for Optimal Frequency Control in Power Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Wang%2C+X">Xiaoyang Wang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xin Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12682v1-abstract-short" style="display: inline;"> With the fast-growing penetration of power inverter-interfaced renewable generation, power systems face significant challenges in maintaining power balance and the nominal 
frequency. This paper studies the grid-level coordinated control of a mix of grid-forming (GFM) and grid-following (GFL) inverter-based resources (IBRs) for power system frequency regulation at scale. Specifically, a fully distr… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12682v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12682v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12682v1-abstract-full" style="display: none;"> With the fast-growing penetration of power inverter-interfaced renewable generation, power systems face significant challenges in maintaining power balance and the nominal frequency. This paper studies the grid-level coordinated control of a mix of grid-forming (GFM) and grid-following (GFL) inverter-based resources (IBRs) for power system frequency regulation at scale. Specifically, a fully distributed optimal frequency control algorithm is proposed by leveraging the projected primal-dual gradient method and the structure of the physical system dynamics. This algorithm 1) restores the nominal frequency, 2) minimizes the total control cost, 3) respects the IBR power limits and the line thermal constraints, and 4) is implemented in a distributed fashion that only needs local measurement and local communication. The effectiveness and optimality of the proposed algorithm are demonstrated through high-fidelity electromagnetic transient (EMT) simulations on the IEEE 39-bus system. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12682v1-abstract-full').style.display = 'none'; document.getElementById('2411.12682v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12547</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> S3TU-Net: Structured Convolution and Superpixel Transformer for Lung Nodule Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Wu%2C+Y">Yuke Wu</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+X">Xiang Liu</a>, <a href="/search/eess?searchtype=author&query=Shi%2C+Y">Yunyu Shi</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xinyi Chen</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+Z">Zhenglei Wang</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+Y">YuQing Xu</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+S+H">Shuo Hong Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12547v1-abstract-short" style="display: inline;"> The irregular and challenging 
characteristics of lung adenocarcinoma nodules in computed tomography (CT) images complicate staging diagnosis, making accurate segmentation critical for clinicians to extract detailed lesion information. In this study, we propose a segmentation model, S3TU-Net, which integrates multi-dimensional spatial connectors and a superpixel-based visual transformer. S3TU-Net i… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12547v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12547v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12547v1-abstract-full" style="display: none;"> The irregular and challenging characteristics of lung adenocarcinoma nodules in computed tomography (CT) images complicate staging diagnosis, making accurate segmentation critical for clinicians to extract detailed lesion information. In this study, we propose a segmentation model, S3TU-Net, which integrates multi-dimensional spatial connectors and a superpixel-based visual transformer. S3TU-Net is built on a multi-view CNN-Transformer hybrid architecture, incorporating superpixel algorithms, structured weighting, and spatial shifting techniques to achieve superior segmentation performance. The model leverages structured convolution blocks (DWF-Conv/D2BR-Conv) to extract multi-scale local features while mitigating overfitting. To enhance multi-scale feature fusion, we introduce the S2-MLP Link, integrating spatial shifting and attention mechanisms at the skip connections. Additionally, the residual-based superpixel visual transformer (RM-SViT) effectively merges global and local features by employing sparse correlation learning and multi-branch attention to capture long-range dependencies, with residual connections enhancing stability and computational efficiency. 
Experimental results on the LIDC-IDRI dataset demonstrate that S3TU-Net achieves a DSC, precision, and IoU of 89.04%, 90.73%, and 90.70%, respectively. Compared to recent methods, S3TU-Net improves DSC by 4.52% and sensitivity by 3.16%, with other metrics showing an approximate 2% increase. In addition to comparison and ablation studies, we validated the generalization ability of our model on the EPDB private dataset, achieving a DSC of 86.40%. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12547v1-abstract-full').style.display = 'none'; document.getElementById('2411.12547v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.11980</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> </div> </div> <p class="title is-5 mathjax"> Transmission Line Outage Probability Prediction Under Extreme Events Using Peter-Clark Bayesian Structural Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaolin Chen</a>, <a href="/search/eess?searchtype=author&query=Huang%2C+Q">Qiuhua Huang</a>, <a href="/search/eess?searchtype=author&query=Zhou%2C+Y">Yuqi Zhou</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.11980v1-abstract-short" 
style="display: inline;"> Recent years have seen a notable increase in the frequency and intensity of extreme weather events. With a rising number of power outages caused by these events, accurate prediction of power line outages is essential for safe and reliable operation of power grids. The Bayesian network is a probabilistic model that is very effective for predicting line outages under weather-related uncertainties. H… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11980v1-abstract-full').style.display = 'inline'; document.getElementById('2411.11980v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.11980v1-abstract-full" style="display: none;"> Recent years have seen a notable increase in the frequency and intensity of extreme weather events. With a rising number of power outages caused by these events, accurate prediction of power line outages is essential for safe and reliable operation of power grids. The Bayesian network is a probabilistic model that is very effective for predicting line outages under weather-related uncertainties. However, most existing studies in this area offer general risk assessments, but fall short of providing specific outage probabilities. In this work, we introduce a novel approach for predicting transmission line outage probabilities using a Bayesian network combined with Peter-Clark (PC) structural learning. Our approach not only enables precise outage probability calculations, but also demonstrates better scalability and robust performance, even with limited data. Case studies using data from BPA and NOAA show the effectiveness of this approach, while comparisons with several existing methods further highlight its advantages. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11980v1-abstract-full').style.display = 'none'; document.getElementById('2411.11980v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.11879</a> <span> [<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Human-Computer Interaction">cs.HC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1016/j.knosys.2024.112668 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> CSP-Net: Common Spatial Pattern Empowered Neural Networks for EEG-Based Motor Imagery Classification </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Jiang%2C+X">Xue Jiang</a>, <a href="/search/eess?searchtype=author&query=Meng%2C+L">Lubin Meng</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xinru Chen</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+Y">Yifan Xu</a>, <a 
href="/search/eess?searchtype=author&query=Wu%2C+D">Dongrui Wu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.11879v1-abstract-short" style="display: inline;"> Electroencephalogram-based motor imagery (MI) classification is an important paradigm of non-invasive brain-computer interfaces. Common spatial pattern (CSP), which exploits different energy distributions on the scalp while performing different MI tasks, is very popular in MI classification. Convolutional neural networks (CNNs) have also achieved great success, due to their powerful learning capab… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11879v1-abstract-full').style.display = 'inline'; document.getElementById('2411.11879v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.11879v1-abstract-full" style="display: none;"> Electroencephalogram-based motor imagery (MI) classification is an important paradigm of non-invasive brain-computer interfaces. Common spatial pattern (CSP), which exploits different energy distributions on the scalp while performing different MI tasks, is very popular in MI classification. Convolutional neural networks (CNNs) have also achieved great success, due to their powerful learning capabilities. This paper proposes two CSP-empowered neural networks (CSP-Nets), which integrate knowledge-driven CSP filters with data-driven CNNs to enhance the performance in MI classification. CSP-Net-1 directly adds a CSP layer before a CNN to improve the input discriminability. CSP-Net-2 replaces a convolutional layer in CNN with a CSP layer. The CSP layer parameters in both CSP-Nets are initialized with CSP filters designed from the training data. During training, they can either be kept fixed or optimized using gradient descent. 
Experiments on four public MI datasets demonstrated that the two CSP-Nets consistently improved over their CNN backbones, in both within-subject and cross-subject classifications. They are particularly useful when the number of training samples is very small. Our work demonstrates the advantage of integrating knowledge-driven traditional machine learning with data-driven deep learning in EEG-based brain-computer interfaces. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11879v1-abstract-full').style.display = 'none'; document.getElementById('2411.11879v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Knowledge Based Systems, 305:112668, 2024 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.11030</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> IREE Oriented Active RIS-Assisted Green communication System with Outdated CSI </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Cao%2C+K">Kai Cao</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+T">Tao Yu</a>, <a href="/search/eess?searchtype=author&query=Li%2C+J">Jihong Li</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaojing Chen</a>, <a href="/search/eess?searchtype=author&query=Sun%2C+Y">Yanzan Sun</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+Q">Qingqing Wu</a>, <a 
href="/search/eess?searchtype=author&query=Chen%2C+W">Wen Chen</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+S">Shunqing Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.11030v1-abstract-short" style="display: inline;"> The rapid evolution of communication technologies has spurred a growing demand for energy-efficient network architectures and performance metrics. Active Reconfigurable Intelligent Surfaces (RIS) are emerging as a key component in green network architectures. Compared to passive RIS, active RIS are equipped with amplifiers on each reflecting element, allowing them to simultaneously reflect and amp… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11030v1-abstract-full').style.display = 'inline'; document.getElementById('2411.11030v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.11030v1-abstract-full" style="display: none;"> The rapid evolution of communication technologies has spurred a growing demand for energy-efficient network architectures and performance metrics. Active Reconfigurable Intelligent Surfaces (RIS) are emerging as a key component in green network architectures. Compared to passive RIS, active RIS are equipped with amplifiers on each reflecting element, allowing them to simultaneously reflect and amplify signals, thereby overcoming the double multiplicative fading in the phase response, and improving both system coverage and performance. Additionally, the Integrated Relative Energy Efficiency (IREE) metric, as introduced in [1], addresses the dynamic variations in traffic and capacity over time and space, enabling more energy-efficient wireless systems. 
Building on these advancements, this paper investigates the problem of maximizing IREE in active RIS-assisted green communication systems. However, acquiring perfect Channel State Information (CSI) in practical systems poses significant challenges and costs. To address this, we derive the average achievable rate based on outdated CSI and formulate the corresponding IREE maximization problem, which is solved by jointly optimizing beamforming at both the base station and RIS. Given the non-convex nature of the problem, we propose an Alternating Optimization Successive Approximation (AOSO) algorithm. By applying quadratic transform and relaxation techniques, we simplify the original problem and alternately optimize the beamforming matrices at the base station and RIS. Furthermore, to handle the discrete constraints of the RIS reflection coefficients, we develop a successive approximation method. Experimental results validate our theoretical analysis of the algorithm's convergence, demonstrating the effectiveness of the proposed algorithm and highlighting the superiority of IREE in enhancing the performance of green communication networks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.11030v1-abstract-full').style.display = 'none'; document.getElementById('2411.11030v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.10004</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> EyeDiff: text-to-image diffusion model improves rare eye disease diagnosis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+R">Ruoyu Chen</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+W">Weiyi Zhang</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+B">Bowen Liu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaolan Chen</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+P">Pusheng Xu</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+S">Shunming Liu</a>, <a href="/search/eess?searchtype=author&query=He%2C+M">Mingguang He</a>, <a href="/search/eess?searchtype=author&query=Shi%2C+D">Danli Shi</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.10004v1-abstract-short" style="display: inline;"> The rising prevalence of vision-threatening retinal diseases poses a significant burden on the global healthcare systems. Deep learning (DL) offers a promising solution for automatic disease screening but demands substantial data. Collecting and labeling large volumes of ophthalmic images across various modalities encounters several real-world challenges, especially for rare diseases. 
Here, we int… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10004v1-abstract-full').style.display = 'inline'; document.getElementById('2411.10004v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.10004v1-abstract-full" style="display: none;"> The rising prevalence of vision-threatening retinal diseases poses a significant burden on the global healthcare systems. Deep learning (DL) offers a promising solution for automatic disease screening but demands substantial data. Collecting and labeling large volumes of ophthalmic images across various modalities encounters several real-world challenges, especially for rare diseases. Here, we introduce EyeDiff, a text-to-image model designed to generate multimodal ophthalmic images from natural language prompts and evaluate its applicability in diagnosing common and rare diseases. EyeDiff is trained on eight large-scale datasets using the advanced latent diffusion model, covering 14 ophthalmic image modalities and over 80 ocular diseases, and is adapted to ten multi-country external datasets. The generated images accurately capture essential lesional characteristics, achieving high alignment with text prompts as evaluated by objective metrics and human experts. Furthermore, integrating generated images significantly enhances the accuracy of detecting minority classes and rare eye diseases, surpassing traditional oversampling methods in addressing data imbalance. EyeDiff effectively tackles the issue of data imbalance and insufficiency typically encountered in rare diseases and addresses the challenges of collecting large-scale annotated images, offering a transformative solution to enhance the development of expert-level disease diagnosis models in the ophthalmic field. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.10004v1-abstract-full').style.display = 'none'; document.getElementById('2411.10004v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">28 pages, 2 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.09956</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> </div> </div> <p class="title is-5 mathjax"> A Secure Estimator with Gaussian Bernoulli Mixture Model </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xingzhou Chen</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+N">Nachuan Yang</a>, <a href="/search/eess?searchtype=author&query=Duan%2C+P">Peihu Duan</a>, <a href="/search/eess?searchtype=author&query=Li%2C+S">Shilei Li</a>, <a href="/search/eess?searchtype=author&query=Shi%2C+L">Ling Shi</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.09956v1-abstract-short" style="display: inline;"> The implementation of cyber-physical systems in real-world applications is challenged by safety requirements in the presence of sensor threats. 
Most cyber-physical systems, in particular the vulnerable multi-sensor systems, struggle to detect attacks in observation signals. In this paper, we tackle this issue by proposing a Gaussian-Bernoulli Secure (GBS) estimator, which effectively transforms… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09956v1-abstract-full').style.display = 'inline'; document.getElementById('2411.09956v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.09956v1-abstract-full" style="display: none;"> The implementation of cyber-physical systems in real-world applications is challenged by safety requirements in the presence of sensor threats. Most cyber-physical systems, in particular the vulnerable multi-sensor systems, struggle to detect attacks in observation signals. In this paper, we tackle this issue by proposing a Gaussian-Bernoulli Secure (GBS) estimator, which effectively transforms the assessment of sensor status into an optimal estimation problem concerning the system state and observation indicators. It encompasses two theoretical sub-problems: sequential state estimation with partial observations and estimation updates with disordered new observations. Within the Kalman filter framework, we derive closed-form solutions for these two issues. However, due to their computational inefficiency, we propose an iterative approach employing proximal gradient descent to accelerate the estimation update. We conduct comprehensive experiments from three perspectives: computational efficiency, detection and estimation performance, and characterization of observation error. Our GBS estimator shows improvements over other methods. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09956v1-abstract-full').style.display = 'none'; document.getElementById('2411.09956v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.08742</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Wang%2C+D">Dingdong Wang</a>, <a href="/search/eess?searchtype=author&query=Cui%2C+M">Mingyu Cui</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+D">Dongchao Yang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xueyuan Chen</a>, <a href="/search/eess?searchtype=author&query=Meng%2C+H">Helen Meng</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.08742v1-abstract-short" style="display: inline;"> With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate with text-based tokens 
seamlessly. Compared to most studies that focus on continuous speech features, although discrete-token based LLMs have shown promising results on certain tasks, the performance gap between these two paradigms is rarely explored… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.08742v1-abstract-full').style.display = 'inline'; document.getElementById('2411.08742v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.08742v1-abstract-full" style="display: none;"> With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate with text-based tokens seamlessly. Compared to most studies that focus on continuous speech features, although discrete-token based LLMs have shown promising results on certain tasks, the performance gap between these two paradigms is rarely explored. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond surface-level comparison by identifying key factors behind the under-performance of discrete tokens, such as limited token granularity and inefficient information retention. To enhance the performance of discrete tokens, we explore potential aspects based on our analysis. We hope our results can offer new insights into the opportunities for advancing discrete speech tokens in Speech LLMs. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.08742v1-abstract-full').style.display = 'none'; document.getElementById('2411.08742v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">5 tables, 4 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.08570</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> Electromagnetic Modeling and Capacity Analysis of Rydberg Atom-Based MIMO System </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Yuan%2C+S+S+A">Shuai S. A. Yuan</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+X+Y+I">Xinyi Y. I. Xu</a>, <a href="/search/eess?searchtype=author&query=Yuan%2C+J">Jinpeng Yuan</a>, <a href="/search/eess?searchtype=author&query=Xie%2C+G">Guoda Xie</a>, <a href="/search/eess?searchtype=author&query=Huang%2C+C">Chongwen Huang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaoming Chen</a>, <a href="/search/eess?searchtype=author&query=Huang%2C+Z">Zhixiang Huang</a>, <a href="/search/eess?searchtype=author&query=Sha%2C+W+E+I">Wei E. I. 
Sha</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.08570v1-abstract-short" style="display: inline;"> Rydberg atom-based antennas exploit the quantum properties of highly excited Rydberg atoms, providing unique advantages over classical antennas, such as high sensitivity, broad frequency range, and compact size. Despite the increasing interest in their applications in antenna and communication engineering, two key properties, involving the lack of polarization multiplexing and isotropic reception… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.08570v1-abstract-full').style.display = 'inline'; document.getElementById('2411.08570v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.08570v1-abstract-full" style="display: none;"> Rydberg atom-based antennas exploit the quantum properties of highly excited Rydberg atoms, providing unique advantages over classical antennas, such as high sensitivity, broad frequency range, and compact size. Despite the increasing interest in their applications in antenna and communication engineering, two key properties, involving the lack of polarization multiplexing and isotropic reception without mutual coupling, remain unexplored in the analysis of Rydberg atom-based spatial multiplexing, i.e., multiple-input and multiple-output (MIMO), communications. Generally, the design considerations for any antenna, even for atomic ones, can be reduced to factors such as radiation patterns, efficiency, and polarization, allowing them to be seamlessly integrated into existing system models. In this letter, we extract the antenna properties from relevant quantum characteristics, enabling electromagnetic modeling and capacity analysis of Rydberg MIMO systems in both far-field and near-field scenarios. 
By employing a ray-based method for far-field analysis and the dyadic Green's function for near-field calculation, our results indicate that Rydberg atom-based antenna arrays offer specific advantages over classical dipole-type arrays in single-polarization MIMO communications. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.08570v1-abstract-full').style.display = 'none'; document.getElementById('2411.08570v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07442</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> </div> </div> <p class="title is-5 mathjax"> Learned Slip-Detection-Severity Framework using Tactile Deformation Field Feedback for Robotic Manipulation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Jawale%2C+N">Neel Jawale</a>, <a href="/search/eess?searchtype=author&query=Kaur%2C+N">Navneet Kaur</a>, <a href="/search/eess?searchtype=author&query=Santoso%2C+A">Amy Santoso</a>, <a href="/search/eess?searchtype=author&query=Hu%2C+X">Xiaohai Hu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xu Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07442v1-abstract-short" style="display: inline;"> Safely handling 
objects and avoiding slippage are fundamental challenges in robotic manipulation, yet traditional techniques often oversimplify the issue by treating slippage as a binary occurrence. Our research presents a framework that both identifies slip incidents and measures their severity. We introduce a set of features based on detailed vector field analysis of tactile deformation data cap… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07442v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07442v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07442v1-abstract-full" style="display: none;"> Safely handling objects and avoiding slippage are fundamental challenges in robotic manipulation, yet traditional techniques often oversimplify the issue by treating slippage as a binary occurrence. Our research presents a framework that both identifies slip incidents and measures their severity. We introduce a set of features based on detailed vector field analysis of tactile deformation data captured by the GelSight Mini sensor. Two distinct machine learning models use these features: one focuses on slip detection, and the other evaluates the slip's severity, which is the slipping velocity of the object against the sensor surface. Our slip detection model achieves an average accuracy of 92%, and the slip severity estimation model exhibits a mean absolute error (MAE) of 0.6 cm/s for unseen objects. To demonstrate the synergistic approach of this framework, we employ both models in a tactile feedback-guided vertical sliding task. Leveraging the high accuracy of slip detection, we utilize it as the foundational and corrective model and integrate the slip severity estimation into the feedback control loop to address slips without overcompensating. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07442v1-abstract-full').style.display = 'none'; document.getElementById('2411.07442v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted at IROS 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.07111</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Building a Taiwanese Mandarin Spoken Language Model: A First Attempt </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Yang%2C+C">Chih-Kai Yang</a>, <a href="/search/eess?searchtype=author&query=Fu%2C+Y">Yu-Kuan Fu</a>, <a href="/search/eess?searchtype=author&query=Li%2C+C">Chen-An Li</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+Y">Yi-Cheng Lin</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+Y">Yu-Xiang Lin</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+W">Wei-Chih Chen</a>, <a href="/search/eess?searchtype=author&query=Chung%2C+H+L">Ho Lam Chung</a>, <a href="/search/eess?searchtype=author&query=Kuan%2C+C">Chun-Yi Kuan</a>, <a 
href="/search/eess?searchtype=author&query=Huang%2C+W">Wei-Ping Huang</a>, <a href="/search/eess?searchtype=author&query=Lu%2C+K">Ke-Han Lu</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+T">Tzu-Quan Lin</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+H">Hsiu-Hsuan Wang</a>, <a href="/search/eess?searchtype=author&query=Hu%2C+E">En-Pei Hu</a>, <a href="/search/eess?searchtype=author&query=Hsu%2C+C">Chan-Jan Hsu</a>, <a href="/search/eess?searchtype=author&query=Tseng%2C+L">Liang-Hsuan Tseng</a>, <a href="/search/eess?searchtype=author&query=Chiu%2C+I">I-Hsiang Chiu</a>, <a href="/search/eess?searchtype=author&query=Sanga%2C+U">Ulin Sanga</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xuanjun Chen</a>, <a href="/search/eess?searchtype=author&query=Hsu%2C+P">Po-chun Hsu</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+S">Shu-wen Yang</a>, <a href="/search/eess?searchtype=author&query=Lee%2C+H">Hung-yi Lee</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.07111v1-abstract-short" style="display: inline;"> This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. 
Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07111v1-abstract-full').style.display = 'inline'; document.getElementById('2411.07111v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.07111v1-abstract-full" style="display: none;"> This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex capabilities allowing simultaneous speaking and listening. The paper also details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of the report can contribute to the future development of spoken LLMs in Taiwanese Mandarin. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.07111v1-abstract-full').style.display = 'none'; document.getElementById('2411.07111v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Work in progress</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.06437</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> CTC-Assisted LLM-Based Contextual ASR </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Yang%2C+G">Guanrou Yang</a>, <a href="/search/eess?searchtype=author&query=Ma%2C+Z">Ziyang Ma</a>, <a href="/search/eess?searchtype=author&query=Gao%2C+Z">Zhifu Gao</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+S">Shiliang Zhang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.06437v1-abstract-short" style="display: inline;"> Contextual ASR or hotword customization holds substantial practical value. Despite the impressive performance of current end-to-end (E2E) automatic speech recognition (ASR) systems, they often face challenges in accurately recognizing rare words. 
Typical E2E contextual ASR models commonly feature complex architectures and decoding mechanisms, limited in performance and susceptible to interference… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06437v1-abstract-full').style.display = 'inline'; document.getElementById('2411.06437v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.06437v1-abstract-full" style="display: none;"> Contextual ASR or hotword customization holds substantial practical value. Despite the impressive performance of current end-to-end (E2E) automatic speech recognition (ASR) systems, they often face challenges in accurately recognizing rare words. Typical E2E contextual ASR models commonly feature complex architectures and decoding mechanisms, limited in performance and susceptible to interference from distractor words. With large language model (LLM)-based ASR models emerging as the new mainstream, we propose a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm. By using coarse CTC decoding results to filter potential relevant hotwords and incorporating them into LLM prompt input, our model attains WER/B-WER of 1.27%/3.67% and 2.72%/8.02% on the Librispeech test-clean and test-other sets targeting on recognizing rare long-tail words, demonstrating significant improvements compared to the baseline LLM-based ASR model, and substantially surpassing other related work. More remarkably, with the help of the large language model and proposed filtering algorithm, our contextual ASR model still performs well with 2000 biasing words. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.06437v1-abstract-full').style.display = 'none'; document.getElementById('2411.06437v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">SLT 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05361</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Huang%2C+C">Chien-yu Huang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+W">Wei-Chih Chen</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+S">Shu-wen Yang</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+A+T">Andy T. 
Liu</a>, <a href="/search/eess?searchtype=author&query=Li%2C+C">Chen-An Li</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+Y">Yu-Xiang Lin</a>, <a href="/search/eess?searchtype=author&query=Tseng%2C+W">Wei-Cheng Tseng</a>, <a href="/search/eess?searchtype=author&query=Diwan%2C+A">Anuj Diwan</a>, <a href="/search/eess?searchtype=author&query=Shih%2C+Y">Yi-Jen Shih</a>, <a href="/search/eess?searchtype=author&query=Shi%2C+J">Jiatong Shi</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+W">William Chen</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xuanjun Chen</a>, <a href="/search/eess?searchtype=author&query=Hsiao%2C+C">Chi-Yuan Hsiao</a>, <a href="/search/eess?searchtype=author&query=Peng%2C+P">Puyuan Peng</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+S">Shih-Heng Wang</a>, <a href="/search/eess?searchtype=author&query=Kuan%2C+C">Chun-Yi Kuan</a>, <a href="/search/eess?searchtype=author&query=Lu%2C+K">Ke-Han Lu</a>, <a href="/search/eess?searchtype=author&query=Chang%2C+K">Kai-Wei Chang</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+C">Chih-Kai Yang</a>, <a href="/search/eess?searchtype=author&query=Ritter-Gutierrez%2C+F">Fabian Ritter-Gutierrez</a>, <a href="/search/eess?searchtype=author&query=Chuang%2C+M+T">Ming To Chuang</a>, <a href="/search/eess?searchtype=author&query=Huang%2C+K">Kuan-Po Huang</a>, <a href="/search/eess?searchtype=author&query=Arora%2C+S">Siddhant Arora</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+Y">You-Kuan Lin</a>, <a href="/search/eess?searchtype=author&query=Yeo%2C+E">Eunjung Yeo</a> , et al. 
(53 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05361v1-abstract-short" style="display: inline;"> Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluati… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05361v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05361v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05361v1-abstract-full" style="display: none;"> Multimodal foundation models, such as Gemini and ChatGPT, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. 
While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that none of the models performed well universally. SALMONN-13B excelled in English ASR, while WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We will soon open-source all task data and the evaluation pipeline. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05361v1-abstract-full').style.display = 'none'; document.getElementById('2411.05361v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.23738</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> MLLA-UNet: Mamba-like Linear Attention in an Efficient U-Shape Model for Medical Image Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Jiang%2C+Y">Yufeng Jiang</a>, <a href="/search/eess?searchtype=author&query=Li%2C+Z">Zongxi Li</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiangyan Chen</a>, <a href="/search/eess?searchtype=author&query=Xie%2C+H">Haoran Xie</a>, <a href="/search/eess?searchtype=author&query=Cai%2C+J">Jing Cai</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.23738v1-abstract-short" style="display: inline;"> Recent advancements in medical imaging have resulted in more complex and diverse images, with challenges such as high anatomical variability, blurred tissue boundaries, low organ contrast, and noise. Traditional segmentation methods struggle to address these challenges, making deep learning approaches, particularly U-shaped architectures, increasingly prominent. 
However, the quadratic complexity o… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23738v1-abstract-full').style.display = 'inline'; document.getElementById('2410.23738v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.23738v1-abstract-full" style="display: none;"> Recent advancements in medical imaging have resulted in more complex and diverse images, with challenges such as high anatomical variability, blurred tissue boundaries, low organ contrast, and noise. Traditional segmentation methods struggle to address these challenges, making deep learning approaches, particularly U-shaped architectures, increasingly prominent. However, the quadratic complexity of standard self-attention makes Transformers computationally prohibitive for high-resolution images. To address these challenges, we propose MLLA-UNet (Mamba-Like Linear Attention UNet), a novel architecture that achieves linear computational complexity while maintaining high segmentation accuracy through its innovative combination of linear attention and Mamba-inspired adaptive mechanisms, complemented by an efficient symmetric sampling structure for enhanced feature processing. Our architecture effectively preserves essential spatial features while capturing long-range dependencies at reduced computational complexity. Additionally, we introduce a novel sampling strategy for multi-scale feature fusion. Experiments demonstrate that MLLA-UNet achieves state-of-the-art performance on six challenging datasets with 24 different segmentation tasks, including but not limited to FLARE22, AMOS CT, and ACDC, with an average DSC of 88.32%. These results underscore the superiority of MLLA-UNet over existing methods. Our contributions include the novel 2D segmentation architecture and its empirical validation. 
The code is available via <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.23738v1-abstract-full').style.display = 'none'; document.getElementById('2410.23738v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 31 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.22646</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> SleepNetZero: Zero-Burden Zero-Shot Reliable Sleep Staging With Neural Networks Based on Ballistocardiograms </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Li%2C+S">Shuzhen Li</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+Y">Yuxin Chen</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xuesong Chen</a>, <a href="/search/eess?searchtype=author&query=Gao%2C+R">Ruiyang Gao</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+Y">Yupeng Zhang</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+C">Chao Yu</a>, <a href="/search/eess?searchtype=author&query=Li%2C+Y">Yunfei Li</a>, <a href="/search/eess?searchtype=author&query=Ye%2C+Z">Ziyi Ye</a>, <a href="/search/eess?searchtype=author&query=Huang%2C+W">Weijun Huang</a>, <a href="/search/eess?searchtype=author&query=Yi%2C+H">Hongliang Yi</a>, <a href="/search/eess?searchtype=author&query=Leng%2C+Y">Yue Leng</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+Y">Yi 
Wu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.22646v1-abstract-short" style="display: inline;"> Sleep monitoring plays a crucial role in maintaining good health, with sleep staging serving as an essential metric in the monitoring process. Traditional methods, utilizing medical sensors like EEG and ECG, can be effective but often present challenges such as unnatural user experience, complex deployment, and high costs. Ballistocardiography (BCG), a type of piezoelectric sensor signal, offers a… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22646v1-abstract-full').style.display = 'inline'; document.getElementById('2410.22646v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.22646v1-abstract-full" style="display: none;"> Sleep monitoring plays a crucial role in maintaining good health, with sleep staging serving as an essential metric in the monitoring process. Traditional methods, utilizing medical sensors like EEG and ECG, can be effective but often present challenges such as unnatural user experience, complex deployment, and high costs. Ballistocardiography (BCG), a type of piezoelectric sensor signal, offers a non-invasive, user-friendly, and easily deployable alternative for long-term home monitoring. However, reliable BCG-based sleep staging is challenging due to the limited sleep monitoring data available for BCG. A restricted training dataset prevents the model from generalizing across populations. Additionally, it is difficult to keep the model robust when transferring it from other data sources to BCG. To address these issues, we introduce SleepNetZero, a zero-shot learning based approach for sleep staging. 
To tackle the generalization challenge, we propose a series of BCG feature extraction methods that align BCG components with corresponding respiratory, cardiac, and movement channels in PSG. This allows models to be trained on large-scale PSG datasets that are diverse in population. For the migration challenge, we employ data augmentation techniques, significantly enhancing generalizability. We conducted extensive training and testing on large datasets (12393 records from 9637 different subjects), achieving an accuracy of 0.803 and a Cohen's Kappa of 0.718. SleepNetZero was also deployed in a real prototype (monitoring pads) and tested in actual hospital settings (265 users), demonstrating an accuracy of 0.697 and a Cohen's Kappa of 0.589. To the best of our knowledge, this work represents the first reliable BCG-based sleep staging effort and marks a significant step towards in-home health monitoring. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22646v1-abstract-full').style.display = 'none'; document.getElementById('2410.22646v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">25 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.21658</a> <span> [<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Information Theory">cs.IT</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> Exploiting On-Orbit Characteristics for Joint Parameter and Channel Tracking in LEO Satellite Communications </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Lin%2C+C">Chenlan Lin</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaoming Chen</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+Z">Zhaoyang Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.21658v1-abstract-short" style="display: inline;"> In high-dynamic low earth orbit (LEO) satellite communication (SATCOM) systems, frequent channel state information (CSI) acquisition consumes a large number of pilots, which is intolerable in resource-limited SATCOM systems. 
To tackle this problem, we propose to track the state-dependent parameters including Doppler shift and channel angles, by exploiting the physical and approximate on-orbit mobi… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.21658v1-abstract-full').style.display = 'inline'; document.getElementById('2410.21658v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.21658v1-abstract-full" style="display: none;"> In high-dynamic low earth orbit (LEO) satellite communication (SATCOM) systems, frequent channel state information (CSI) acquisition consumes a large number of pilots, which is intolerable in resource-limited SATCOM systems. To tackle this problem, we propose to track the state-dependent parameters including Doppler shift and channel angles, by exploiting the physical and approximate on-orbit mobility characteristics for LEO satellite and ground users (GUs), respectively. As a prerequisite for tracking, we formulate the state evolution models for kinematic (state) parameters of both satellite and GUs, along with the measurement models that describe the relationship between the state-dependent parameters and states. Then the rough estimation of state-dependent parameters is initially conducted, which is used as the measurement results in the subsequent state tracking. Concurrently, the measurement error covariance is predicted based on the formulated Cram$\acute{\text{e}}$r-Rao lower bound (CRLB). Finally, with the extended Kalman filter (EKF)-based state tracking as the bridge, the Doppler shift and channel angles can be further updated and the CSI can also be acquired. Simulation results show that compared to the rough estimation methods, the proposed joint parameter and channel tracking (JPCT) algorithm performs much better in the estimation of state-dependent parameters. 
Moreover, as to the CSI acquisition, the proposed algorithm can utilize a shorter pilot sequence than benchmark methods under a given estimation accuracy. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.21658v1-abstract-full').style.display = 'none'; document.getElementById('2410.21658v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">IEEE Transactions on Wireless Communications, 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.20812</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> </div> </div> <p class="title is-5 mathjax"> Fidelity-Imposed Displacement Editing for the Learn2Reg 2024 SHG-BF Challenge </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Wang%2C+J">Jiacheng Wang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiang Chen</a>, <a href="/search/eess?searchtype=author&query=Hu%2C+R">Renjiu Hu</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+R">Rongguang Wang</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+M">Min Liu</a>, <a 
href="/search/eess?searchtype=author&query=Wang%2C+Y">Yaonan Wang</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+J">Jiazheng Wang</a>, <a href="/search/eess?searchtype=author&query=Li%2C+H">Hao Li</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+H">Hang Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.20812v1-abstract-short" style="display: inline;"> Co-examination of second-harmonic generation (SHG) and bright-field (BF) microscopy enables the differentiation of tissue components and collagen fibers, aiding the analysis of human breast and pancreatic cancer tissues. However, large discrepancies between SHG and BF images pose challenges for current learning-based registration models in aligning SHG to BF. In this paper, we propose a novel mult… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20812v1-abstract-full').style.display = 'inline'; document.getElementById('2410.20812v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.20812v1-abstract-full" style="display: none;"> Co-examination of second-harmonic generation (SHG) and bright-field (BF) microscopy enables the differentiation of tissue components and collagen fibers, aiding the analysis of human breast and pancreatic cancer tissues. However, large discrepancies between SHG and BF images pose challenges for current learning-based registration models in aligning SHG to BF. In this paper, we propose a novel multi-modal registration framework that employs fidelity-imposed displacement editing to address these challenges. The framework integrates batch-wise contrastive learning, feature-based pre-alignment, and instance-level optimization. 
Experimental results from the Learn2Reg COMULISglobe SHG-BF Challenge validate the effectiveness of our method, securing the 1st place on the online leaderboard. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.20812v1-abstract-full').style.display = 'none'; document.getElementById('2410.20812v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.16726</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Yang%2C+G">Guanrou Yang</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+F">Fan Yu</a>, <a href="/search/eess?searchtype=author&query=Ma%2C+Z">Ziyang Ma</a>, <a href="/search/eess?searchtype=author&query=Du%2C+Z">Zhihao Du</a>, <a href="/search/eess?searchtype=author&query=Gao%2C+Z">Zhifu Gao</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+S">Shiliang Zhang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span 
class="abstract-short has-text-grey-dark mathjax" id="2410.16726v1-abstract-short" style="display: inline;"> While automatic speech recognition (ASR) systems have achieved remarkable performance with large-scale datasets, their efficacy remains inadequate in low-resource settings, encompassing dialects, accents, minority languages, and long-tail hotwords, domains with significant practical relevance. With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16726v1-abstract-full').style.display = 'inline'; document.getElementById('2410.16726v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.16726v1-abstract-full" style="display: none;"> While automatic speech recognition (ASR) systems have achieved remarkable performance with large-scale datasets, their efficacy remains inadequate in low-resource settings, encompassing dialects, accents, minority languages, and long-tail hotwords, domains with significant practical relevance. With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation provides a cost-effective and practical approach to enhancing ASR performance. Comprehensive experiments on an unprecedentedly rich variety of low-resource datasets demonstrate consistent and substantial performance improvements, proving that the proposed method of enhancing low-resource ASR through a versatile TTS model is highly effective and has broad application prospects. 
Furthermore, we delve deeper into key characteristics of synthesized speech data that contribute to ASR improvement, examining factors such as text diversity, speaker diversity, and the volume of synthesized data, with text diversity being studied for the first time in this work. We hope our findings provide helpful guidance and reference for the practical application of TTS-based data augmentation and push the advancement of low-resource ASR one step further. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16726v1-abstract-full').style.display = 'none'; document.getElementById('2410.16726v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.16662</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaolan Chen</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+R">Ruoyu Chen</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+P">Pusheng Xu</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+W">Weiyi Zhang</a>, <a 
href="/search/eess?searchtype=author&query=Shang%2C+X">Xianwen Shang</a>, <a href="/search/eess?searchtype=author&query=He%2C+M">Mingguang He</a>, <a href="/search/eess?searchtype=author&query=Shi%2C+D">Danli Shi</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.16662v1-abstract-short" style="display: inline;"> Accurate diagnosis of ophthalmic diseases relies heavily on the interpretation of multimodal ophthalmic images, a process often time-consuming and expertise-dependent. Visual Question Answering (VQA) presents a potential interdisciplinary solution by merging computer vision and natural language processing to comprehend and respond to queries about medical images. This review article explores the r… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16662v1-abstract-full').style.display = 'inline'; document.getElementById('2410.16662v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.16662v1-abstract-full" style="display: none;"> Accurate diagnosis of ophthalmic diseases relies heavily on the interpretation of multimodal ophthalmic images, a process often time-consuming and expertise-dependent. Visual Question Answering (VQA) presents a potential interdisciplinary solution by merging computer vision and natural language processing to comprehend and respond to queries about medical images. This review article explores the recent advancements and future prospects of VQA in ophthalmology from both theoretical and practical perspectives, aiming to provide eye care professionals with a deeper understanding and tools for leveraging the underlying models. Additionally, we discuss the promising trend of large language models (LLM) in enhancing various components of the VQA framework to adapt to multimodal ophthalmic tasks. 
Despite the promising outlook, ophthalmic VQA still faces several challenges, including the scarcity of annotated multimodal image datasets, the necessity of comprehensive and unified evaluation methods, and the obstacles to achieving effective real-world applications. This article highlights these challenges and clarifies future directions for advancing ophthalmic VQA with LLMs. The development of LLM-based ophthalmic VQA systems calls for collaborative efforts between medical professionals and AI experts to overcome existing obstacles and advance the diagnosis and care of eye diseases. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16662v1-abstract-full').style.display = 'none'; document.getElementById('2410.16662v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.15764</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Guo%2C+Y">Yiwei Guo</a>, <a href="/search/eess?searchtype=author&query=Li%2C+Z">Zhihan Li</a>, <a href="/search/eess?searchtype=author&query=Du%2C+C">Chenpeng Du</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+H">Hankun Wang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+K">Kai Yu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.15764v1-abstract-short" style="display: inline;"> Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability. 
LSCodec adopts a three-stage unsupervised training framework with a speaker pertur… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.15764v1-abstract-full').style.display = 'inline'; document.getElementById('2410.15764v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.15764v1-abstract-full" style="display: none;"> Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability. LSCodec adopts a three-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. A discrete token vocoder finally refines acoustic details from LSCodec. In reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than baselines. The 25 Hz version of LSCodec also achieves the lowest bitrate (0.25 kbps) among codecs so far, with decent quality. Voice conversion evaluations prove the satisfactory speaker disentanglement of LSCodec, and an ablation study further verifies the effectiveness of the proposed training framework. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.15764v1-abstract-full').style.display = 'none'; document.getElementById('2410.15764v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">5 pages, 2 figures, 4 tables. Submitted to ICASSP 2025. Demo page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.11578</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> STA-Unet: Rethink the semantic redundant for Medical Imaging Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Vasa%2C+V+K">Vamsi Krishna Vasa</a>, <a href="/search/eess?searchtype=author&query=Zhu%2C+W">Wenhui Zhu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiwen Chen</a>, <a href="/search/eess?searchtype=author&query=Qiu%2C+P">Peijie Qiu</a>, <a href="/search/eess?searchtype=author&query=Dong%2C+X">Xuanzhao Dong</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+Y">Yalin Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.11578v1-abstract-short" style="display: inline;"> In recent years, significant progress has been made in the medical image analysis domain using convolutional neural networks (CNNs). In particular, deep neural networks based on a U-shaped architecture (UNet) with skip connections have been adopted for several medical imaging tasks, including organ segmentation. 
Despite their great success, CNNs are not good at learning global or semantic features… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11578v1-abstract-full').style.display = 'inline'; document.getElementById('2410.11578v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.11578v1-abstract-full" style="display: none;"> In recent years, significant progress has been made in the medical image analysis domain using convolutional neural networks (CNNs). In particular, deep neural networks based on a U-shaped architecture (UNet) with skip connections have been adopted for several medical imaging tasks, including organ segmentation. Despite their great success, CNNs are not good at learning global or semantic features, especially ones that require human-like reasoning to understand the context. Many UNet architectures have tried to compensate by introducing Transformer-based self-attention mechanisms, with notable gains in performance. However, transformers tend to learn redundant representations at shallow layers, which often increases the cost of computing attention over nearby pixels that offer limited information. The recently introduced Super Token Attention (STA) mechanism adapts the concept of superpixels from pixel space to token space, using super tokens as compact visual representations. This approach tackles the redundancy by learning efficient global representations in vision transformers, especially for the shallow layers. In this work, we introduce the STA module into the UNet architecture (STA-UNet) to limit redundancy without losing rich information. Experimental results on four publicly available datasets demonstrate the superiority of STA-UNet over existing state-of-the-art architectures in terms of Dice score and IoU for organ segmentation tasks. The code is publicly available. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.11578v1-abstract-full').style.display = 'none'; document.getElementById('2410.11578v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.10167</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xinyan Chen</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+J">Jianfei Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.10167v2-abstract-short" style="display: inline;"> Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. 
Furthermore, existing multi-modal fusion so… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10167v2-abstract-full').style.display = 'inline'; document.getElementById('2410.10167v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.10167v2-abstract-full" style="display: none;"> Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10167v2-abstract-full').style.display = 'none'; document.getElementById('2410.10167v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 14 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.09503</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+W">Wenxi Chen</a>, <a href="/search/eess?searchtype=author&query=Ma%2C+Z">Ziyang Ma</a>, <a href="/search/eess?searchtype=author&query=Li%2C+X">Xiquan Li</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+X">Xuenan Xu</a>, <a href="/search/eess?searchtype=author&query=Liang%2C+Y">Yuzhe Liang</a>, <a href="/search/eess?searchtype=author&query=Zheng%2C+Z">Zhisheng Zheng</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+K">Kai Yu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.09503v1-abstract-short" style="display: inline;"> 
Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual reasoning capabilities, making improvements in AAC possible. In this paper, we propose SLAM-AAC to further enhance AAC with paraphrasing augmentation and CLAP-R… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.09503v1-abstract-full').style.display = 'inline'; document.getElementById('2410.09503v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.09503v1-abstract-full" style="display: none;"> Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual reasoning capabilities, making improvements in AAC possible. In this paper, we propose SLAM-AAC to further enhance AAC with paraphrasing augmentation and CLAP-Refine through LLMs. Our approach uses the self-supervised EAT model to extract fine-grained audio representations, which are then aligned with textual embeddings via lightweight linear layers. The caption generation LLM is efficiently fine-tuned using the LoRA adapter. Drawing inspiration from the back-translation method in machine translation, we implement paraphrasing augmentation to expand the Clotho dataset during pre-training. This strategy helps alleviate the limitation of scarce audio-text pairs and generates more diverse captions from a small set of audio clips. During inference, we introduce the plug-and-play CLAP-Refine strategy to fully exploit multiple decoding outputs, akin to the n-best rescoring strategy in speech recognition. 
Using the CLAP model for audio-text similarity calculation, we could select the textual descriptions generated by multiple searching beams that best match the input audio. Experimental results show that SLAM-AAC achieves state-of-the-art performance on Clotho V2 and AudioCaps, surpassing previous mainstream models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.09503v1-abstract-full').style.display = 'none'; document.getElementById('2410.09503v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.09472</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Li%2C+X">Xiquan Li</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+W">Wenxi Chen</a>, <a href="/search/eess?searchtype=author&query=Ma%2C+Z">Ziyang Ma</a>, <a href="/search/eess?searchtype=author&query=Xu%2C+X">Xuenan Xu</a>, <a href="/search/eess?searchtype=author&query=Liang%2C+Y">Yuzhe Liang</a>, <a href="/search/eess?searchtype=author&query=Zheng%2C+Z">Zhisheng Zheng</a>, <a 
href="/search/eess?searchtype=author&query=Kong%2C+Q">Qiuqiang Kong</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.09472v1-abstract-short" style="display: inline;"> While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only da… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.09472v1-abstract-full').style.display = 'inline'; document.getElementById('2410.09472v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.09472v1-abstract-full" style="display: none;"> While automated audio captioning (AAC) has made notable progress, traditional fully supervised AAC models still face two critical challenges: the need for expensive audio-text pair data for training and performance degradation when transferring across domains. To overcome these limitations, we present DRCap, a data-efficient and flexible zero-shot audio captioning system that requires text-only data for training and can quickly adapt to new domains without additional fine-tuning. DRCap integrates a contrastive language-audio pre-training (CLAP) model and a large-language model (LLM) as its backbone. During training, the model predicts the ground-truth caption with a fixed text encoder from CLAP, whereas, during inference, the text encoder is replaced with the audio encoder to generate captions for audio clips in a zero-shot manner. 
To mitigate the modality gap of the CLAP model, we use both the projection strategy from the encoder side and the retrieval-augmented generation strategy from the decoder side. Specifically, audio embeddings are first projected onto a text embedding support to absorb extensive semantic information within the joint multi-modal space of CLAP. At the same time, similar captions retrieved from a datastore are fed as prompts to instruct the LLM, incorporating external knowledge to take full advantage of its strong generative capability. Conditioned on both the projected CLAP embedding and the retrieved similar captions, the model is able to produce a more accurate and semantically rich textual description. By tailoring the text embedding support and the caption datastore to the target domain, DRCap acquires a robust ability to adapt to new domains in a training-free manner. Experimental results demonstrate that DRCap outperforms all other zero-shot models in in-domain scenarios and achieves state-of-the-art performance in cross-domain scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.09472v1-abstract-full').style.display = 'none'; document.getElementById('2410.09472v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.08463</a> <span> [<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> High-Efficient Near-Field Channel Characteristics Analysis for Large-Scale MIMO Communication Systems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Jiang%2C+H">Hao Jiang</a>, <a href="/search/eess?searchtype=author&query=Shi%2C+W">Wangqi Shi</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiao Chen</a>, <a href="/search/eess?searchtype=author&query=Zhu%2C+Q">Qiuming Zhu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+Z">Zhen Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.08463v1-abstract-short" style="display: inline;"> Large-scale multiple-input multiple-output (MIMO) holds great promise for the fifth-generation (5G) and future communication systems. In near-field scenarios, the spherical wavefront model is commonly utilized to accurately depict the propagation characteristics of large-scale MIMO communication channels. 
However, employing this modeling method necessitates the computation of angle and distance pa… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.08463v1-abstract-full').style.display = 'inline'; document.getElementById('2410.08463v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.08463v1-abstract-full" style="display: none;"> Large-scale multiple-input multiple-output (MIMO) holds great promise for the fifth-generation (5G) and future communication systems. In near-field scenarios, the spherical wavefront model is commonly utilized to accurately depict the propagation characteristics of large-scale MIMO communication channels. However, employing this modeling method necessitates the computation of angle and distance parameters for each antenna element, which leads to high computational complexity. To solve this problem, we introduce a subarray decomposition scheme that divides the whole large-scale antenna array into several smaller subarrays. This scheme is implemented in the near-field channel modeling for large-scale MIMO communications between the base stations (BS) and the mobile receiver (MR). Essential channel propagation statistics, such as spatial cross-correlation functions (CCFs), temporal auto-correlation functions (ACFs), frequency correlation functions (CFs), and channel capacities, are derived and discussed. A comprehensive analysis is conducted to investigate the influence of the height of the BS, motion characteristics of the MR, and antenna configurations on the channel statistics. Criteria of the proposed channel model, such as modeling precision and computational complexity, are also compared theoretically. Numerical results demonstrate the effectiveness of the presented communication model in obtaining a good tradeoff between modeling precision and computational complexity. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.08463v1-abstract-full').style.display = 'none'; document.getElementById('2410.08463v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.06885</a> <span> [<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+Y">Yushen Chen</a>, <a href="/search/eess?searchtype=author&query=Niu%2C+Z">Zhikang Niu</a>, <a href="/search/eess?searchtype=author&query=Ma%2C+Z">Ziyang Ma</a>, <a href="/search/eess?searchtype=author&query=Deng%2C+K">Keqi Deng</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+C">Chunhui Wang</a>, <a href="/search/eess?searchtype=author&query=Zhao%2C+J">Jian Zhao</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+K">Kai Yu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.06885v2-abstract-short" style="display: inline;"> This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow 
matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally pr… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.06885v2-abstract-full').style.display = 'inline'; document.getElementById('2410.06885v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.06885v2-abstract-full" style="display: none;"> This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. 
Demo samples can be found at We release all code and checkpoints to promote community development. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.06885v2-abstract-full').style.display = 'none'; document.getElementById('2410.06885v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 9 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.02592</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> </div> </div> <p class="title is-5 mathjax"> IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Fang%2C+Z">Zihan Fang</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+Z">Zheng Lin</a>, <a href="/search/eess?searchtype=author&query=Hu%2C+S">Senkang Hu</a>, <a href="/search/eess?searchtype=author&query=Cao%2C+H">Hangcheng Cao</a>, <a href="/search/eess?searchtype=author&query=Deng%2C+Y">Yiqin Deng</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xianhao Chen</a>, <a 
href="/search/eess?searchtype=author&query=Fang%2C+Y">Yuguang Fang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.02592v4-abstract-short" style="display: inline;"> Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state f… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02592v4-abstract-full').style.display = 'inline'; document.getElementById('2410.02592v4-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.02592v4-abstract-full" style="display: none;"> Recently, in-car monitoring has emerged as a promising technology for detecting early-stage abnormal status of the driver and providing timely alerts to prevent traffic accidents. Although training models with multimodal data enhances the reliability of abnormal status detection, the scarcity of labeled data and the imbalance of class distribution impede the extraction of critical abnormal state features, significantly deteriorating training performance. Furthermore, missing modalities due to environment and hardware limitations further exacerbate the challenge of abnormal status identification. More importantly, monitoring abnormal health conditions of passengers, particularly in elderly care, is of paramount importance but remains underexplored. To address these challenges, we introduce our IC3M, an efficient camera-rotation-based multimodal framework for monitoring both driver and passengers in a car. 
Our IC3M comprises two key modules: an adaptive threshold pseudo-labeling strategy and a missing modality reconstruction. The former customizes pseudo-labeling thresholds for different classes based on the class distribution, generating class-balanced pseudo labels to guide model training effectively, while the latter leverages crossmodality relationships learned from limited labels to accurately recover missing modalities by distribution transferring from available modalities. Extensive experimental results demonstrate that IC3M outperforms state-of-the-art benchmarks in accuracy, precision, and recall while exhibiting superior robustness under limited labeled data and severe missing modality. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02592v4-abstract-full').style.display = 'none'; document.getElementById('2410.02592v4-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 3 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">16 pages, 17 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.16644</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Wang%2C+S">Siyin Wang</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+W">Wenyi Yu</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+Y">Yudong Yang</a>, <a href="/search/eess?searchtype=author&query=Tang%2C+C">Changli Tang</a>, <a href="/search/eess?searchtype=author&query=Li%2C+Y">Yixuan Li</a>, <a href="/search/eess?searchtype=author&query=Zhuang%2C+J">Jimin Zhuang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xianzhao Chen</a>, <a href="/search/eess?searchtype=author&query=Tian%2C+X">Xiaohai Tian</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+J">Jun Zhang</a>, <a href="/search/eess?searchtype=author&query=Sun%2C+G">Guangzhi Sun</a>, <a href="/search/eess?searchtype=author&query=Lu%2C+L">Lu Lu</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+C">Chao Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.16644v1-abstract-short" 
style="display: inline;"> Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM) etc., which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16644v1-abstract-full').style.display = 'inline'; document.getElementById('2409.16644v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.16644v1-abstract-full" style="display: none;"> Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM) etc., which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B testing results, which are commonly used for evaluating text-to-speech systems. Additionally, the finetuned auditory LLM is able to generate natural language descriptions assessing aspects like noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs. Extensive experiments have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language descriptions task, a commercial model Google Gemini 1.5 Pro is also evaluated. 
The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions. Our data processing scripts and finetuned model checkpoints will be released upon acceptance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16644v1-abstract-full').style.display = 'none'; document.getElementById('2409.16644v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">submitted to ICASSP 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.16637</a> <span> [<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Deep-Learning Recognition of Scanning Transmission Electron Microscopy: Quantifying and Mitigating the Influence of Gaussian Noises </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Zhang%2C+H">Hanlei Zhang</a>, <a href="/search/eess?searchtype=author&query=Bai%2C+J">Jincheng Bai</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiabo Chen</a>, <a 
href="/search/eess?searchtype=author&query=Li%2C+C">Can Li</a>, <a href="/search/eess?searchtype=author&query=Zhong%2C+C">Chuanjian Zhong</a>, <a href="/search/eess?searchtype=author&query=Fang%2C+J">Jiye Fang</a>, <a href="/search/eess?searchtype=author&query=Zhou%2C+G">Guangwen Zhou</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.16637v1-abstract-short" style="display: inline;"> Scanning transmission electron microscopy (STEM) is a powerful tool for revealing the morphologies and structures of materials, thereby attracting intense interest from the scientific and industrial communities. The outstanding spatial (atomic level) and temporal (ms level) resolutions of the STEM techniques generate large amounts of high-definition data, thereby enabling the high-volume and hig… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16637v1-abstract-full').style.display = 'inline'; document.getElementById('2409.16637v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.16637v1-abstract-full" style="display: none;"> Scanning transmission electron microscopy (STEM) is a powerful tool for revealing the morphologies and structures of materials, thereby attracting intense interest from the scientific and industrial communities. The outstanding spatial (atomic level) and temporal (ms level) resolutions of the STEM techniques generate large amounts of high-definition data, thereby enabling the high-volume and high-speed analysis of materials. On the other hand, processing the large datasets generated by STEM is time-consuming and beyond the capacity of manual work, which urgently calls for computer-based automation. 
In this work, we present a deep-learning mask region-based neural network (Mask R-CNN) for recognizing nanoparticles imaged by STEM and generating the associated dimensional analysis. The Mask R-CNN model was tested on simulated STEM-HAADF results with different Gaussian noises, particle shapes and particle sizes, and the results indicated that Gaussian noise has a determining influence on the accuracy of recognition. By applying Gaussian and Non-Local Means filters to the noise-containing STEM-HAADF results, the influence of noise is largely mitigated, and recognition accuracy is significantly improved. This filtering-recognition approach was further applied to experimental STEM-HAADF results, yielding satisfactory accuracy compared with the traditional threshold methods. The deep-learning-based method developed in this work has great potential for analyzing the complicated structures and large data generated by STEM-HAADF. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16637v1-abstract-full').style.display = 'none'; document.getElementById('2409.16637v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.14085</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Wu%2C+H">Haibin Wu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xuanjun Chen</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+Y">Yi-Cheng Lin</a>, <a href="/search/eess?searchtype=author&query=Chang%2C+K">Kaiwei Chang</a>, <a href="/search/eess?searchtype=author&query=Du%2C+J">Jiawei Du</a>, <a href="/search/eess?searchtype=author&query=Lu%2C+K">Ke-Han Lu</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+A+H">Alexander H. 
Liu</a>, <a href="/search/eess?searchtype=author&query=Chung%2C+H">Ho-Lam Chung</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+Y">Yuan-Kuei Wu</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+D">Dongchao Yang</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+S">Songxiang Liu</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+Y">Yi-Chiao Wu</a>, <a href="/search/eess?searchtype=author&query=Tan%2C+X">Xu Tan</a>, <a href="/search/eess?searchtype=author&query=Glass%2C+J">James Glass</a>, <a href="/search/eess?searchtype=author&query=Watanabe%2C+S">Shinji Watanabe</a>, <a href="/search/eess?searchtype=author&query=Lee%2C+H">Hung-yi Lee</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.14085v1-abstract-short" style="display: inline;"> Neural audio codec models are becoming increasingly important as they serve as tokenizers for audio, enabling efficient transmission or facilitating speech language modeling. The ideal neural audio codec should maintain content, paralinguistics, speaker characteristics, and audio information even at low bitrates. Recently, numerous advanced neural codec models have been proposed. However, codec mo… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.14085v1-abstract-full').style.display = 'inline'; document.getElementById('2409.14085v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.14085v1-abstract-full" style="display: none;"> Neural audio codec models are becoming increasingly important as they serve as tokenizers for audio, enabling efficient transmission or facilitating speech language modeling. The ideal neural audio codec should maintain content, paralinguistics, speaker characteristics, and audio information even at low bitrates. 
Recently, numerous advanced neural codec models have been proposed. However, codec models are often tested under varying experimental conditions. As a result, we introduce the Codec-SUPERB challenge at SLT 2024, designed to facilitate fair and lightweight comparisons among existing codec models and inspire advancements in the field. This challenge brings together representative speech applications and objective metrics, and carefully selects license-free datasets, sampling them into small sets to reduce evaluation computation costs. This paper presents the challenge's rules, datasets, five participant systems, results, and findings. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.14085v1-abstract-full').style.display = 'none'; document.getElementById('2409.14085v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.12717</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Niu%2C+Z">Zhikang Niu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+S">Sanyuan Chen</a>, <a href="/search/eess?searchtype=author&query=Zhou%2C+L">Long Zhou</a>, <a href="/search/eess?searchtype=author&query=Ma%2C+Z">Ziyang Ma</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+S">Shujie Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.12717v1-abstract-short" style="display: inline;"> Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto-regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating in extremely low bandwidth, rooted in the sensitivity of the VQ codebook to noise. 
This degradation poses significant challe… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.12717v1-abstract-full').style.display = 'inline'; document.getElementById('2409.12717v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.12717v1-abstract-full" style="display: none;"> Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto-regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating in extremely low bandwidth, rooted in the sensitivity of the VQ codebook to noise. This degradation poses significant challenges for several downstream tasks, such as codec-based speech synthesis. To address this issue, we propose a novel VQ method, Normal Distribution-based Vector Quantization (NDVQ), by introducing an explicit margin between the VQ codes via learning a variance. Specifically, our approach involves mapping the waveform to a latent space and quantizing it by selecting the most likely normal distribution, with each codebook entry representing a unique normal distribution defined by its mean and variance. Using these distribution-based VQ codec codes, a decoder reconstructs the input waveform. NDVQ is trained with additional distribution-related losses, alongside reconstruction and discrimination losses. Experiments demonstrate that NDVQ outperforms existing audio compression baselines, such as EnCodec, in terms of audio quality and zero-shot TTS, particularly in very low bandwidth scenarios. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.12717v1-abstract-full').style.display = 'none'; document.getElementById('2409.12717v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.11543</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> </div> </div> <p class="title is-5 mathjax"> Noise-aware Dynamic Image Denoising and Positron Range Correction for Rubidium-82 Cardiac PET Imaging via Self-supervision </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Xie%2C+H">Huidong Xie</a>, <a href="/search/eess?searchtype=author&query=Guo%2C+L">Liang Guo</a>, <a href="/search/eess?searchtype=author&query=Velo%2C+A">Alexandre Velo</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+Z">Zhao Liu</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+Q">Qiong Liu</a>, <a href="/search/eess?searchtype=author&query=Guo%2C+X">Xueqi Guo</a>, <a href="/search/eess?searchtype=author&query=Zhou%2C+B">Bo Zhou</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiongchao Chen</a>, <a href="/search/eess?searchtype=author&query=Tsai%2C+Y">Yu-Jung Tsai</a>, <a href="/search/eess?searchtype=author&query=Miao%2C+T">Tianshun Miao</a>, <a href="/search/eess?searchtype=author&query=Xia%2C+M">Menghua Xia</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+Y">Yi-Hwa Liu</a>, <a href="/search/eess?searchtype=author&query=Armstrong%2C+I+S">Ian S. 
Armstrong</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+G">Ge Wang</a>, <a href="/search/eess?searchtype=author&query=Carson%2C+R+E">Richard E. Carson</a>, <a href="/search/eess?searchtype=author&query=Sinusas%2C+A+J">Albert J. Sinusas</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+C">Chi Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.11543v1-abstract-short" style="display: inline;"> Rb-82 is a radioactive isotope widely used for cardiac PET imaging. Despite numerous benefits of 82-Rb, there are several factors that limit its image quality and quantitative accuracy. First, the short half-life of 82-Rb results in noisy dynamic frames. Low signal-to-noise ratio would result in inaccurate and biased image quantification. Noisy dynamic frames also lead to highly noisy parametric… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.11543v1-abstract-full').style.display = 'inline'; document.getElementById('2409.11543v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.11543v1-abstract-full" style="display: none;"> Rb-82 is a radioactive isotope widely used for cardiac PET imaging. Despite numerous benefits of 82-Rb, there are several factors that limit its image quality and quantitative accuracy. First, the short half-life of 82-Rb results in noisy dynamic frames. Low signal-to-noise ratio would result in inaccurate and biased image quantification. Noisy dynamic frames also lead to highly noisy parametric images. The noise levels also vary substantially in different dynamic frames due to radiotracer decay and short half-life. Existing denoising methods are not applicable to this task due to the lack of paired training inputs/labels and their inability to generalize across varying noise levels. 
Second, 82-Rb emits high-energy positrons. Compared with those of other tracers such as 18-F, positrons from 82-Rb travel a longer distance before annihilation, which negatively affects image spatial resolution. Here, the goal of this study is to propose a self-supervised method for simultaneous (1) noise-aware dynamic image denoising and (2) positron range correction for 82-Rb cardiac PET imaging. Tested on a series of PET scans from a cohort of normal volunteers, the proposed method produced images with superior visual quality. To demonstrate the improvement in image quantification, we compared image-derived input functions (IDIFs) with arterial input functions (AIFs) from continuous arterial blood samples. The IDIF derived from the proposed method led to lower AUC differences, decreasing from 11.09% to 7.58% on average, compared to the original dynamic frames. The proposed method also improved the quantification of myocardium blood flow (MBF), as validated against 15-O-water scans, with mean MBF differences decreasing from 0.43 to 0.09, compared to the original dynamic frames. We also conducted a generalizability experiment on 37 patient scans obtained from a different country using a different scanner. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.11543v1-abstract-full').style.display = 'none'; document.getElementById('2409.11543v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">15 Pages, 10 Figures, 5 tables. Paper Under review. 
Oral Presentation at IEEE MIC 2023</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.11069</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> </div> </div> <p class="title is-5 mathjax"> Data-driven Dynamic Intervention Design in Network Games </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiupeng Chen</a>, <a href="/search/eess?searchtype=author&query=Monshizadeh%2C+N">Nima Monshizadeh</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.11069v1-abstract-short" style="display: inline;"> Targeted interventions in games present a challenging problem due to the asymmetric information available to the regulator and the agents. This note addresses the problem of steering the actions of self-interested agents in quadratic network games towards a target action profile. A common starting point in the literature assumes prior knowledge of utility functions and/or network parameters. The g… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.11069v1-abstract-full').style.display = 'inline'; document.getElementById('2409.11069v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.11069v1-abstract-full" style="display: none;"> Targeted interventions in games present a challenging problem due to the asymmetric information available to the regulator and the agents. This note addresses the problem of steering the actions of self-interested agents in quadratic network games towards a target action profile. 
A common starting point in the literature assumes prior knowledge of utility functions and/or network parameters. The goal of the results presented here is to remove this assumption and address scenarios where such a priori knowledge is unavailable. To this end, we design a data-driven dynamic intervention mechanism that relies solely on historical observations of agent actions and interventions. Additionally, we modify this mechanism to limit the amount of interventions, thereby considering budget constraints. Analytical convergence guarantees are provided for both mechanisms, and a numerical case study further demonstrates their effectiveness. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.11069v1-abstract-full').style.display = 'none'; document.getElementById('2409.11069v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.10969</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Xu%2C+J">Jing Xu</a>, <a href="/search/eess?searchtype=author&query=Tan%2C+D">Daxin Tan</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+J">Jiaqi Wang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiao Chen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.10969v1-abstract-short" style="display: inline;"> While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, such as VALL-E and Qwen-Audio. 
In this paper, we… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10969v1-abstract-full').style.display = 'inline'; document.getElementById('2409.10969v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.10969v1-abstract-full" style="display: none;"> While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, such as VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines with a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10969v1-abstract-full').style.display = 'none'; document.getElementById('2409.10969v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Submitted to ICASSP 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.10966</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> CUNSB-RFIE: Context-aware Unpaired Neural Schrödinger Bridge in Retinal Fundus Image Enhancement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Dong%2C+X">Xuanzhao Dong</a>, <a href="/search/eess?searchtype=author&query=Vasa%2C+V+K">Vamsi Krishna Vasa</a>, <a href="/search/eess?searchtype=author&query=Zhu%2C+W">Wenhui Zhu</a>, <a href="/search/eess?searchtype=author&query=Qiu%2C+P">Peijie Qiu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiwen Chen</a>, <a href="/search/eess?searchtype=author&query=Su%2C+Y">Yi Su</a>, <a href="/search/eess?searchtype=author&query=Xiong%2C+Y">Yujian Xiong</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+Z">Zhangsihao Yang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+Y">Yanxi Chen</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+Y">Yalin Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.10966v1-abstract-short" style="display: inline;"> Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. 
However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinge… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10966v1-abstract-full').style.display = 'inline'; document.getElementById('2409.10966v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.10966v1-abstract-full" style="display: none;"> Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinger Bridge (SB) offers a more stable solution by utilizing Optimal Transport (OT) theory to model a stochastic differential equation (SDE) between two arbitrary distributions. This allows SB to effectively transform low-quality retinal images into their high-quality counterparts. In this work, we leverage the SB framework to propose an image-to-image translation pipeline for retinal image enhancement. Additionally, previous methods often fail to capture fine structural details, such as blood vessels. To address this, we enhance our pipeline by introducing Dynamic Snake Convolution, whose tortuous receptive field can better preserve tubular structures. We name the resulting retinal fundus image enhancement framework the Context-aware Unpaired Neural Schrödinger Bridge (CUNSB-RFIE). To the best of our knowledge, this is the first endeavor to use the SB approach for retinal image enhancement. 
Experimental results on a large-scale dataset demonstrate the advantage of the proposed method compared to several state-of-the-art supervised and unsupervised methods in terms of image quality and performance on downstream tasks. The code is available at . <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10966v1-abstract-full').style.display = 'none'; document.getElementById('2409.10966v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.10376</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Ren%2C+W">Wenze Ren</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+H">Haibin Wu</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+Y">Yi-Cheng Lin</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xuanjun Chen</a>, <a href="/search/eess?searchtype=author&query=Chao%2C+R">Rong Chao</a>, <a href="/search/eess?searchtype=author&query=Hung%2C+K">Kuo-Hsuan Hung</a>, <a href="/search/eess?searchtype=author&query=Li%2C+Y">You-Jin Li</a>, <a href="/search/eess?searchtype=author&query=Ting%2C+W">Wen-Yuan Ting</a>, <a 
href="/search/eess?searchtype=author&query=Wang%2C+H">Hsin-Min Wang</a>, <a href="/search/eess?searchtype=author&query=Tsao%2C+Y">Yu Tsao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.10376v1-abstract-short" style="display: inline;"> In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNN or LSTM, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dyna… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10376v1-abstract-full').style.display = 'inline'; document.getElementById('2409.10376v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.10376v1-abstract-full" style="display: none;"> In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNN or LSTM, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. 
Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10376v1-abstract-full').style.display = 'none'; document.getElementById('2409.10376v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 16 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.09876</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Systems and Control">eess.SY</span> </div> </div> <p class="title is-5 mathjax"> A Carryover Storage Quantification Framework for Mid-Term Cascaded Hydropower Planning: A Portland General Electric System Study </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xianbang Chen</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+Y">Yikui Liu</a>, <a href="/search/eess?searchtype=author&query=Zhong%2C+Z">Zhiming Zhong</a>, <a href="/search/eess?searchtype=author&query=Fan%2C+N">Neng Fan</a>, <a href="/search/eess?searchtype=author&query=Zhao%2C+Z">Zhechong Zhao</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+L">Lei Wu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2409.09876v1-abstract-short" style="display: inline;"> Mid-term planning of cascaded hydropower systems (CHSs) determines appropriate carryover storage levels in reservoirs to optimize the usage of available water resources, i.e., maximizing the hydropower generated in the current period (i.e., immediate benefit) plus the potential hydropower generation in the future period (i.e., future value). Thus, in the mid-term CHS planning, properly quantifying… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09876v1-abstract-full').style.display = 'inline'; document.getElementById('2409.09876v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.09876v1-abstract-full" style="display: none;"> Mid-term planning of cascaded hydropower systems (CHSs) determines appropriate carryover storage levels in reservoirs to optimize the usage of available water resources, i.e., maximizing the hydropower generated in the current period (i.e., immediate benefit) plus the potential hydropower generation in the future period (i.e., future value). Thus, in the mid-term CHS planning, properly quantifying the future value deposited in carryover storage is essential to achieve a good balance between immediate benefit and future value. To this end, this paper presents a framework to quantify the future value of carryover storage, which consists of three major steps: i) constructing a module to calculate the maximum possible hydropower generation that a given level of carryover storage can deliver in the future period; ii) extracting the implicit locational marginal water value (LMWV) of carryover storage for each reservoir by applying a partition-then-extract algorithm to the constructed module; and iii) developing a set of analytical rules based on the extracted LMWV to effectively calculate the future value. 
These rules can be seamlessly integrated into mid-term CHS planning models as tractable mixed-integer linear constraints to quantify the future value properly, and can be easily visualized to offer valuable insights for CHS operators. Finally, numerical results on a CHS of Portland General Electric demonstrate the effectiveness of the presented framework in determining proper carryover storage values to facilitate mid-term CHS planning. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09876v1-abstract-full').style.display = 'none'; document.getElementById('2409.09876v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.08805</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Exploring SSL Discrete Tokens for Multilingual ASR </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Cui%2C+M">Mingyu Cui</a>, <a href="/search/eess?searchtype=author&query=Tan%2C+D">Daxin Tan</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+Y">Yifan Yang</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+D">Dingdong Wang</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+H">Huimeng Wang</a>, <a 
href="/search/eess?searchtype=author&query=Chen%2C+X">Xiao Chen</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+X">Xunying Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.08805v1-abstract-short" style="display: inline;"> With the advancement of Self-supervised Learning (SSL) in speech-related tasks, there has been growing interest in utilizing discrete tokens generated by SSL for automatic speech recognition (ASR), as they offer faster processing techniques. However, previous studies primarily focused on multilingual ASR with Fbank features or English ASR with discrete tokens, leaving a gap in adapting discrete to… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08805v1-abstract-full').style.display = 'inline'; document.getElementById('2409.08805v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.08805v1-abstract-full" style="display: none;"> With the advancement of Self-supervised Learning (SSL) in speech-related tasks, there has been growing interest in utilizing discrete tokens generated by SSL for automatic speech recognition (ASR), as they offer faster processing techniques. However, previous studies primarily focused on multilingual ASR with Fbank features or English ASR with discrete tokens, leaving a gap in adapting discrete tokens for multilingual ASR scenarios. This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. We aim to explore the performance and efficiency of speech discrete tokens across multiple language domains for both monolingual and multilingual ASR scenarios. 
Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on Fbank features in ASR tasks across seven language domains with an average word error rate (WER) reduction of 0.31% and 1.76% absolute (2.80% and 15.70% relative) on dev and test sets, respectively, with a particularly large WER reduction of 6.82% absolute (41.48% relative) on the Polish test set. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08805v1-abstract-full').style.display = 'none'; document.getElementById('2409.08805v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Submitted to ICASSP 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.08797</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Cui%2C+M">Mingyu Cui</a>, <a href="/search/eess?searchtype=author&query=Yang%2C+Y">Yifan Yang</a>, <a href="/search/eess?searchtype=author&query=Deng%2C+J">Jiajun Deng</a>, 
<a href="/search/eess?searchtype=author&query=Kang%2C+J">Jiawen Kang</a>, <a href="/search/eess?searchtype=author&query=Hu%2C+S">Shujie Hu</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+T">Tianzi Wang</a>, <a href="/search/eess?searchtype=author&query=Li%2C+Z">Zhaoqing Li</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+S">Shiliang Zhang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a>, <a href="/search/eess?searchtype=author&query=Liu%2C+X">Xunying Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.08797v1-abstract-short" style="display: inline;"> Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08797v1-abstract-full').style.display = 'inline'; document.getElementById('2409.08797v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.08797v1-abstract-full" style="display: none;"> Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. 
The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts (from preceding and future segments), or the current utterance's internal contexts alone, or both at the same time, is demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete-token-based cross-utterance context features outperforms the baseline using utterance internal context only with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WERs of 11.15% and 11.14% were obtained on the dev and test sets, respectively. Our work is open-source and publicly available at\_ASR. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08797v1-abstract-full').style.display = 'none'; document.getElementById('2409.08797v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Submitted to ICASSP 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.08731</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> </div> </div> <p class="title is-5 mathjax"> DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Du%2C+J">Jiawei Du</a>, <a href="/search/eess?searchtype=author&query=Lin%2C+I">I-Ming Lin</a>, <a href="/search/eess?searchtype=author&query=Chiu%2C+I">I-Hsiang Chiu</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xuanjun Chen</a>, <a href="/search/eess?searchtype=author&query=Wu%2C+H">Haibin Wu</a>, <a href="/search/eess?searchtype=author&query=Ren%2C+W">Wenze Ren</a>, <a href="/search/eess?searchtype=author&query=Tsao%2C+Y">Yu Tsao</a>, <a href="/search/eess?searchtype=author&query=Lee%2C+H">Hung-yi Lee</a>, <a href="/search/eess?searchtype=author&query=Jang%2C+J+R">Jyh-Shing Roger Jang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.08731v1-abstract-short" style="display: inline;"> Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human parity speech by leveraging Flow-matching and Diffusion models, respectively. Unfortunately, human-level audio synthesis leads to identity misuse and information security issues. 
Currently, many anti-spoofing models have been developed against deepfake audio. However, the efficacy of current state-of-the-art anti-s… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08731v1-abstract-full').style.display = 'inline'; document.getElementById('2409.08731v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.08731v1-abstract-full" style="display: none;"> Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human parity speech by leveraging Flow-matching and Diffusion models, respectively. Unfortunately, human-level audio synthesis leads to identity misuse and information security issues. Currently, many anti-spoofing models have been developed against deepfake audio. However, the efficacy of current state-of-the-art anti-spoofing models in countering audio synthesized by diffusion and flow-matching based TTS systems remains unknown. In this paper, we propose the Diffusion and Flow-matching based Audio Deepfake (DFADD) dataset. The DFADD dataset collects deepfake audio based on advanced diffusion and flow-matching TTS models. Additionally, we reveal that current anti-spoofing models lack sufficient robustness against highly human-like audio generated by diffusion and flow-matching TTS systems. The proposed DFADD dataset addresses this gap and provides a valuable resource for developing more resilient anti-spoofing models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08731v1-abstract-full').style.display = 'none'; document.getElementById('2409.08731v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by IEEE SLT 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.08080</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> Electromagnetic Normalization of Channel Matrix for Holographic MIMO Communications </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Yuan%2C+S+S+A">Shuai S. A. Yuan</a>, <a href="/search/eess?searchtype=author&query=Wei%2C+L">Li Wei</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaoming Chen</a>, <a href="/search/eess?searchtype=author&query=Huang%2C+C">Chongwen Huang</a>, <a href="/search/eess?searchtype=author&query=Sha%2C+W+E+I">Wei E. I. Sha</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.08080v1-abstract-short" style="display: inline;"> Holographic multiple-input and multiple-output (MIMO) communications introduce innovative antenna array configurations, such as dense arrays and volumetric arrays, which offer notable advantages over conventional planar arrays with half-wavelength element spacing. 
However, accurately assessing the performance of these new holographic MIMO systems necessitates careful consideration of channel matri… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08080v1-abstract-full').style.display = 'inline'; document.getElementById('2409.08080v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.08080v1-abstract-full" style="display: none;"> Holographic multiple-input and multiple-output (MIMO) communications introduce innovative antenna array configurations, such as dense arrays and volumetric arrays, which offer notable advantages over conventional planar arrays with half-wavelength element spacing. However, accurately assessing the performance of these new holographic MIMO systems necessitates careful consideration of channel matrix normalization, as it is influenced by array gain, which, in turn, depends on the array topology. Traditional normalization methods may be insufficient for assessing these advanced array topologies, potentially resulting in misleading or inaccurate evaluations. In this study, we propose electromagnetic normalization approaches for the channel matrix that accommodate arbitrary array topologies, drawing on the array gains from analytical, physical, and full-wave methods. Additionally, we introduce a normalization method for near-field MIMO channels based on a rigorous dyadic Green's function approach, which accounts for potential losses of gain at near field. Finally, we perform capacity analyses under quasi-static, ergodic, and near-field conditions, through adopting the proposed normalization techniques. Our findings indicate that channel matrix normalization should reflect the realized gains of the antenna array in target directions. 
Failing to accurately normalize the channel matrix can result in errors when evaluating the performance limits and benefits of unconventional holographic array topologies, potentially compromising the optimal design of holographic MIMO systems. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.08080v1-abstract-full').style.display = 'none'; document.getElementById('2409.08080v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.07040</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> </div> </div> <p class="title is-5 mathjax"> Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xianmin Chen</a>, <a href="/search/eess?searchtype=author&query=Huang%2C+P">Peiliang Huang</a>, <a href="/search/eess?searchtype=author&query=Feng%2C+X">Xiaoxu Feng</a>, <a href="/search/eess?searchtype=author&query=Zhang%2C+D">Dingwen Zhang</a>, <a href="/search/eess?searchtype=author&query=Han%2C+L">Longfei Han</a>, <a href="/search/eess?searchtype=author&query=Han%2C+J">Junwei Han</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2409.07040v1-abstract-short" style="display: inline;"> Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, lead to limited denoisin… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.07040v1-abstract-full').style.display = 'inline'; document.getElementById('2409.07040v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.07040v1-abstract-full" style="display: none;"> Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, lead to limited denoising performance. In contrast, two-stage approaches typically decompose a raw image with color filter arrays (CFA) into a four-channel RGGB format before feeding it into a neural network. However, this strategy overlooks the critical role of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we design a novel Mamba scanning mechanism, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we present a Retinex Decomposition Module (RDM) grounded in Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction. 
By bridging demosaicing and denoising, better raw image enhancement is achieved. Experimental evaluations conducted on public datasets SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.07040v1-abstract-full').style.display = 'none'; document.getElementById('2409.07040v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.06035</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Analyzing Tumors by Synthesis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Chen%2C+Q">Qi Chen</a>, <a href="/search/eess?searchtype=author&query=Lai%2C+Y">Yuxiang Lai</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xiaoxi Chen</a>, <a href="/search/eess?searchtype=author&query=Hu%2C+Q">Qixin Hu</a>, <a href="/search/eess?searchtype=author&query=Yuille%2C+A">Alan Yuille</a>, <a href="/search/eess?searchtype=author&query=Zhou%2C+Z">Zongwei Zhou</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.06035v1-abstract-short" style="display: inline;"> Computer-aided 
tumor detection has shown great potential in enhancing the interpretation of over 80 million CT scans performed annually in the United States. However, challenges arise due to the rarity of CT scans with tumors, especially early-stage tumors. Developing AI with real tumor data faces issues of scarcity, annotation difficulty, and low prevalence. Tumor synthesis addresses these challe… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.06035v1-abstract-full').style.display = 'inline'; document.getElementById('2409.06035v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.06035v1-abstract-full" style="display: none;"> Computer-aided tumor detection has shown great potential in enhancing the interpretation of over 80 million CT scans performed annually in the United States. However, challenges arise due to the rarity of CT scans with tumors, especially early-stage tumors. Developing AI with real tumor data faces issues of scarcity, annotation difficulty, and low prevalence. Tumor synthesis addresses these challenges by generating numerous tumor examples in medical images, aiding AI training for tumor detection and segmentation. Successful synthesis requires realistic and generalizable synthetic tumors across various organs. This chapter reviews AI development on real and synthetic data and summarizes two key trends in synthetic data for cancer imaging research: modeling-based and learning-based approaches. Modeling-based methods, like Pixel2Cancer, simulate tumor development over time using generic rules, while learning-based methods, like DiffTumor, learn from a few annotated examples in one organ to generate synthetic tumors in others. Reader studies with expert radiologists show that synthetic tumors can be convincingly realistic. 
Case studies in the liver, pancreas, and kidneys also reveal that AI trained on synthetic tumors can achieve performance comparable to, or better than, AI trained only on real data. Tumor synthesis holds significant promise for expanding datasets, enhancing AI reliability, improving tumor detection performance, and preserving patient privacy. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.06035v1-abstract-full').style.display = 'none'; document.getElementById('2409.06035v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted as a chapter in the Springer Book: "Generative Machine Learning Models in Medical Image Computing."</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.01995</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/eess?searchtype=author&query=Guo%2C+Y">Yiwei Guo</a>, <a href="/search/eess?searchtype=author&query=Li%2C+Z">Zhihan Li</a>, <a 
href="/search/eess?searchtype=author&query=Li%2C+J">Junjie Li</a>, <a href="/search/eess?searchtype=author&query=Du%2C+C">Chenpeng Du</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+H">Hankun Wang</a>, <a href="/search/eess?searchtype=author&query=Wang%2C+S">Shuai Wang</a>, <a href="/search/eess?searchtype=author&query=Chen%2C+X">Xie Chen</a>, <a href="/search/eess?searchtype=author&query=Yu%2C+K">Kai Yu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.01995v2-abstract-short" style="display: inline;"> We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adap… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01995v2-abstract-full').style.display = 'inline'; document.getElementById('2409.01995v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.01995v2-abstract-full" style="display: none;"> We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. 
In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Also, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effects of the proposed techniques. Moreover, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows that timbre can potentially be manipulated by speech token vocoders alone, pushing the frontiers of VC and speech synthesis. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01995v2-abstract-full').style.display = 'none'; document.getElementById('2409.01995v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 11 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 3 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">5 pages, 4 figures. Submitted to ICASSP 2025. Demo page: 