arXiv:2411.12360
An Affine Equivalence Algorithm for S-boxes based on Matrix Invariants
Authors: Xincheng Hu, Xiao Zeng, Zhaoqiang Liu, Guowu Yang
Abstract: We investigate the affine equivalence (AE) problem of S-boxes. Given two S-boxes denoted as $S_1$ and $S_2$, we aim to seek two invertible AE transformations $A,B$ such that $S_1\circ A = B\circ S_2$ holds. Due to important applications in the analysis and design of block ciphers, the investigation of AE algorithms has gained growing significance. In this paper, we first propose zeroization on S-boxes, by which the AE problem can be transformed into $2^n$ linear equivalence problems. Secondly, we propose the standard orthogonal spatial matrix (SOSM), whose rank is invariant under AE transformations. Finally, based on the zeroization operation and the SOSM method, we propose a depth-first search (DFS) method for determining AE of S-boxes, named the AE\_SOSM\_DFS algorithm. Using this matrix invariant, we reduce the time complexity of the algorithm to approximately $\frac{1}{2^n}$ of the complexity without SOSM. Specifically, the complexity of our algorithm is $O(2^{3n})$. In addition, we also conducted experiments with non-invertible S-boxes, and the performance is similar to that of invertible S-boxes. Moreover, our proposed algorithm can effectively handle S-boxes with low algebraic degree or certain popular S-boxes such as AES and ARIA\_s2, which are difficult for the algorithm proposed by Dinur (2018) to handle.
arXiv:2411.10996
Gadgetless Lifting Beats Round Elimination: Improved Lower Bounds for Pointer Chasing
Authors: Xinyu Mao, Guangxu Yang, Jiapeng Zhang
Abstract: We prove an Ω(n/k+k) communication lower bound on (k-1)-round distributional complexity of the k-step pointer chasing problem under uniform input distribution, improving the Ω(n/k - k log n) lower bound due to Yehudayoff (Combinatorics Probability and Computing, 2020). Our lower bound almost matches the upper bound of O(n/k + k) communication by Nisan and Wigderson (STOC 91). As part of our approach, we put forth gadgetless lifting, a new framework that lifts lower bounds for a family of restricted protocols into lower bounds for general protocols. A key step in gadgetless lifting is choosing the appropriate definition of restricted protocols. In this paper, our definition of restricted protocols is inspired by the structure-vs-pseudorandomness decomposition by Göös, Pitassi, and Watson (FOCS 17) and Yang and Zhang (STOC 24). Previously, round-communication trade-offs were mainly obtained by round elimination and information complexity.
arXiv:2411.10789
Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts
Authors: Yijian Gao, Dominic Marshall, Xiaodan Xing, Junzhi Ning, Giorgos Papanastasiou, Guang Yang, Matthieu Komorowski
Abstract: Radiology reporting generative AI holds significant potential to alleviate clinical workloads and streamline medical care. However, achieving high clinical accuracy is challenging, as radiological images often feature subtle lesions and intricate structures. Existing systems often fall short, largely due to their reliance on fixed-size, patch-level image features and insufficient incorporation of pathological information. This can result in the neglect of such subtle patterns and inconsistent descriptions of crucial pathologies. To address these challenges, we propose an innovative approach that leverages pathology-aware regional prompts to explicitly integrate anatomical and pathological information of various scales, significantly enhancing the precision and clinical relevance of generated reports. We develop an anatomical region detector that extracts features from distinct anatomical areas, coupled with a novel multi-label lesion detector that identifies global pathologies. Our approach emulates the diagnostic process of radiologists, producing clinically accurate reports with comprehensive diagnostic capabilities.
arXiv:2411.10669
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
Authors: Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu
Abstract: As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL.
arXiv:2411.08993
Parameter Inference via Differentiable Diffusion Bridge Importance Sampling
Authors: Nicklas Boserup, Gefan Yang, Michael Lind Severinsen, Christy Anna Hipsley, Stefan Sommer
Abstract: We introduce a methodology for performing parameter inference in high-dimensional, non-linear diffusion processes. We illustrate its applicability for obtaining insights into the evolution of and relationships between species, including ancestral state reconstruction. Estimation is performed by utilising score matching to approximate diffusion bridges, which are subsequently used in an importance sampler to estimate log-likelihoods. The entire setup is differentiable, allowing gradient ascent on approximated log-likelihoods. This allows both parameter inference and diffusion mean estimation.
arXiv:2411.06680
Anchor Attention, Small Cache: Code Generation with Large Language Models
Authors: Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen
Abstract: The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand for computation resources has hindered broader deployment and raised environmental concerns. A common strategy for diminishing computational demands is to cache Key-Value (KV) states from the attention mechanism, which is adopted predominantly by mainstream LLMs. It can mitigate the need for repeated attention computations, but brings significant memory overhead. Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we analyze the attention weights distribution within code generation models via an empirical study, uncovering a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress the contextual information, and layer-wise anchor attention enabling cross-layer communication to mitigate the issue of excessive superposition caused by the compression.
Submitted 10 November, 2024; originally announced November 2024.
Comments: 14 pages, 8 figures
MSC Class: 68N19 ACM Class: D.2.3
arXiv:2411.06667
DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions
Authors: Shu-Tong Niu, Jun Du, Ruo-Yu Wang, Gao-Bin Yang, Tian Gao, Jia Pan, Yu Hu
Abstract: We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end speech recognition, combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization module effectively. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries during decoding. Additionally, we incorporate an optional multi-input multi-output speech enhancement module (MIMO-SE) within the DCF-DS framework, which offers further performance gains. Finally, we enhance diarization results by re-clustering DCF-DS outputs, improving ASR accuracy. By incorporating the DCF-DS method, we achieved first place in the realistic single-channel track of the CHiME-8 NOTSOFAR-1 challenge.
Submitted 10 November, 2024; originally announced November 2024.
arXiv:2411.06437
CTC-Assisted LLM-Based Contextual ASR
Authors: Guanrou Yang, Ziyang Ma, Zhifu Gao, Shiliang Zhang, Xie Chen
Abstract: Contextual ASR or hotword customization holds substantial practical value. Despite the impressive performance of current end-to-end (E2E) automatic speech recognition (ASR) systems, they often face challenges in accurately recognizing rare words. Typical E2E contextual ASR models commonly feature complex architectures and decoding mechanisms, and are limited in performance and susceptible to interference from distractor words. With large language model (LLM)-based ASR models emerging as the new mainstream, we propose a CTC-Assisted LLM-Based Contextual ASR model with an efficient filtering algorithm. By using coarse CTC decoding results to filter potentially relevant hotwords and incorporating them into the LLM prompt input, our model attains WER/B-WER of 1.27%/3.67% and 2.72%/8.02% on the Librispeech test-clean and test-other sets when targeting rare long-tail words, demonstrating significant improvements compared to the baseline LLM-based ASR model, and substantially surpassing other related work.
Submitted 10 November, 2024; originally announced November 2024.
Comments: SLT 2024
arXiv:2411.05261
Decoding Report Generators: A Cyclic Vision-Language Adapter for Counterfactual Explanations
Authors: Yingying Fang, Zihao Jin, Shaojie Guo, Jinda Liu, Yijian Gao, Junzhi Ning, Zhiling Yue, Zhi Li, Simon LF Walsh, Guang Yang
Abstract: Despite significant advancements in report generation methods, a critical limitation remains: the lack of interpretability in the generated text. This paper introduces an innovative approach to enhance the explainability of text generated by report generation models. Our method employs cyclic text manipulation and visual comparison to identify and elucidate the features in the original content that influence the generated text. By manipulating the generated reports and producing corresponding images, we create a comparative framework that highlights key attributes and their impact on the text generation process. This approach not only identifies the image features aligned to the generated text and improves transparency but also provides deeper insights into the decision-making mechanisms of the report generation models.
Submitted 7 November, 2024; originally announced November 2024.
arXiv:2411.04399
ProGraph: Temporally-alignable Probability Guided Graph Topological Modeling for 3D Human Reconstruction
Authors: Hongsheng Wang, Zehui Feng, Tong Xiao, Genfan Yang, Shengyu Zhang, Fei Wu, Feng Lin
Abstract: Current 3D human motion reconstruction methods from monocular videos rely on features within the current reconstruction window, leading to distortion and deformations in the human structure under local occlusions or blurriness in video frames. To estimate realistic 3D human mesh sequences based on incomplete features, we propose Temporally-alignable Probability Guided Graph Topological Modeling for 3D Human Reconstruction (ProGraph). For missing parts recovery, we exploit the explicit topological-aware probability distribution across the entire motion sequence. To restore the complete human, Graph Topological Modeling (GTM) learns the underlying topological structure, focusing on the relationships inherent in the individual parts. Next, to generate blurred motion parts, Temporal-alignable Probability Distribution (TPDist) utilizes the GTM to predict features based on distribution. This interactive mechanism facilitates motion consistency, allowing the restoration of human parts. Furthermore, Hierarchical Human Loss (HHLoss) constrains the probability distribution errors of inter-frame features during topological structure variation.
Submitted 6 November, 2024; originally announced November 2024.
arXiv:2411.04387
Automated Update of Android Deprecated API Usages with Large Language Models
Authors: Tarek Mahmud, Bin Duan, Meiru Che, Awatif Yasmin, Anne H. H. Ngu, Guowei Yang
Abstract: Android apps rely on application programming interfaces (APIs) to access various functionalities of Android devices. These APIs, however, are regularly updated to incorporate new features while the old APIs get deprecated. Even though the importance of updating deprecated API usages with the recommended replacement APIs has been widely recognized, it is non-trivial to update the deprecated API usages. Therefore, the usages of deprecated APIs linger in Android apps and cause compatibility issues in practice. This paper introduces GUPPY, an automated approach that utilizes large language models (LLMs) to update Android deprecated API usages. By employing carefully crafted prompts, GUPPY leverages GPT-4, one of the most powerful LLMs, to update deprecated-API usages, ensuring compatibility in both the old and new API levels. Additionally, GUPPY uses GPT-4 to generate tests, identify incorrect updates, and refine the API usage through an iterative process until the tests pass or a specified limit is reached.
Submitted 6 November, 2024; originally announced November 2024.
arXiv:2411.03551
Enhancing Weakly Supervised Semantic Segmentation for Fibrosis via Controllable Image Generation
Authors: Zhiling Yue, Yingying Fang, Liutao Yang, Nikhil Baid, Simon Walsh, Guang Yang
Abstract: Fibrotic Lung Disease (FLD) is a severe condition marked by lung stiffening and scarring, leading to respiratory decline. High-resolution computed tomography (HRCT) is critical for diagnosing and monitoring FLD; however, fibrosis appears as irregular, diffuse patterns with unclear boundaries, leading to high inter-observer variability and time-intensive manual annotation. To tackle this challenge, we propose DiffSeg, a novel weakly supervised semantic segmentation (WSSS) method that uses image-level annotations to generate pixel-level fibrosis segmentation, reducing the need for fine-grained manual labeling. Additionally, our DiffSeg incorporates a diffusion-based generative model to synthesize HRCT images with different levels of fibrosis from healthy slices, enabling the generation of the fibrosis-injected slices and their paired fibrosis location.
Submitted 5 November, 2024; originally announced November 2024.
arXiv:2411.02941
A Mamba Foundation Model for Time Series Forecasting
Authors: Haoyu Ma, Yushu Chen, Wenlai Zhao, Jinzhe Yang, Yingsheng Ji, Xinghua Xu, Xiaozhu Liu, Hao Jing, Shengzhuo Liu, Guangwen Yang
Abstract: Time series foundation models have demonstrated strong performance in zero-shot learning, making them well-suited for predicting rapidly evolving patterns in real-world applications where relevant training data are scarce. However, most of these models rely on the Transformer architecture, which incurs quadratic complexity as input length increases. To address this, we introduce TSMamba, a linear-complexity foundation model for time series forecasting built on the Mamba architecture. The model captures temporal dependencies through both forward and backward Mamba encoders, achieving high prediction accuracy. To reduce reliance on large datasets and lower training costs, TSMamba employs a two-stage transfer learning process that leverages pretrained Mamba LLMs, allowing effective time series modeling with a moderate training set. In the first stage, the forward and backward backbones are optimized via patch-wise autoregressive prediction; in the second stage, the model trains a prediction head and refines other components for long-term forecasting. While the backbone assumes channel independence to manage varying channel numbers across datasets, a channel-wise compressed attention module is introduced to capture cross-channel dependencies during fine-tuning on specific multivariate datasets. Experiments show that TSMamba's zero-shot performance is comparable to state-of-the-art time series foundation models, despite using significantly less training data. It also achieves competitive or superior full-shot performance compared to task-specific prediction models.
Submitted 5 November, 2024; originally announced November 2024.
arXiv:2411.01391
Differentiable Quantum Computing for Large-scale Linear Control
Authors: Connor Clayton, Jiaqi Leng, Gengzhi Yang, Yi-Ling Qiao, Ming C. Lin, Xiaodi Wu
Abstract: As industrial models and designs grow increasingly complex, the demand for optimal control of large-scale dynamical systems has significantly increased. However, traditional methods for optimal control incur significant overhead as problem dimensions grow. In this paper, we introduce an end-to-end quantum algorithm for linear-quadratic control with provable speedups. Our algorithm, based on a policy gradient method, incorporates a novel quantum subroutine for solving the matrix Lyapunov equation. Specifically, we build a quantum-assisted differentiable simulator for efficient gradient estimation that is more accurate and robust than classical methods relying on stochastic approximation. Compared to the classical approaches, our method achieves a super-quadratic speedup.
Submitted 2 November, 2024; originally announced November 2024.
arXiv:2411.01172
Covariance-based Space Regularization for Few-shot Class Incremental Learning
Authors: Yijie Hu, Guanyu Yang, Zhaorui Tan, Xiaowei Huang, Kaizhu Huang, Qiu-Feng Wang
Abstract: Few-shot Class Incremental Learning (FSCIL) presents a challenging yet realistic scenario, which requires the model to continually learn new classes with limited labeled data (i.e., incremental sessions) while retaining knowledge of previously learned base classes (i.e., base sessions). Due to the limited data in incremental sessions, models are prone to overfitting new classes and suffering catastrophic forgetting of base classes. To tackle these issues, recent advancements resort to prototype-based approaches to constrain the base class distribution and learn discriminative representations of new classes. Despite the progress, the limited data issue still induces an ill-divided feature space, leading the model to confuse the new class with old classes or fail to achieve good separation among new classes. In this paper, we aim to mitigate these issues by directly constraining the span of each class distribution from a covariance perspective. In detail, we propose a simple yet effective covariance constraint loss to force the model to learn each class distribution with the same covariance matrix. In addition, we propose a perturbation approach to perturb the few-shot training samples in the feature space, which encourages the samples to be away from the weighted distribution of other classes. Regarding perturbed samples as new class data, the classifier is forced to establish explicit boundaries between each new class and the existing ones. Our approach is easy to integrate into existing FSCIL approaches to boost performance.
Submitted 2 November, 2024; originally announced November 2024.
Comments: WACV2025, 10 pages, 5 figures
arXiv:2411.00114
Project Sid: Many-agent simulations toward AI civilization
Authors: Altera. AL, Andrew Ahn, Nic Becker, Stephanie Carroll, Nico Christie, Manuel Cortes, Arda Demirci, Melissa Du, Frankie Li, Shuying Luo, Peter Y Wang, Mathew Willows, Feitong Yang, Guangyu Robert Yang
Abstract: AI agents have been evaluated in isolation or within small groups, where interactions remain limited in scope and complexity. Large-scale simulations involving many autonomous agents -- reflecting the full spectrum of civilizational processes -- have yet to be explored. Here, we demonstrate how 10 - 1000+ AI agents behave and progress within agent societies. We first introduce the PIANO (Parallel Information Aggregation via Neural Orchestration) architecture, which enables agents to interact with humans and other agents in real-time while maintaining coherence across multiple output streams. We then evaluate agent performance in agent simulations using civilizational benchmarks inspired by human history. These simulations, set within a Minecraft environment, reveal that agents are capable of meaningful progress -- autonomously developing specialized roles, adhering to and changing collective rules, and engaging in cultural and religious transmission.
Submitted 31 October, 2024; originally announced November 2024.
Comments: 35 pages, 14 figures
arXiv:2411.00083
Learning Visual Parkour from Generated Images
Authors: Alan Yu, Ge Yang, Ran Choi, Yajvan Ravan, John Leonard, Phillip Isola
Abstract: Fast and accurate physics simulation is an essential component of robot learning, where robots can explore failure scenarios that are difficult to produce in the real world and learn from unlimited on-policy data. Yet, it remains challenging to incorporate RGB-color perception into the sim-to-real pipeline that matches the real world in its richness and realism. In this work, we train a robot dog in simulation for visual parkour. We propose a way to use generative models to synthesize diverse and physically accurate image sequences of the scene from the robot's ego-centric perspective.
Submitted 31 October, 2024; originally announced November 2024.
Comments: 17 pages, 19 figures
arXiv:2410.22793
Less is More: DocString Compression in Code Generation
Authors: Guang Yang, Yu Zhou, Wei Cheng, Xiangyu Zhang, Xiang Chen, Terry Yue Zhuo, Ke Liu, Xin Zhou, David Lo, Taolue Chen
Abstract: The widespread use of Large Language Models (LLMs) in software engineering has intensified the need for improved model and resource efficiency. In particular, for neural code generation, LLMs are used to translate function/method signature and DocString to executable code. DocStrings, which capture user requirements for the code and are used as the prompt for LLMs, often contain redundant information. Recent advancements in prompt compression have shown promising results in Natural Language Processing (NLP), but their applicability to code generation remains uncertain. Our empirical study shows that the state-of-the-art prompt compression methods achieve only about 10% reduction, as further reductions would cause significant performance degradation. In our study, we propose a novel compression method, ShortenDoc, dedicated to DocString compression for code generation. Our extensive experiments on six code generation datasets, five open-source LLMs (1B to 10B parameters), and one closed-source LLM GPT-4o confirm that ShortenDoc achieves 25-40% compression while preserving the quality of generated code, outperforming other baseline methods at similar compression levels.
Submitted 31 October, 2024; v1 submitted 30 October, 2024; originally announced October 2024.
Comments: UNDER REVIEW
arXiv:2410.21968
Automated Vulnerability Detection Using Deep Learning Technique
Authors: Guan-Yan Yang, Yi-Heng Ko, Farn Wang, Kuo-Hui Yeh, Haw-Shiang Chang, Hsueh-Yi Chen
Abstract: Our work explores the utilization of deep learning, specifically leveraging the CodeBERT model, to enhance code security testing for Python applications by detecting SQL injection vulnerabilities. Unlike traditional security testing methods that may be slow and error-prone, our approach transforms source code into vector representations and trains a Long Short-Term Memory (LSTM) model to identify vulnerable patterns. When compared with existing static application security testing (SAST) tools, our model displays superior performance, achieving higher precision, recall, and F1-score.
Submitted 29 October, 2024; originally announced October 2024.
Comments: 4 pages, 1 figure; Presented at The 30th International Conference on Computational & Experimental Engineering and Sciences (ICCES2024)
ACM Class: D.2.4; D.2.5
arXiv:2410.21352
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment
Authors: Ge Yang, Changyi He, Jinyang Guo, Jianyu Wu, Yifu Ding, Aishan Liu, Haotong Qin, Pengliang Ji, Xianglong Liu
Abstract: Although large language models (LLMs) have demonstrated strong intelligence, the high demand for computation and storage hinders their practical application. To this end, many model compression techniques have been proposed to increase the efficiency of LLMs. However, current research validates these methods only on limited models, datasets, metrics, etc., and still lacks a comprehensive evaluation under more general scenarios. It therefore remains an open question which model compression approach should be used in a specific case. To bridge this gap, we present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis for LLM compression algorithms. We first analyze the actual model production requirements and carefully design evaluation tracks and metrics. Then, we conduct extensive experiments and comparisons using multiple mainstream LLM compression approaches. Finally, we perform an in-depth analysis based on the evaluation and provide useful insight for LLM compression design. We hope our LLMCBench can contribute insightful suggestions for LLM compression algorithm design and serve as a foundation for future research.
Submitted 31 October, 2024; v1 submitted 28 October, 2024; originally announced October 2024.
Comments: Accepted by NeurIPS 2024 Datasets and Benchmarks Track
arXiv:2410.16726
Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap
Authors: Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen
Abstract: While automatic speech recognition (ASR) systems have achieved remarkable performance with large-scale datasets, their efficacy remains inadequate in low-resource settings, encompassing dialects, accents, minority languages, and long-tail hotwords, domains with significant practical relevance. With the advent of versatile and powerful text-to-speech (TTS) models, capable of generating speech with human-level naturalness, expressiveness, and diverse speaker profiles, leveraging TTS for ASR data augmentation provides a cost-effective and practical approach to enhancing ASR performance. Comprehensive experiments on an unprecedentedly rich variety of low-resource datasets demonstrate consistent and substantial performance improvements, proving that the proposed method of enhancing low-resource ASR through a versatile TTS model is highly effective and has broad application prospects. Furthermore, we delve deeper into key characteristics of synthesized speech data that contribute to ASR improvement, examining factors such as text diversity, speaker diversity, and the volume of synthesized data, with text diversity being studied for the first time in this work.
Submitted 22 October, 2024; originally announced October 2024.
arXiv:2410.16259
Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
Authors: Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa
Abstract: We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Different from prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively through video observations recorded over a long time-span (e.g., a month) in a single environment. Modeling 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired data of perception and motion of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator.
Submitted 21 October, 2024; originally announced October 2024.
Comments: Project page:
arXiv:2410.13896
From Real Artifacts to Virtual Reference: A Robust Framework for Translating Endoscopic Images
Authors: Junyang Wu, Fangfang Xie, Jiayuan Sun, Yun Gu, Guang-Zhong Yang
Abstract: Domain adaptation, which bridges the distributions across different modalities, plays a crucial role in multimodal medical image analysis. In endoscopic imaging, combining pre-operative data with intra-operative imaging is important for surgical planning and navigation. However, existing domain adaptation methods are hampered by distribution shift caused by in vivo artifacts, necessitating robust techniques for aligning noisy and artifact-abundant patient endoscopic videos with clean virtual images reconstructed from pre-operative tomographic data for pose estimation during intraoperative guidance. This paper presents an artifact-resilient image translation method and an associated benchmark for this purpose. The method incorporates a novel "local-global" translation framework and a noise-resilient feature extraction strategy. For the former, it decouples the image translation process into a local step for feature denoising, and a global step for global style transfer. For feature extraction, a new contrastive learning strategy is proposed, which can extract noise-resilient features for establishing robust correspondence across domains.
Submitted 23 October, 2024; v1 submitted 14 October, 2024; originally announced October 2024.
arXiv:2410.13823
Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning
Authors: Xiaodan Xing, Junzhi Ning, Yang Nan, Guang Yang
Abstract: Deep generative models have significantly advanced medical imaging analysis by enhancing dataset size and quality. Beyond mere data augmentation, our research in this paper highlights an additional, significant capacity of deep generative models: their ability to reveal and demonstrate patterns in medical images. We employ a generative structure with hybrid conditions, combining clinical data and segmentation masks to guide the image synthesis process. Furthermore, we innovatively transformed the tabular clinical data into textual descriptions. This approach simplifies the handling of missing values and also enables us to leverage large pre-trained vision-language models that investigate the relations between independent clinical entries and comprehend general terms, such as gender and smoking status. Our approach differs from, and presents a more challenging task than, traditional medical report-guided synthesis because our clinical information is less visually correlated with the images. To overcome this, we introduce a text-visual embedding mechanism that strengthens the conditions, ensuring the network effectively utilizes the provided information. Our pipeline is generalizable to both GAN-based and diffusion models.
Submitted 17 October, 2024; originally announced October 2024.
Comments: Accepted by AIM-FM Workshop of NeurIPS2024
arXiv:2410.13794
Arbitrarily-Conditioned Multi-Functional Diffusion for Multi-Physics Emulation
Authors: Da Long, Zhitong Xu, Guang Yang, Akil Narayan, Shandian Zhe
Abstract: Modern physics simulation often involves multiple functions of interest, and traditional numerical approaches are known to be complex and computationally costly. While machine learning-based surrogate models can offer significant cost reductions, most focus on a single task, such as forward prediction, and typically lack uncertainty quantification -- an essential component in many applications. To overcome these limitations, we propose Arbitrarily-Conditioned Multi-Functional Diffusion (ACMFD), a versatile probabilistic surrogate model for multi-physics emulation. ACMFD can perform a wide range of tasks within a single framework, including forward prediction, various inverse problems, and simulating data for entire systems or subsets of quantities conditioned on others. Specifically, we extend the standard Denoising Diffusion Probabilistic Model (DDPM) for multi-functional generation by modeling noise as Gaussian processes (GP). We then introduce an innovative denoising loss. The training involves randomly sampling the conditioned part and fitting the corresponding predicted noise to zero, enabling ACMFD to flexibly generate function values conditioned on any other functions or quantities. To enable efficient training and sampling, and to flexibly handle irregularly sampled data, we use GPs to interpolate function samples onto a grid, inducing a Kronecker product structure for efficient computation. We demonstrate the advantages of ACMFD across several fundamental multi-physics systems.
Submitted 17 October, 2024; originally announced October 2024.
arXiv:2410.12750
Comparative Analysis of Extrinsic Factors for NER in French
Authors: Grace Yang, Zhiyi Li, Yadong Liu, Jungyeul Park
Abstract: Named entity recognition (NER) is a crucial task that aims to identify structured information, which is often replete with complex, technical terms and a high degree of variability. Accurate and reliable NER can facilitate the extraction and analysis of important information. However, NER for languages other than English is challenging due to limited data availability, as high expertise, time, and expense are required to annotate the data. In this paper, using the limited data, we explore various factors including model structure, corpus annotation scheme and data augmentation techniques to improve the performance of an NER model for French. Our experiments demonstrate that these approaches can significantly improve the model's F1 score from the original CRF score of 62.41 to 79.39. Our findings suggest that considering different extrinsic factors and combining these techniques is a promising approach for improving NER performance where the size of data is limited.
Submitted 17 October, 2024; v1 submitted 16 October, 2024; originally announced October 2024.
arXiv:2410.12588
FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
Authors: Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang
Abstract: Fail-slows, or stragglers, are common but largely unheeded problems in large-scale hybrid-parallel training that spans thousands of GPU servers and runs for weeks to months. Yet, these problems are not well studied, nor can they be quickly detected and effectively mitigated. In this paper, we first present a characterization study on a shared production cluster with over 10,000 GPUs. We find that fail-slows are caused by various CPU/GPU computation and cross-node networking issues, lasting from tens of seconds to nearly ten hours, and collectively delaying the average job completion time by 1.34%. The current practice is to manually detect these fail-slows and simply treat them as fail-stops using a checkpoint-and-restart failover approach, which is labor-intensive and time-consuming. In this paper, we propose FALCON, a framework that rapidly identifies fail-slowed GPUs and/or communication links, and effectively tackles them with a novel multi-level mitigation mechanism, all without human intervention. We have applied FALCON to detect human-labeled fail-slows in a production cluster with over 99% accuracy. Cluster deployment further demonstrates that FALCON effectively handles manually injected fail-slows, mitigating the training slowdown by 60.1%.
Submitted 16 October, 2024; originally announced October 2024.
Comments: 17 pages, 20 figures
arXiv:2410.11165
Toward Efficient Kernel-Based Solvers for Nonlinear PDEs
Authors: Zhitong Xu, Da Long, Yiming Xu, Guang Yang, Shandian Zhe, Houman Owhadi
Abstract: This paper introduces a novel kernel learning framework toward efficiently solving nonlinear partial differential equations (PDEs). In contrast to the state-of-the-art kernel solver that embeds differential operators within kernels, posing challenges with a large number of collocation points, our approach eliminates these operators from the kernel. We model the solution using a standard kernel interpolation form and differentiate the interpolant to compute the derivatives. Our framework obviates the need for complex Gram matrix construction between solutions and their derivatives, allowing for a straightforward implementation and scalable computation. As an instance, we allocate the collocation points on a grid and adopt a product kernel, which yields a Kronecker product structure in the interpolation. This structure enables us to avoid computing the full Gram matrix, reducing costs and scaling efficiently to a large number of collocation points. We provide a proof of the convergence and rate analysis of our method under appropriate regularity assumptions. In numerical experiments, we demonstrate the advantages of our method in solving several benchmark PDEs.
Submitted 3 November, 2024; v1 submitted 14 October, 2024; originally announced October 2024.
<li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.10551</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Preserving Cardiac Integrity: A Topology-Infused Approach to Whole Heart Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+C">Chenyu Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Guan%2C+W">Wenxue Guan</a>, <a href="/search/cs?searchtype=author&amp;query=Xing%2C+X">Xiaodan Xing</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis 
has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.10551v3-abstract-short" style="display: inline;"> Whole heart segmentation (WHS) supports cardiovascular disease (CVD) diagnosis, disease monitoring, treatment planning, and prognosis. Deep learning has become the most widely used method for WHS applications in recent years. However, segmentation of whole-heart structures faces numerous challenges including heart shape variability during the cardiac cycle, clinical artifacts like motion and poor&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10551v3-abstract-full').style.display = 'inline'; document.getElementById('2410.10551v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.10551v3-abstract-full" style="display: none;"> Whole heart segmentation (WHS) supports cardiovascular disease (CVD) diagnosis, disease monitoring, treatment planning, and prognosis. Deep learning has become the most widely used method for WHS applications in recent years. However, segmentation of whole-heart structures faces numerous challenges including heart shape variability during the cardiac cycle, clinical artifacts like motion and poor contrast-to-noise ratio, domain shifts in multi-center data, and the distinct modalities of CT and MRI. To address these limitations and improve segmentation quality, this paper introduces a new topology-preserving module that is integrated into deep neural networks. The implementation achieves anatomically plausible segmentation by using learned topology-preserving fields, which are based entirely on 3D convolution and are therefore very effective for 3D voxel data. We incorporate natural constraints between structures into the end-to-end training and enrich the feature representation of the neural network. 
The effectiveness of the proposed method is validated on an open-source medical heart dataset, specifically using the WHS++ data. The results demonstrate that the architecture performs exceptionally well, achieving a Dice coefficient of 0.939 during testing. This indicates full topology preservation for individual structures and significantly outperforms other baselines in preserving the overall scene topology. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10551v3-abstract-full').style.display = 'none'; document.getElementById('2410.10551v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 14 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.08473</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Deeper Insights into Deep Graph Convolutional Networks: Stability and Generalization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guangrui Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+M">Ming Li</a>, <a href="/search/cs?searchtype=author&amp;query=Feng%2C+H">Han Feng</a>, <a 
href="/search/cs?searchtype=author&amp;query=Zhuang%2C+X">Xiaosheng Zhuang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.08473v1-abstract-short" style="display: inline;"> Graph convolutional networks (GCNs) have emerged as powerful models for graph learning tasks, exhibiting promising performance in various domains. While their empirical success is evident, there is a growing need to understand their essential ability from a theoretical perspective. Existing theoretical research has primarily focused on the analysis of single-layer GCNs, while a comprehensive theor&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.08473v1-abstract-full').style.display = 'inline'; document.getElementById('2410.08473v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.08473v1-abstract-full" style="display: none;"> Graph convolutional networks (GCNs) have emerged as powerful models for graph learning tasks, exhibiting promising performance in various domains. While their empirical success is evident, there is a growing need to understand their essential ability from a theoretical perspective. Existing theoretical research has primarily focused on the analysis of single-layer GCNs, while a comprehensive theoretical exploration of the stability and generalization of deep GCNs remains limited. In this paper, we bridge this gap by delving into the stability and generalization properties of deep GCNs, aiming to provide valuable insights by characterizing rigorously the associated upper bounds. Our theoretical results reveal that the stability and generalization of deep GCNs are influenced by certain key factors, such as the maximum absolute eigenvalue of the graph filter operators and the depth of the network. 
Our theoretical studies contribute to a deeper understanding of the stability and generalization properties of deep GCNs, potentially paving the way for developing more reliable and well-performing models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.08473v1-abstract-full').style.display = 'none'; document.getElementById('2410.08473v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">44 pages, 3 figures, submitted to IEEE Trans. Pattern Anal. Mach. Intell. on 18-Jun-2024, under review</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.02786</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1145/3680528.3687682 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Robust Symmetry Detection via Riemannian Langevin Dynamics </p> <p class="authors"> <span class="search-hit">Authors:</span> <a 
href="/search/cs?searchtype=author&amp;query=Je%2C+J">Jihyeon Je</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+J">Jiayi Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guandao Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Deng%2C+B">Boyang Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+S">Shengqu Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Wetzstein%2C+G">Gordon Wetzstein</a>, <a href="/search/cs?searchtype=author&amp;query=Litany%2C+O">Or Litany</a>, <a href="/search/cs?searchtype=author&amp;query=Guibas%2C+L">Leonidas Guibas</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.02786v1-abstract-short" style="display: inline;"> Symmetries are ubiquitous across all kinds of objects, whether in nature or in man-made creations. While these symmetries may seem intuitive to the human eye, detecting them with a machine is nontrivial due to the vast search space. Classical geometry-based methods work by aggregating &#34;votes&#34; for each symmetry but struggle with noise. In contrast, learning-based methods may be more robust to noise&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02786v1-abstract-full').style.display = 'inline'; document.getElementById('2410.02786v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.02786v1-abstract-full" style="display: none;"> Symmetries are ubiquitous across all kinds of objects, whether in nature or in man-made creations. While these symmetries may seem intuitive to the human eye, detecting them with a machine is nontrivial due to the vast search space. Classical geometry-based methods work by aggregating &#34;votes&#34; for each symmetry but struggle with noise. 
In contrast, learning-based methods may be more robust to noise, but often overlook partial symmetries due to the scarcity of annotated data. In this work, we address this challenge by proposing a novel symmetry detection method that marries classical symmetry detection techniques with recent advances in generative modeling. Specifically, we apply Langevin dynamics to a redefined symmetry space to enhance robustness against noise. We provide empirical results on a variety of shapes that suggest our method is not only robust to noise, but can also identify both partial and global symmetries. Moreover, we demonstrate the utility of our detected symmetries in various downstream tasks, such as compression and symmetrization of noisy shapes. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02786v1-abstract-full').style.display = 'none'; document.getElementById('2410.02786v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.02113</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> </div> </div> <p class="title is-5 mathjax"> Mamba Neural Operator: Who Wins? Transformers vs. 
State-Space Models for PDEs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Cheng%2C+C">Chun-Wun Cheng</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+J">Jiahao Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Yi Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Sch%C3%B6nlieb%2C+C">Carola-Bibiane Schönlieb</a>, <a href="/search/cs?searchtype=author&amp;query=Aviles-Rivero%2C+A+I">Angelica I Aviles-Rivero</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.02113v1-abstract-short" style="display: inline;"> Partial differential equations (PDEs) are widely used to model complex physical systems, but solving them efficiently remains a significant challenge. Recently, Transformers have emerged as the preferred architecture for PDEs due to their ability to capture intricate dependencies. However, they struggle with representing continuous dynamics and long-range interactions. To overcome these limitation&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02113v1-abstract-full').style.display = 'inline'; document.getElementById('2410.02113v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.02113v1-abstract-full" style="display: none;"> Partial differential equations (PDEs) are widely used to model complex physical systems, but solving them efficiently remains a significant challenge. Recently, Transformers have emerged as the preferred architecture for PDEs due to their ability to capture intricate dependencies. However, they struggle with representing continuous dynamics and long-range interactions. 
To overcome these limitations, we introduce the Mamba Neural Operator (MNO), a novel framework that enhances neural operator-based techniques for solving PDEs. MNO establishes a formal theoretical connection between structured state-space models (SSMs) and neural operators, offering a unified structure that can adapt to diverse architectures, including Transformer-based models. By leveraging the structured design of SSMs, MNO captures long-range dependencies and continuous dynamics more effectively than traditional Transformers. Through extensive analysis, we show that MNO significantly boosts the expressive power and accuracy of neural operators, making it not just a complement but a superior framework for PDE-related tasks, bridging the gap between efficient representation and accurate solution approximation. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02113v1-abstract-full').style.display = 'none'; document.getElementById('2410.02113v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.20563</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> DressRecon: Freeform 4D Human Reconstruction from Monocular Video </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Tan%2C+J">Jeff Tan</a>, <a href="/search/cs?searchtype=author&amp;query=Xiang%2C+D">Donglai Xiang</a>, <a href="/search/cs?searchtype=author&amp;query=Tulsiani%2C+S">Shubham Tulsiani</a>, <a href="/search/cs?searchtype=author&amp;query=Ramanan%2C+D">Deva Ramanan</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Gengshan Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.20563v2-abstract-short" style="display: inline;"> We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans which are costly to collect at scale. 
Our key insight for high-q&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.20563v2-abstract-full').style.display = 'inline'; document.getElementById('2409.20563v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.20563v2-abstract-full" style="display: none;"> We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic human priors about articulated body shape (learned from large-scale training data) with video-specific articulated &#34;bag-of-bones&#34; deformation (fit to a single video via test-time optimization). We accomplish this by learning a neural implicit model that disentangles body versus clothing deformations as separate motion model layers. To capture subtle geometry of clothing, we leverage image-based priors such as human body pose, surface normals, and optical flow during optimization. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering. On datasets with highly challenging clothing deformations and object interactions, DressRecon yields higher-fidelity 3D reconstructions than prior art. 
Project page: <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.20563v2-abstract-full').style.display = 'none'; document.getElementById('2409.20563v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 30 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Project page:</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.18333</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Neurons and Cognition">q-bio.NC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> A Framework for Standardizing Similarity Measures in a Rapidly Evolving Field </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Cloos%2C+N">Nathan Cloos</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G+R">Guangyu Robert Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Cueva%2C+C+J">Christopher J. Cueva</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.18333v1-abstract-short" style="display: inline;"> Similarity measures are fundamental tools for quantifying the alignment between artificial and biological systems. 
However, the diversity of similarity measures and their varied naming and implementation conventions makes it challenging to compare across studies. To facilitate comparisons and make explicit the implementation choices underlying a given code package, we have created and are continui&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.18333v1-abstract-full').style.display = 'inline'; document.getElementById('2409.18333v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.18333v1-abstract-full" style="display: none;"> Similarity measures are fundamental tools for quantifying the alignment between artificial and biological systems. However, the diversity of similarity measures and their varied naming and implementation conventions makes it challenging to compare across studies. To facilitate comparisons and make explicit the implementation choices underlying a given code package, we have created and are continuing to develop a Python repository that benchmarks and standardizes similarity measures. The goal of creating a consistent naming convention that uniquely and efficiently specifies a similarity measure is not trivial as, for example, even commonly used methods like Centered Kernel Alignment (CKA) have at least 12 different variations, and this number will likely continue to grow as the field evolves. For this reason, we do not advocate for a fixed, definitive naming convention. The landscape of similarity measures and best practices will continue to change and so we see our current repository, which incorporates approximately 100 different similarity measures from 14 packages, as providing a useful tool at this snapshot in time. 
To accommodate the evolution of the field we present a framework for developing, validating, and refining naming conventions with the goal of uniquely and efficiently specifying similarity measures, ultimately making it easier for the community to make comparisons across studies. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.18333v1-abstract-full').style.display = 'none'; document.getElementById('2409.18333v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">11 pages, 9 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.17725</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> Stable Object Placement Under Geometric Uncertainty via Differentiable Contact Dynamics </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+L">Linfeng Li</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Gang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Shao%2C+L">Lin Shao</a>, <a href="/search/cs?searchtype=author&amp;query=Hsu%2C+D">David Hsu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.17725v1-abstract-short" style="display: inline;"> From serving a cup 
of coffee to carefully rearranging delicate items, stable object placement is a crucial skill for future robots. This skill is challenging due to the required accuracy, which is difficult to achieve under geometric uncertainty. We leverage differentiable contact dynamics to develop a principled method for stable object placement under geometric uncertainty. We estimate the geome&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.17725v1-abstract-full').style.display = 'inline'; document.getElementById('2409.17725v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.17725v1-abstract-full" style="display: none;"> From serving a cup of coffee to carefully rearranging delicate items, stable object placement is a crucial skill for future robots. This skill is challenging due to the required accuracy, which is difficult to achieve under geometric uncertainty. We leverage differentiable contact dynamics to develop a principled method for stable object placement under geometric uncertainty. We estimate the geometric uncertainty by minimizing the discrepancy between the force-torque sensor readings and the model predictions through gradient descent. We further keep track of a belief over multiple possible geometric parameters to mitigate the gradient-based method&#39;s sensitivity to the initialization. We verify our approach in the real world on various geometric uncertainties, including the in-hand pose uncertainty of the grasped object, the object&#39;s shape uncertainty, and the environment&#39;s shape uncertainty. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.17725v1-abstract-full').style.display = 'none'; document.getElementById('2409.17725v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.17431</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> On Extending Direct Preference Optimization to Accommodate Ties </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jinghong Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guangyu Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Lin%2C+W">Weizhe Lin</a>, <a href="/search/cs?searchtype=author&amp;query=Mei%2C+J">Jingbiao Mei</a>, <a href="/search/cs?searchtype=author&amp;query=Byrne%2C+B">Bill Byrne</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.17431v1-abstract-short" style="display: inline;"> We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. 
Our experiments in neural machine translation and summarization show that explicitly l&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.17431v1-abstract-full').style.display = 'inline'; document.getElementById('2409.17431v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.17431v1-abstract-full" style="display: none;"> We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization as opposed to simply discarding them. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.17431v1-abstract-full').style.display = 'none'; document.getElementById('2409.17431v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">24 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.16803</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+R">Ruoyu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Niu%2C+S">Shutong Niu</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Gaobin Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+J">Jun Du</a>, <a href="/search/cs?searchtype=author&amp;query=Qian%2C+S">Shuangqing Qian</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+T">Tian Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Pan%2C+J">Jia Pan</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.16803v1-abstract-short" style="display: inline;"> Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. 
This paper proposes a three-stage modular system&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16803v1-abstract-full').style.display = 'inline'; document.getElementById('2409.16803v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.16803v1-abstract-full" style="display: none;"> Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. This paper proposes a three-stage modular system to enhance single-channel neural speaker diarization systems and recognition performance by utilizing spatial cues from multi-channel speech to provide more accurate initialization for each stage of neural speaker diarization (NSD) decoding: (1) Overlap detection and continuous speech separation (CSS) on multi-channel speech are used to obtain cleaner single-speaker speech segments for clustering, followed by the first NSD decoding pass. (2) The results from the first pass initialize a complex Angular Central Gaussian Mixture Model (cACGMM) to estimate speaker-wise masks on multi-channel speech, and through Overlap-add and Mask-to-VAD, achieve initialization with lower speaker error (SpkErr), followed by the second NSD decoding pass. (3) The second decoding results are used for guided source separation (GSS), recognizing and filtering short segments containing less than one word to obtain cleaner speech segments, followed by re-clustering and the final NSD decoding pass. 
We present the progressively explored evaluation results from the CHiME-8 NOTSOFAR-1 (Natural Office Talkers in Settings Of Far-field Audio Recordings) challenge, demonstrating the effectiveness of our system and its contribution to improving recognition performance. Our final system achieved first place in the challenge. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16803v1-abstract-full').style.display = 'none'; document.getElementById('2409.16803v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">5 pages, Submitted to ICASSP 2025</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.16183</a> <span>&nbsp;[<a href="">pdf</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xiaohong Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guoxing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+Y">Yulin Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Mao%2C+J">Jiaji Mao</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xiang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+M">Ming Gao</a>, <a 
href="/search/cs?searchtype=author&amp;query=Zhang%2C+S">Shanghang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+J">Jun Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+G">Guangyu Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.16183v1-abstract-short" style="display: inline;"> Radiology is a vital and complex component of modern clinical workflow and covers many tasks. Recently, vision-language (VL) foundation models in medicine have shown potential in processing multimodal information, offering a unified solution for various radiology tasks. However, existing studies either pre-trained VL models on natural data or did not fully integrate vision-language architecture an&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16183v1-abstract-full').style.display = 'inline'; document.getElementById('2409.16183v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.16183v1-abstract-full" style="display: none;"> Radiology is a vital and complex component of modern clinical workflow and covers many tasks. Recently, vision-language (VL) foundation models in medicine have shown potential in processing multimodal information, offering a unified solution for various radiology tasks. However, existing studies either pre-trained VL models on natural data or did not fully integrate vision-language architecture and pretraining, often neglecting the unique multimodal complexity in radiology images and their textual contexts. Additionally, their practical applicability in real-world scenarios remains underexplored. 
Here, we present RadFound, a large and open-source vision-language foundation model tailored for radiology, that is trained on the most extensive dataset of over 8.1 million images and 250,000 image-text pairs, covering 19 major organ systems and 10 imaging modalities. To establish expert-level multimodal perception and generation capabilities, RadFound introduces an enhanced vision encoder to capture intra-image local features and inter-image contextual information, and a unified cross-modal learning design tailored to radiology. To fully assess the models&#39; capability, we construct a benchmark, RadVLBench, including radiology interpretation tasks like medical vision-language question-answering, as well as text generation tasks ranging from captioning to report generation. We also propose a human evaluation framework. When evaluated on the real-world benchmark involving three representative modalities, 2D images (chest X-rays), multi-view images (mammograms), and 3D images (thyroid CT scans), RadFound significantly outperforms other VL foundation models on both quantitative metrics and human evaluation. In summary, the development of RadFound represents an advancement in radiology generalists, demonstrating broad applicability potential for integration into clinical workflows. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.16183v1-abstract-full').style.display = 'none'; document.getElementById('2409.16183v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.15394</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1145/3641519.3657395 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Neural Control Variates with Automatic Integration </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+Z">Zilu Li</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guandao Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+Q">Qingqing Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Deng%2C+X">Xi Deng</a>, <a href="/search/cs?searchtype=author&amp;query=Guibas%2C+L">Leonidas Guibas</a>, <a href="/search/cs?searchtype=author&amp;query=Hariharan%2C+B">Bharath Hariharan</a>, <a href="/search/cs?searchtype=author&amp;query=Wetzstein%2C+G">Gordon Wetzstein</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.15394v1-abstract-short" style="display: inline;"> This paper presents a method to leverage arbitrary neural network architecture for control variates. 
Control variates are crucial in reducing the variance of Monte Carlo integration, but they hinge on finding a function that both correlates with the integrand and has a known analytical integral. Traditional approaches rely on heuristics to choose this function, which might not be expressive enough&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.15394v1-abstract-full').style.display = 'inline'; document.getElementById('2409.15394v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.15394v1-abstract-full" style="display: none;"> This paper presents a method to leverage arbitrary neural network architecture for control variates. Control variates are crucial in reducing the variance of Monte Carlo integration, but they hinge on finding a function that both correlates with the integrand and has a known analytical integral. Traditional approaches rely on heuristics to choose this function, which might not be expressive enough to correlate well with the integrand. Recent research alleviates this issue by modeling the integrands with a learnable parametric model, such as a neural network. However, the challenge remains in creating an expressive parametric model with a known analytical integral. This paper proposes a novel approach to construct learnable parametric control variates functions from arbitrary neural network architectures. Instead of using a network to approximate the integrand directly, we employ the network to approximate the anti-derivative of the integrand. This allows us to use automatic differentiation to create a function whose integration can be constructed by the antiderivative network. We apply our method to solve partial differential equations using the Walk-on-sphere algorithm. 
Our results indicate that this approach is unbiased and uses various network architectures to achieve lower variance than other control variate methods. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.15394v1-abstract-full').style.display = 'none'; document.getElementById('2409.15394v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> SIGGRAPH Conference Papers 2024 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.10289</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> ReflectDiffu:Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yuan%2C+J">Jiahao Yuan</a>, <a href="/search/cs?searchtype=author&amp;query=Di%2C+Z">Zixiang Di</a>, <a href="/search/cs?searchtype=author&amp;query=Cui%2C+Z">Zhiqing Cui</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guisong Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Naseem%2C+U">Usman Naseem</a> </p> <p class="abstract mathjax"> <span 
class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.10289v2-abstract-short" style="display: inline;"> Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiff&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10289v2-abstract-full').style.display = 'inline'; document.getElementById('2409.10289v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.10289v2-abstract-full" style="display: none;"> Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiffu, a lightweight and comprehensive framework for empathetic response generation. This framework incorporates emotion contagion to augment emotional expressiveness and employs an emotion-reasoning mask to pinpoint critical emotional elements. Additionally, it integrates intent mimicry within reinforcement learning for refinement during diffusion. By harnessing an intent twice reflect the mechanism of Exploring-Sampling-Correcting, ReflectDiffu adeptly translates emotional decision-making into precise intent actions, thereby addressing empathetic response misalignments stemming from emotional misrecognition. 
Through reflection, the framework maps emotional states to intents, markedly enhancing both response empathy and flexibility. Comprehensive experiments reveal that ReflectDiffu outperforms existing models regarding relevance, controllability, and informativeness, achieving state-of-the-art results in both automatic and human evaluations. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.10289v2-abstract-full').style.display = 'none'; document.getElementById('2409.10289v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.09727</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> From Challenges and Pitfalls to Recommendations and Opportunities: Implementing Federated Learning in Healthcare </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Li%2C+M">Ming Li</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+P">Pengcheng Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Hu%2C+J">Junjie Hu</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+Z">Zeyu Tang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a> </p> <p class="abstract mathjax"> <span 
class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.09727v1-abstract-short" style="display: inline;"> Federated learning holds great potential for enabling large-scale healthcare research and collaboration across multiple centres while ensuring data privacy and security are not compromised. Although numerous recent studies suggest or utilize federated learning based methods in healthcare, it remains unclear which ones have potential clinical utility. This review paper considers and analyzes the mo&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09727v1-abstract-full').style.display = 'inline'; document.getElementById('2409.09727v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.09727v1-abstract-full" style="display: none;"> Federated learning holds great potential for enabling large-scale healthcare research and collaboration across multiple centres while ensuring data privacy and security are not compromised. Although numerous recent studies suggest or utilize federated learning based methods in healthcare, it remains unclear which ones have potential clinical utility. This review paper considers and analyzes the most recent studies up to May 2024 that describe federated learning based methods in healthcare. After a thorough review, we find that the vast majority are not appropriate for clinical use due to their methodological flaws and/or underlying biases which include but are not limited to privacy concerns, generalization issues, and communication costs. As a result, the effectiveness of federated learning in healthcare is significantly compromised. 
To overcome these challenges, we provide recommendations and promising opportunities that might be implemented to resolve these problems and improve the quality of model development in federated learning with healthcare. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09727v1-abstract-full').style.display = 'none'; document.getElementById('2409.09727v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.09646</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> A Simple HMM with Self-Supervised Representations for Phone Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Gene-Ping Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Tang%2C+H">Hao Tang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.09646v2-abstract-short" style="display: inline;"> Despite the recent advance in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, with the hope that the improvement can transfer to phonetic segmentation. 
In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09646v2-abstract-full').style.display = 'inline'; document.getElementById('2409.09646v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.09646v2-abstract-full" style="display: none;"> Despite the recent advance in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, with the hope that the improvement can transfer to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation allowing versatile design adaptations. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09646v2-abstract-full').style.display = 'none'; document.getElementById('2409.09646v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to SLT 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.09291</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Infrared and Visible Image Fusion with Hierarchical Human Perception </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+J">Jie Li</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhong%2C+Z">Zhusi Zhong</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+X">Xinbo Gao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.09291v1-abstract-short" style="display: inline;"> Image fusion combines images from multiple domains into one image, containing complementary information from source domains. Existing methods take pixel intensity, texture and high-level vision task information as the standards to determine preservation of information, lacking enhancement for human perception. 
We introduce an image fusion method, Hierarchical Perception Fusion (HPFusion), which le&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09291v1-abstract-full').style.display = 'inline'; document.getElementById('2409.09291v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.09291v1-abstract-full" style="display: none;"> Image fusion combines images from multiple domains into one image, containing complementary information from source domains. Existing methods take pixel intensity, texture and high-level vision task information as the standards to determine preservation of information, lacking enhancement for human perception. We introduce an image fusion method, Hierarchical Perception Fusion (HPFusion), which leverages a Large Vision-Language Model to incorporate hierarchical human semantic priors, preserving complementary information that satisfies the human visual system. We propose multiple questions that humans focus on when viewing an image pair, and answers are generated via the Large Vision-Language Model according to the images. The texts of the answers are encoded into the fusion network, and the optimization also aims to guide the human semantic distribution of the fused image to be more similar to that of the source images, exploring complementary information within the human perception domain. Extensive experiments demonstrate that our HPFusion can achieve high-quality fusion results for both information preservation and human visual enhancement. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.09291v1-abstract-full').style.display = 'none'; document.getElementById('2409.09291v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.04356</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Serp-Mamba: Advancing High-Resolution Retinal Vessel Segmentation with Selective State-Space Model </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Hongqiu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+Y">Yixian Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+W">Wu Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+H">Huihui Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhao%2C+H">Haoyu Zhao</a>, <a href="/search/cs?searchtype=author&amp;query=Sheng%2C+B">Bin Sheng</a>, <a href="/search/cs?searchtype=author&amp;query=Fu%2C+H">Huazhu Fu</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhu%2C+L">Lei Zhu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.04356v2-abstract-short" style="display: inline;"> Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) images 
capture high-resolution views of the retina, typically spanning 200 degrees. Accurate segmentation of vessels in UWF-SLO images is essential for detecting and diagnosing fundus disease. Recent studies have revealed that the selective State Space Model (SSM) in Mamba performs well in modeling long-range dependencies, which is cruci&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04356v2-abstract-full').style.display = 'inline'; document.getElementById('2409.04356v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.04356v2-abstract-full" style="display: none;"> Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) images capture high-resolution views of the retina, typically spanning 200 degrees. Accurate segmentation of vessels in UWF-SLO images is essential for detecting and diagnosing fundus disease. Recent studies have revealed that the selective State Space Model (SSM) in Mamba performs well in modeling long-range dependencies, which is crucial for capturing the continuity of elongated vessel structures. Inspired by this, we propose the first Serpentine Mamba (Serp-Mamba) network to address this challenging task. Specifically, we recognize the intricate, varied, and delicate nature of the tubular structure of vessels. Furthermore, the high resolution of UWF-SLO images exacerbates the imbalance between the vessel and background categories. Based on the above observations, we first devise a Serpentine Interwoven Adaptive (SIA) scan mechanism, which scans UWF-SLO images along curved vessel structures in a snake-like crawling manner. This approach, consistent with vascular texture transformations, ensures the effective and continuous capture of curved vascular structure features. 
Second, we propose an Ambiguity-Driven Dual Recalibration (ADDR) module to address the category imbalance problem intensified by high-resolution images. Our ADDR module delineates pixels by two learnable thresholds and refines ambiguous pixels through a dual-driven strategy, thereby accurately distinguishing vessels and background regions. Experiment results on three datasets demonstrate the superior performance of our Serp-Mamba on high-resolution vessel segmentation. We also conduct a series of ablation studies to verify the impact of our designs. Our code shall be released upon publication of this work. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04356v2-abstract-full').style.display = 'none'; document.getElementById('2409.04356v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 6 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.03087</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Coupling AI and Citizen Science in Creation of Enhanced Training Dataset for Medical Image Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Syahmi%2C+A">Amir Syahmi</a>, <a href="/search/cs?searchtype=author&amp;query=Lu%2C+X">Xiangrong Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Y">Yinxuan Li</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+H">Haoxuan Yao</a>, <a href="/search/cs?searchtype=author&amp;query=Jiang%2C+H">Hanjun Jiang</a>, <a href="/search/cs?searchtype=author&amp;query=Acharya%2C+I">Ishita Acharya</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+S">Shiyi Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Nan%2C+Y">Yang Nan</a>, <a href="/search/cs?searchtype=author&amp;query=Xing%2C+X">Xiaodan Xing</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.03087v1-abstract-short" style="display: inline;"> Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. 
The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, w&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.03087v1-abstract-full').style.display = 'inline'; document.getElementById('2409.03087v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.03087v1-abstract-full" style="display: none;"> Recent advancements in medical imaging and artificial intelligence (AI) have greatly enhanced diagnostic capabilities, but the development of effective deep learning (DL) models is still constrained by the lack of high-quality annotated datasets. The traditional manual annotation process by medical experts is time- and resource-intensive, limiting the scalability of these datasets. In this work, we introduce a robust and versatile framework that combines AI and crowdsourcing to improve both the quality and quantity of medical image datasets across different modalities. Our approach utilises a user-friendly online platform that enables a diverse group of crowd annotators to label medical images efficiently. By integrating the MedSAM segmentation AI with this platform, we accelerate the annotation process while maintaining expert-level quality through an algorithm that merges crowd-labelled images. Additionally, we employ pix2pixGAN, a generative AI model, to expand the training dataset with synthetic images that capture realistic morphological features. These methods are combined into a cohesive framework designed to produce an enhanced dataset, which can serve as a universal pre-processing pipeline to boost the training of any medical deep learning segmentation model. Our results demonstrate that this framework significantly improves model performance, especially when training data is limited. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.03087v1-abstract-full').style.display = 'none'; document.getElementById('2409.03087v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.02070</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Explicit Differentiable Slicing and Global Deformation for Cardiac Mesh Reconstruction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Luo%2C+Y">Yihao Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Sesia%2C+D">Dario Sesia</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+F">Fanwen Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+Y">Yinzhe Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Ding%2C+W">Wenhao Ding</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+J">Jiahao Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+F">Fadong Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Shah%2C+A">Anoop Shah</a>, <a href="/search/cs?searchtype=author&amp;query=Kaural%2C+A">Amit Kaural</a>, <a href="/search/cs?searchtype=author&amp;query=Mayet%2C+J">Jamil Mayet</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a>, <a 
href="/search/cs?searchtype=author&amp;query=Yap%2C+C">ChoonHwai Yap</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.02070v2-abstract-short" style="display: inline;"> Mesh reconstruction of the cardiac anatomy from medical images is useful for shape and motion measurements and biophysics simulations to facilitate the assessment of cardiac function and health. However, 3D medical images are often acquired as 2D slices that are sparsely sampled and noisy, and mesh reconstruction on such data is a challenging task. Traditional voxel-based approaches rely on pre- a&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02070v2-abstract-full').style.display = 'inline'; document.getElementById('2409.02070v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.02070v2-abstract-full" style="display: none;"> Mesh reconstruction of the cardiac anatomy from medical images is useful for shape and motion measurements and biophysics simulations to facilitate the assessment of cardiac function and health. However, 3D medical images are often acquired as 2D slices that are sparsely sampled and noisy, and mesh reconstruction on such data is a challenging task. Traditional voxel-based approaches rely on pre- and post-processing that compromises image fidelity, while mesh-level deep learning approaches require mesh annotations that are difficult to get. Therefore, direct cross-domain supervision from 2D images to meshes is a key technique for advancing 3D learning in medical imaging, but it has not been well-developed. While there have been attempts to approximate the optimized meshes&#39; slicing, few existing methods directly use 2D slices to supervise mesh reconstruction in a differentiable manner. 
Here, we propose a novel explicit differentiable voxelization and slicing (DVS) algorithm that allows gradient backpropagation to a mesh from its slices, facilitating refined mesh optimization directly supervised by the losses defined on 2D images. Further, we propose an innovative framework for extracting patient-specific left ventricle (LV) meshes from medical images by coupling DVS with a graph harmonic deformation (GHD) mesh morphing descriptor of cardiac shape that naturally preserves mesh quality and smoothness during optimization. Experimental results demonstrate that our method achieves state-of-the-art performance in cardiac mesh reconstruction tasks from CT and MRI, with an overall Dice score of 90% on multi-datasets, outperforming existing approaches. The proposed method can further quantify clinically useful parameters such as ejection fraction and global myocardial strains, closely matching the ground truth and surpassing the traditional voxel-based approach in sparse images. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02070v2-abstract-full').style.display = 'none'; document.getElementById('2409.02070v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 3 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.02041</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Niu%2C+S">Shutong Niu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+R">Ruoyu Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Du%2C+J">Jun Du</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Gaobin Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Tu%2C+Y">Yanhui Tu</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+S">Siyuan Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Qian%2C+S">Shuangqing Qian</a>, <a href="/search/cs?searchtype=author&amp;query=Wu%2C+H">Huaxin Wu</a>, <a href="/search/cs?searchtype=author&amp;query=Xu%2C+H">Haitao Xu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xueyang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhong%2C+G">Guolong Zhong</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+X">Xindi Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jieru Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+M">Mengzhi Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Cai%2C+D">Di Cai</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+T">Tian Gao</a>, <a href="/search/cs?searchtype=author&amp;query=Wan%2C+G">Genshun Wan</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+F">Feng Ma</a>, <a 
href="/search/cs?searchtype=author&amp;query=Pan%2C+J">Jia Pan</a>, <a href="/search/cs?searchtype=author&amp;query=Gao%2C+J">Jianqing Gao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.02041v2-abstract-short" style="display: inline;"> This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several a&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02041v2-abstract-full').style.display = 'inline'; document.getElementById('2409.02041v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.02041v2-abstract-full" style="display: none;"> This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects. For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. We also integrated traditional guided source separation (GSS) for the multi-channel track to provide complementary information for the JDS. 
For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation, to significantly advance ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% and 22.989% on the CHiME-8 NOTSOFAR-1 Dev-set-2 multi-channel and single-channel tracks, respectively. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02041v2-abstract-full').style.display = 'none'; document.getElementById('2409.02041v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 3 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.01571</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> CT-SDM: A Sampling Diffusion Model for Sparse-View CT Reconstruction across All Sampling Rates </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+L">Liutao Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+J">Jiahao Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+D">Daoqiang Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short 
has-text-grey-dark mathjax" id="2409.01571v1-abstract-short" style="display: inline;"> Sparse-view X-ray computed tomography has emerged as a contemporary technique to mitigate radiation dose. Because of the reduced number of projection views, traditional reconstruction methods can lead to severe artifacts. Recently, research studies utilizing deep learning methods have made promising progress in removing artifacts for Sparse-View Computed Tomography (SVCT). However, given the limit&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01571v1-abstract-full').style.display = 'inline'; document.getElementById('2409.01571v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.01571v1-abstract-full" style="display: none;"> Sparse-view X-ray computed tomography has emerged as a contemporary technique to mitigate radiation dose. Because of the reduced number of projection views, traditional reconstruction methods can lead to severe artifacts. Recently, research studies utilizing deep learning methods have made promising progress in removing artifacts for Sparse-View Computed Tomography (SVCT). However, given the limitations on the generalization capability of deep learning models, current methods usually train models on fixed sampling rates, affecting the usability and flexibility of model deployment in real clinical settings. To address this issue, our study proposes an adaptive reconstruction method to achieve high-performance SVCT reconstruction at any sampling rate. Specifically, we design a novel imaging degradation operator in the proposed sampling diffusion model for SVCT (CT-SDM) to simulate the projection process in the sinogram domain. Thus, the CT-SDM can gradually add projection views to highly undersampled measurements to generalize the full-view sinograms. 
By choosing an appropriate starting point in diffusion inference, the proposed model can recover the full-view sinograms from any sampling rate with only one trained model. Experiments on several datasets have verified the effectiveness and robustness of our approach, demonstrating its superiority in reconstructing high-quality images from sparse-view CT scans across various sampling rates. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01571v1-abstract-full').style.display = 'none'; document.getElementById('2409.01571v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.01544</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Learning Task-Specific Sampling Strategy for Sparse-View CT Reconstruction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+L">Liutao Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+J">Jiahao Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Fang%2C+Y">Yingying Fang</a>, <a href="/search/cs?searchtype=author&amp;query=Aviles-Rivero%2C+A+I">Angelica I Aviles-Rivero</a>, <a href="/search/cs?searchtype=author&amp;query=Schonlieb%2C+C">Carola-Bibiane Schonlieb</a>, <a 
href="/search/cs?searchtype=author&amp;query=Zhang%2C+D">Daoqiang Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Guang Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.01544v1-abstract-short" style="display: inline;"> Sparse-View Computed Tomography (SVCT) offers low-dose and fast imaging but suffers from severe artifacts. Optimizing the sampling strategy is an essential approach to improving the imaging quality of SVCT. However, current methods typically optimize a universal sampling strategy for all types of scans, overlooking the fact that the optimal strategy may vary depending on the specific scanning task&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01544v1-abstract-full').style.display = 'inline'; document.getElementById('2409.01544v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.01544v1-abstract-full" style="display: none;"> Sparse-View Computed Tomography (SVCT) offers low-dose and fast imaging but suffers from severe artifacts. Optimizing the sampling strategy is an essential approach to improving the imaging quality of SVCT. However, current methods typically optimize a universal sampling strategy for all types of scans, overlooking the fact that the optimal strategy may vary depending on the specific scanning task, whether it involves particular body scans (e.g., chest CT scans) or downstream clinical applications (e.g., disease diagnosis). The optimal strategy for one scanning task may not perform as well when applied to other tasks. 
To address this problem, we propose a deep learning framework that learns task-specific sampling strategies with a multi-task approach to train a unified reconstruction network while tailoring optimal sampling strategies for each individual task. Thus, a task-specific sampling strategy can be applied for each type of scan to improve the quality of SVCT imaging and to further improve performance in downstream clinical use. Extensive experiments across different scanning types validate the effectiveness of task-specific sampling strategies in enhancing imaging quality. Experiments involving downstream tasks verify the clinical value of learned sampling strategies, as evidenced by notable improvements in downstream task performance. Furthermore, the utilization of a multi-task framework with a shared reconstruction network facilitates deployment on current imaging devices with switchable task-specific modules, and allows new tasks to be integrated easily without retraining the entire model. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01544v1-abstract-full').style.display = 'none'; document.getElementById('2409.01544v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.00947</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> XNet v2: Fewer Limitations, Better Results and Greater Universality </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yanfeng Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+L">Lingrui Li</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zichen Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+G">Guole Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Z">Ziwen Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+G">Ge Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.00947v1-abstract-short" style="display: inline;"> XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. So far, however, XNet still faces the limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images and insufficient fusion. 
To address these issues, we propose XNet v2, a low- and high-frequency complementary model.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.00947v1-abstract-full').style.display = 'inline'; document.getElementById('2409.00947v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.00947v1-abstract-full" style="display: none;"> XNet introduces a wavelet-based X-shaped unified architecture for fully- and semi-supervised biomedical segmentation. So far, however, XNet still faces limitations, including performance degradation when images lack high-frequency (HF) information, underutilization of raw images and insufficient fusion. To address these issues, we propose XNet v2, a low- and high-frequency complementary model. XNet v2 performs wavelet-based image-level complementary fusion, feeding the fusion results along with raw images into three different sub-networks to construct a consistency loss. Furthermore, we introduce a feature-level fusion module to enhance the transfer of low-frequency (LF) information and HF information. XNet v2 achieves state-of-the-art performance in semi-supervised segmentation while maintaining competitive results in fully-supervised learning. More importantly, XNet v2 excels in scenarios where XNet fails. Compared to XNet, XNet v2 exhibits fewer limitations, better results and greater universality. Extensive experiments on three 2D and two 3D datasets demonstrate the effectiveness of XNet v2. 