aria-label="Page 6" aria-current="page">6 </a> </li> </ul> </nav> <ol class="breathe-horizontal" start="1"> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.03292</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Human-Computer Interaction">cs.HC</span> </div> </div> <p class="title is-5 mathjax"> Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation? </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Xiao%2C+J">Jingyu Xiao</a>, <a href="/search/cs?searchtype=author&query=Wan%2C+Y">Yuxuan Wan</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yintong Huo</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+Z">Zhiyao Xu</a>, <a href="/search/cs?searchtype=author&query=Lyu%2C+M+R">Michael R. Lyu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.03292v1-abstract-short" style="display: inline;"> Converting webpage design into functional UI code is a critical step for building websites, which can be labor-intensive and time-consuming. To automate this design-to-code transformation process, various automated methods using learning-based networks and multi-modal large language models (MLLMs) have been proposed. 
However, these studies were evaluated only on a narrow range of static web page… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03292v1-abstract-full').style.display = 'inline'; document.getElementById('2411.03292v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.03292v1-abstract-full" style="display: none;"> Converting webpage design into functional UI code is a critical step for building websites, which can be labor-intensive and time-consuming. To automate this design-to-code transformation process, various automated methods using learning-based networks and multi-modal large language models (MLLMs) have been proposed. However, these studies were evaluated only on a narrow range of static web pages and ignored dynamic interaction elements, making them less practical for real-world website deployment. To fill this gap, we present the first systematic investigation of MLLMs in generating interactive webpages. Specifically, we first formulate the Interaction-to-Code task and build the Interaction2Code benchmark that contains 97 unique web pages and 213 distinct interactions, spanning 15 webpage types and 30 interaction categories. We then conduct comprehensive experiments on three state-of-the-art (SOTA) MLLMs using both automatic metrics and human evaluations, and summarize six findings. Our experimental results highlight the limitations of MLLMs in generating fine-grained interactive features and managing interactions with complex transformations and subtle visual modifications. We further analyze failure cases and their underlying causes, identifying 10 common failure types and assessing their severity. Additionally, our findings reveal three critical influencing factors, i.e., prompts, visual saliency, and textual descriptions, that can enhance the interaction generation performance of MLLMs. 
Based on these findings, we elicit implications for researchers and developers, providing a foundation for future advancements in this field. Datasets and source code are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.03292v1-abstract-full').style.display = 'none'; document.getElementById('2411.03292v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.00078</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> </div> </div> <p class="title is-5 mathjax"> How Good Are We? 
Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Guo%2C+J">Junlin Guo</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+S">Siqi Lu</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Tao%2C+Z">Zhewen Tao</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+Y">Yizhe Lin</a>, <a href="/search/cs?searchtype=author&query=Lionts%2C+M">Marilyn Lionts</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Xiong%2C+J">Juming Xiong</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Y">Yu Wang</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+S">Shilin Zhao</a>, <a href="/search/cs?searchtype=author&query=Chang%2C+C">Catie Chang</a>, <a href="/search/cs?searchtype=author&query=Wilkes%2C+M">Mitchell Wilkes</a>, <a href="/search/cs?searchtype=author&query=Yin%2C+M">Mengmeng Yin</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.00078v1-abstract-short" style="display: inline;"> Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges, including digital pathology. 
While many of these models have been developed for tasks like disease diagnosis and tissue quantification using extensive and diverse training datasets, their readiness for deployment on some of the arguably simplest tasks, such as nuclei seg… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00078v1-abstract-full').style.display = 'inline'; document.getElementById('2411.00078v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.00078v1-abstract-full" style="display: none;"> Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges, including digital pathology. While many of these models have been developed for tasks like disease diagnosis and tissue quantification using extensive and diverse training datasets, their readiness for deployment on some of the arguably simplest tasks, such as nuclei segmentation within a single organ (e.g., the kidney), remains uncertain. This paper seeks to answer this key question, "How good are we?", by thoroughly evaluating the performance of recent cell foundation models on a curated multi-center, multi-disease, and multi-species external testing dataset. Additionally, we tackle a more challenging question, "How can we improve?", by developing and assessing human-in-the-loop data enrichment strategies aimed at enhancing model performance while minimizing the reliance on pixel-level human annotation. To address the first question, we curated a multi-center, multi-disease, and multi-species dataset consisting of 2,542 kidney whole slide images (WSIs). Three state-of-the-art (SOTA) cell foundation models (Cellpose, StarDist, and CellViT) were selected for evaluation. 
To tackle the second question, we explored data enrichment algorithms by distilling predictions from the different foundation models with a human-in-the-loop framework, aiming to further enhance foundation model performance with minimal human effort. Our experimental results showed that all three foundation models improved over their baselines after fine-tuning with enriched data. Interestingly, the baseline model with the highest F1 score does not yield the best segmentation outcomes after fine-tuning. This study establishes a benchmark for the development and deployment of cell vision foundation models tailored for real-world data applications. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00078v1-abstract-full').style.display = 'none'; document.getElementById('2411.00078v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 31 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.22454</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Brain age identification from diffusion MRI synergistically predicts neurodegenerative disease </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Gao%2C+C">Chenyu Gao</a>, <a href="/search/cs?searchtype=author&query=Kim%2C+M+E">Michael E. 
Kim</a>, <a href="/search/cs?searchtype=author&query=Ramadass%2C+K">Karthik Ramadass</a>, <a href="/search/cs?searchtype=author&query=Kanakaraj%2C+P">Praitayini Kanakaraj</a>, <a href="/search/cs?searchtype=author&query=Krishnan%2C+A+R">Aravind R. Krishnan</a>, <a href="/search/cs?searchtype=author&query=Saunders%2C+A+M">Adam M. Saunders</a>, <a href="/search/cs?searchtype=author&query=Newlin%2C+N+R">Nancy R. Newlin</a>, <a href="/search/cs?searchtype=author&query=Lee%2C+H+H">Ho Hin Lee</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+Q">Qi Yang</a>, <a href="/search/cs?searchtype=author&query=Taylor%2C+W+D">Warren D. Taylor</a>, <a href="/search/cs?searchtype=author&query=Boyd%2C+B+D">Brian D. Boyd</a>, <a href="/search/cs?searchtype=author&query=Beason-Held%2C+L+L">Lori L. Beason-Held</a>, <a href="/search/cs?searchtype=author&query=Resnick%2C+S+M">Susan M. Resnick</a>, <a href="/search/cs?searchtype=author&query=Barnes%2C+L+L">Lisa L. Barnes</a>, <a href="/search/cs?searchtype=author&query=Bennett%2C+D+A">David A. Bennett</a>, <a href="/search/cs?searchtype=author&query=Van+Schaik%2C+K+D">Katherine D. Van Schaik</a>, <a href="/search/cs?searchtype=author&query=Archer%2C+D+B">Derek B. Archer</a>, <a href="/search/cs?searchtype=author&query=Hohman%2C+T+J">Timothy J. Hohman</a>, <a href="/search/cs?searchtype=author&query=Jefferson%2C+A+L">Angela L. Jefferson</a>, <a href="/search/cs?searchtype=author&query=I%C5%A1gum%2C+I">Ivana Išgum</a>, <a href="/search/cs?searchtype=author&query=Moyer%2C+D">Daniel Moyer</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a>, <a href="/search/cs?searchtype=author&query=Schilling%2C+K+G">Kurt G. Schilling</a>, <a href="/search/cs?searchtype=author&query=Zuo%2C+L">Lianrui Zuo</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+S">Shunxing Bao</a> , et al. 
(4 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.22454v1-abstract-short" style="display: inline;"> Estimated brain age from magnetic resonance image (MRI) and its deviation from chronological age can provide early insights into potential neurodegenerative diseases, supporting early detection and implementation of prevention strategies. Diffusion MRI (dMRI), a widely used modality for brain age estimation, presents an opportunity to build an earlier biomarker for neurodegenerative disease predic… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22454v1-abstract-full').style.display = 'inline'; document.getElementById('2410.22454v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.22454v1-abstract-full" style="display: none;"> Estimated brain age from magnetic resonance image (MRI) and its deviation from chronological age can provide early insights into potential neurodegenerative diseases, supporting early detection and implementation of prevention strategies. Diffusion MRI (dMRI), a widely used modality for brain age estimation, presents an opportunity to build an earlier biomarker for neurodegenerative disease prediction because it captures subtle microstructural changes that precede more perceptible macrostructural changes. However, the coexistence of macro- and micro-structural information in dMRI raises the question of whether current dMRI-based brain age estimation models are leveraging the intended microstructural information or if they inadvertently rely on the macrostructural information. 
To develop a microstructure-specific brain age, we propose a method for brain age identification from dMRI that minimizes the model's use of macrostructural information by non-rigidly registering all images to a standard template. Imaging data from 13,398 participants across 12 datasets were used for training and evaluation. We compare our brain age models, trained with and without macrostructural information minimized, with an architecturally similar T1-weighted (T1w) MRI-based brain age model and two state-of-the-art T1w MRI-based brain age models that primarily use macrostructural information. We observe differences between our dMRI-based brain age and T1w MRI-based brain age across stages of neurodegeneration, with dMRI-based brain age being older than T1w MRI-based brain age in participants transitioning from cognitively normal (CN) to mild cognitive impairment (MCI), but younger in participants already diagnosed with Alzheimer's disease (AD). Approximately 4 years before MCI diagnosis, dMRI-based brain age yields better performance than T1w MRI-based brain ages in predicting transition from CN to MCI. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.22454v1-abstract-full').style.display = 'none'; document.getElementById('2410.22454v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.18551</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> IMAN: An Adaptive Network for Robust NPC Mortality Prediction with Missing Modalities </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yejing Huo</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+G">Guoheng Huang</a>, <a href="/search/cs?searchtype=author&query=Cheng%2C+L">Lianglun Cheng</a>, <a href="/search/cs?searchtype=author&query=He%2C+J">Jianbin He</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+X">Xuhang Chen</a>, <a href="/search/cs?searchtype=author&query=Yuan%2C+X">Xiaochen Yuan</a>, <a href="/search/cs?searchtype=author&query=Zhong%2C+G">Guo Zhong</a>, <a href="/search/cs?searchtype=author&query=Pun%2C+C">Chi-Man Pun</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.18551v1-abstract-short" style="display: inline;"> Accurate prediction of mortality in nasopharyngeal carcinoma (NPC), a complex malignancy particularly challenging in advanced stages, is crucial for optimizing treatment strategies and improving patient outcomes. 
However, this predictive process is often compromised by the high-dimensional and heterogeneous nature of NPC-related data, coupled with the pervasive issue of incomplete multi-modal data… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18551v1-abstract-full').style.display = 'inline'; document.getElementById('2410.18551v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.18551v1-abstract-full" style="display: none;"> Accurate prediction of mortality in nasopharyngeal carcinoma (NPC), a complex malignancy particularly challenging in advanced stages, is crucial for optimizing treatment strategies and improving patient outcomes. However, this predictive process is often compromised by the high-dimensional and heterogeneous nature of NPC-related data, coupled with the pervasive issue of incomplete multi-modal data, manifesting as missing radiological images or incomplete diagnostic reports. Traditional machine learning approaches suffer significant performance degradation when faced with such incomplete data, as they fail to effectively handle the high dimensionality and intricate correlations across modalities. Even advanced multi-modal learning techniques like Transformers struggle to maintain robust performance in the presence of missing modalities, as they lack specialized mechanisms to adaptively integrate and align the diverse data types, while also capturing nuanced patterns and contextual relationships within the complex NPC data. To address these problems, we introduce IMAN: an adaptive network for robust NPC mortality prediction with missing modalities. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18551v1-abstract-full').style.display = 'none'; document.getElementById('2410.18551v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">The paper has been accepted by BIBM 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.18101</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Chemical Physics">physics.chem-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Molecular Dynamics and Machine Learning Unlock Possibilities in Beauty Design -- A Perspective </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Xu%2C+Y">Yuzhi Xu</a>, <a href="/search/cs?searchtype=author&query=Ni%2C+H">Haowei Ni</a>, <a href="/search/cs?searchtype=author&query=Gao%2C+Q">Qinhui Gao</a>, <a href="/search/cs?searchtype=author&query=Chang%2C+C">Chia-Hua Chang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yanran Huo</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+F">Fanyu Zhao</a>, <a href="/search/cs?searchtype=author&query=Hu%2C+S">Shiyu Hu</a>, <a href="/search/cs?searchtype=author&query=Xia%2C+W">Wei Xia</a>, <a 
href="/search/cs?searchtype=author&query=Zhang%2C+Y">Yike Zhang</a>, <a href="/search/cs?searchtype=author&query=Grovu%2C+R">Radu Grovu</a>, <a href="/search/cs?searchtype=author&query=He%2C+M">Min He</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+J+Z+H">John. Z. H. Zhang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Y">Yuanqing Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.18101v2-abstract-short" style="display: inline;"> Computational molecular design -- the endeavor to design molecules, with various missions, aided by machine learning and molecular dynamics approaches, has been widely applied to create valuable new molecular entities, from small molecule therapeutics to protein biologics. In the small data regime, physics-based approaches model the interaction between the molecule being designed and proteins of k… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18101v2-abstract-full').style.display = 'inline'; document.getElementById('2410.18101v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.18101v2-abstract-full" style="display: none;"> Computational molecular design -- the endeavor to design molecules, with various missions, aided by machine learning and molecular dynamics approaches, has been widely applied to create valuable new molecular entities, from small molecule therapeutics to protein biologics. In the small data regime, physics-based approaches model the interaction between the molecule being designed and proteins of key physiological functions, providing structural insights into the mechanism. 
When abundant data has been collected, a quantitative structure-activity relationship (QSAR) can be more directly constructed from experimental data, from which machine learning can distill key insights to guide the design of the next round of experiments. Machine learning methodologies can also facilitate physical modeling, from improving the accuracy of force fields and extending them to unseen chemical spaces, to more directly enhancing the sampling of conformational spaces. We argue that these techniques are mature enough to be applied not just to extend the longevity of life, but also to enhance the beauty it manifests. In this perspective, we review the current frontiers in the research and development of skin care products, as well as the statistical and physical toolbox applicable to addressing the challenges in this industry. Feasible interdisciplinary research projects are proposed to harness the power of machine learning tools to design innovative, effective, and inexpensive skin care products. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.18101v2-abstract-full').style.display = 'none'; document.getElementById('2410.18101v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 8 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.17855</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> TAGE: Trustworthy Attribute Group Editing for Stable Few-shot Image Generation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhang%2C+R">Ruicheng Zhang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+G">Guoheng Huang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yejing Huo</a>, <a href="/search/cs?searchtype=author&query=Yuan%2C+X">Xiaochen Yuan</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+Z">Zhizhen Zhou</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+X">Xuhang Chen</a>, <a href="/search/cs?searchtype=author&query=Zhong%2C+G">Guo Zhong</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.17855v1-abstract-short" style="display: inline;"> Generative Adversarial Networks (GANs) have emerged as a prominent research focus for image editing tasks, leveraging the powerful image generation capabilities of the GAN framework to produce remarkable results. However, prevailing approaches are contingent upon extensive training datasets and explicit supervision, presenting a significant challenge in manipulating the diverse attributes of new im… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17855v1-abstract-full').style.display = 'inline'; document.getElementById('2410.17855v1-abstract-short').style.display = 'none';">▽ 
More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.17855v1-abstract-full" style="display: none;"> Generative Adversarial Networks (GANs) have emerged as a prominent research focus for image editing tasks, leveraging the powerful image generation capabilities of the GAN framework to produce remarkable results. However, prevailing approaches are contingent upon extensive training datasets and explicit supervision, presenting a significant challenge in manipulating the diverse attributes of new image classes with limited sample availability. To surmount this hurdle, we introduce TAGE, an innovative image generation network comprising three integral modules: the Codebook Learning Module (CLM), the Code Prediction Module (CPM) and the Prompt-driven Semantic Module (PSM). The CPM module delves into the semantic dimensions of category-agnostic attributes, encapsulating them within a discrete codebook. This module is predicated on the concept that images are assemblages of attributes, and thus, by editing these category-independent attributes, it is theoretically possible to generate images from unseen categories. Subsequently, the CPM module facilitates naturalistic image editing by predicting indices of category-independent attribute vectors within the codebook. Additionally, the PSM module generates semantic cues that are seamlessly integrated into the Transformer architecture of the CPM, enhancing the model's comprehension of the targeted attributes for editing. With these semantic cues, the model can generate images that accentuate desired attributes more prominently while maintaining the integrity of the original category, even with a limited number of samples. We have conducted extensive experiments utilizing the Animal Faces, Flowers, and VGGFaces datasets. 
The results of these experiments demonstrate that our proposed method not only achieves superior performance but also exhibits a high degree of stability when compared to other few-shot image generation techniques. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.17855v1-abstract-full').style.display = 'none'; document.getElementById('2410.17855v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 23 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by International Conference on Signal Processing Systems Conference</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.16166</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+H">Han Huang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuqi Huo</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+Z">Zijia Zhao</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+H">Haoyu Lu</a>, <a href="/search/cs?searchtype=author&query=Wu%2C+S">Shu Wu</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+B">Bingning Wang</a>, <a 
href="/search/cs?searchtype=author&query=Liu%2C+Q">Qiang Liu</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+W">Weipeng Chen</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+L">Liang Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.16166v1-abstract-short" style="display: inline;"> Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, $\textit {de facto}$ filter-based data quality enhancement paradigms often discard a substantial portion of high-quality image data due to inadequate semantic alig… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16166v1-abstract-full').style.display = 'inline'; document.getElementById('2410.16166v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.16166v1-abstract-full" style="display: none;"> Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, $\textit {de facto}$ filter-based data quality enhancement paradigms often discard a substantial portion of high-quality image data due to inadequate semantic alignment between images and texts, leading to inefficiencies in data utilization and scalability. In this paper, we propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs. 
AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy to improve evaluative capabilities by integrating deliberately selected low-quality samples during training. Unlike prior approaches that significantly alter text distributions, our method minimally adjusts text to preserve data volume while enhancing quality. Experimental results demonstrate that AITQE surpasses existing methods on various benchmarks, effectively leveraging raw data and scaling efficiently with increasing data volumes. We hope our work will inspire future work. The code and model are available at: <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.16166v1-abstract-full').style.display = 'none'; document.getElementById('2410.16166v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.13694</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Exploring the Design Space of Visual Context Representation in Video MLLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Du%2C+Y">Yifan Du</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuqi Huo</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+K">Kun Zhou</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+Z">Zijia Zhao</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+H">Haoyu Lu</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+H">Han Huang</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+W+X">Wayne Xin Zhao</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+B">Bingning Wang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+W">Weipeng Chen</a>, <a href="/search/cs?searchtype=author&query=Wen%2C+J">Ji-Rong Wen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.13694v1-abstract-short" style="display: inline;"> Video Multimodal Large Language Models (MLLMs) have shown remarkable capability of understanding the video semantics on various downstream tasks. Despite the advancements, there is still a lack of systematic research on visual context representation, which refers to the scheme to select frames from a video and further select the tokens from a frame. 
In this paper, we explore the design space for v… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.13694v1-abstract-full').style.display = 'inline'; document.getElementById('2410.13694v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.13694v1-abstract-full" style="display: none;"> Video Multimodal Large Language Models (MLLMs) have shown remarkable capability of understanding the video semantics on various downstream tasks. Despite the advancements, there is still a lack of systematic research on visual context representation, which refers to the scheme to select frames from a video and further select the tokens from a frame. In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes. Firstly, we formulate the task of visual context representation as a constrained optimization problem, and model the language modeling loss as a function of the number of frames and the number of embeddings (or tokens) per frame, given the maximum visual context window size. Then, we explore the scaling effects in frame selection and token selection respectively, and fit the corresponding function curve by conducting extensive empirical experiments. We examine the effectiveness of typical selection strategies and present empirical findings to determine the two factors. Furthermore, we study the joint effect of frame selection and token selection, and derive the optimal formula for determining the two factors. We demonstrate that the derived optimal settings align with the best-performing results of the empirical experiments. 
Our code and model are available at: <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.13694v1-abstract-full').style.display = 'none'; document.getElementById('2410.13694v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Long Video MLLM; work in progress</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.08565</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Ocean-omni: To Understand the World with Omni-modality </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Li%2C+Y">Yadong Li</a>, <a href="/search/cs?searchtype=author&query=Sun%2C+H">Haoze Sun</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+M">Mingan Lin</a>, <a href="/search/cs?searchtype=author&query=Li%2C+T">Tianpeng Li</a>, <a href="/search/cs?searchtype=author&query=Dong%2C+G">Guosheng Dong</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+T">Tao Zhang</a>, <a href="/search/cs?searchtype=author&query=Ding%2C+B">Bowen Ding</a>, <a href="/search/cs?searchtype=author&query=Song%2C+W">Wei Song</a>, <a 
href="/search/cs?searchtype=author&query=Cheng%2C+Z">Zhenglin Cheng</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuqi Huo</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+S">Song Chen</a>, <a href="/search/cs?searchtype=author&query=Li%2C+X">Xu Li</a>, <a href="/search/cs?searchtype=author&query=Pan%2C+D">Da Pan</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+S">Shusen Zhang</a>, <a href="/search/cs?searchtype=author&query=Wu%2C+X">Xin Wu</a>, <a href="/search/cs?searchtype=author&query=Liang%2C+Z">Zheng Liang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jun Liu</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+T">Tao Zhang</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+K">Keer Lu</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+Y">Yaqi Zhao</a>, <a href="/search/cs?searchtype=author&query=Shen%2C+Y">Yanjun Shen</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+F">Fan Yang</a>, <a href="/search/cs?searchtype=author&query=Yu%2C+K">Kaicheng Yu</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+T">Tao Lin</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+J">Jianhua Xu</a> , et al. (2 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.08565v3-abstract-short" style="display: inline;"> The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. 
In this paper, we introduce Ocean-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.08565v3-abstract-full').style.display = 'inline'; document.getElementById('2410.08565v3-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.08565v3-abstract-full" style="display: none;"> The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Ocean-omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with a 7B model and proceeding through two stages of multimodal alignment and multitask fine-tuning across the audio, image, video, and text modalities. This approach equips the language model with the ability to handle visual and audio data effectively. Demonstrating strong performance across various omni-modal and multimodal benchmarks, we aim for this contribution to serve as a competitive baseline for the open-source community in advancing multimodal understanding and real-time interaction. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.08565v3-abstract-full').style.display = 'none'; document.getElementById('2410.08565v3-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 11 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.05111</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> LiDAR-GS:Real-time LiDAR Re-Simulation using Gaussian Splatting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Chen%2C+Q">Qifeng Chen</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+S">Sheng Yang</a>, <a href="/search/cs?searchtype=author&query=Du%2C+S">Sicong Du</a>, <a href="/search/cs?searchtype=author&query=Tang%2C+T">Tao Tang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+P">Peng Chen</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuchi Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.05111v1-abstract-short" style="display: inline;"> LiDAR simulation plays a crucial role in closed-loop simulation for autonomous driving. 
Although recent advancements, such as the use of reconstructed mesh and Neural Radiance Fields (NeRF), have made progress in simulating the physical properties of LiDAR, these methods have struggled to achieve satisfactory frame rates and rendering quality. To address these limitations, we present LiDAR-GS, the… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.05111v1-abstract-full').style.display = 'inline'; document.getElementById('2410.05111v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.05111v1-abstract-full" style="display: none;"> LiDAR simulation plays a crucial role in closed-loop simulation for autonomous driving. Although recent advancements, such as the use of reconstructed mesh and Neural Radiance Fields (NeRF), have made progress in simulating the physical properties of LiDAR, these methods have struggled to achieve satisfactory frame rates and rendering quality. To address these limitations, we present LiDAR-GS, the first LiDAR Gaussian Splatting method, for real-time high-fidelity re-simulation of LiDAR sensor scans in public urban road scenes. The vanilla Gaussian Splatting, designed for camera models, cannot be directly applied to LiDAR re-simulation. To bridge the gap between passive camera and active LiDAR, our LiDAR-GS designs a differentiable laser beam splatting, grounded in the LiDAR range view model. This innovation allows for precise surface splatting by projecting lasers onto micro cross-sections, effectively eliminating artifacts associated with local affine approximations. Additionally, LiDAR-GS leverages Neural Gaussian Fields, which further integrate view-dependent clues, to represent key LiDAR properties that are influenced by the incident angle and external factors. 
Combining these practices with some essential adaptations, e.g., dynamic instances decomposition, our approach succeeds in simultaneously re-simulating depth, intensity, and ray-drop channels, achieving state-of-the-art results in both rendering frame rate and quality on publicly available large scene datasets. Our source code will be made publicly available. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.05111v1-abstract-full').style.display = 'none'; document.getElementById('2410.05111v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.04503</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> LRHP: Learning Representations for Human Preferences via Preference Pairs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Wang%2C+C">Chenglong Wang</a>, <a href="/search/cs?searchtype=author&query=Gan%2C+Y">Yang Gan</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yifu Huo</a>, <a href="/search/cs?searchtype=author&query=Mu%2C+Y">Yongyu Mu</a>, <a href="/search/cs?searchtype=author&query=He%2C+Q">Qiaozhi He</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+M">Murun Yang</a>, <a href="/search/cs?searchtype=author&query=Xiao%2C+T">Tong Xiao</a>, <a 
href="/search/cs?searchtype=author&query=Zhang%2C+C">Chunliang Zhang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+T">Tongran Liu</a>, <a href="/search/cs?searchtype=author&query=Zhu%2C+J">Jingbo Zhu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.04503v1-abstract-short" style="display: inline;"> To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as "preferred" or "dispreferred". These preference pairs are typically used to encode human preferences into a single numerical value through reward modeling, which acts as a reward signal during reinforcement learning from human feedback (RLHF). Howeve… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.04503v1-abstract-full').style.display = 'inline'; document.getElementById('2410.04503v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.04503v1-abstract-full" style="display: none;"> To improve human-preference alignment training, current research has developed numerous preference datasets consisting of preference pairs labeled as "preferred" or "dispreferred". These preference pairs are typically used to encode human preferences into a single numerical value through reward modeling, which acts as a reward signal during reinforcement learning from human feedback (RLHF). However, representing these human preferences as a numerical value complicates the analysis of these preferences and restricts their broader applications other than RLHF. In contrast, in this work, we introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences. 
We further develop a more generalizable framework, Learning Representations for Human Preferences via preference pairs (namely LRHP), which extends beyond traditional reward modeling to tackle this task. We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction. Building upon the human preferences in representations, we achieve strong performance in both tasks, significantly outperforming baselines. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.04503v1-abstract-full').style.display = 'none'; document.getElementById('2410.04503v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.03778</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> SGW-based Multi-Task Learning in Vision Tasks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhang%2C+R">Ruiyuan Zhang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Y">Yuyao Chen</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuchi Huo</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jiaxiang Liu</a>, <a href="/search/cs?searchtype=author&query=Xi%2C+D">Dianbing Xi</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jie Liu</a>, <a 
href="/search/cs?searchtype=author&query=Wu%2C+C">Chao Wu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.03778v1-abstract-short" style="display: inline;"> Multi-task learning (MTL) is a multi-target optimization task. Neural networks try to realize each target using a shared interpretative space within MTL. However, as the scale of datasets expands and the complexity of tasks increases, knowledge sharing becomes increasingly challenging. In this paper, we first re-examine previous cross-attention MTL methods from the perspective of noise. We theoreti… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.03778v1-abstract-full').style.display = 'inline'; document.getElementById('2410.03778v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.03778v1-abstract-full" style="display: none;"> Multi-task learning (MTL) is a multi-target optimization task. Neural networks try to realize each target using a shared interpretative space within MTL. However, as the scale of datasets expands and the complexity of tasks increases, knowledge sharing becomes increasingly challenging. In this paper, we first re-examine previous cross-attention MTL methods from the perspective of noise. We theoretically analyze this issue and identify it as a flaw in the cross-attention mechanism. To address this issue, we propose an information bottleneck knowledge extraction module (KEM). This module aims to reduce inter-task interference by constraining the flow of information, thereby reducing computational complexity. Furthermore, we have employed neural collapse to stabilize the knowledge-selection process. That is, before input to KEM, we projected the features into ETF space. This mapping makes our method more robust. 
We implemented and conducted comparative experiments with this method on multiple datasets. The results demonstrate that our approach significantly outperforms existing methods in multi-task learning. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.03778v1-abstract-full').style.display = 'none'; document.getElementById('2410.03778v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> ACCV2024 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.01148</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Automatic Image Unfolding and Stitching Framework for Esophageal Lining Video Based on Density-Weighted Feature Matching </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Li%2C+M">Muyang Li</a>, <a href="/search/cs?searchtype=author&query=Xiong%2C+J">Juming Xiong</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Tyree%2C+R+N">Regina N Tyree</a>, <a href="/search/cs?searchtype=author&query=Hiremath%2C+G">Girish Hiremath</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: 
<span class="abstract-short has-text-grey-dark mathjax" id="2410.01148v1-abstract-short" style="display: inline;"> Endoscopy is a crucial tool for diagnosing the gastrointestinal tract, but its effectiveness is often limited by a narrow field of view and the dynamic nature of the internal environment, especially in the esophagus, where complex and repetitive patterns make image stitching challenging. This paper introduces a novel automatic image unfolding and stitching framework tailored for esophageal videos… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.01148v1-abstract-full').style.display = 'inline'; document.getElementById('2410.01148v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.01148v1-abstract-full" style="display: none;"> Endoscopy is a crucial tool for diagnosing the gastrointestinal tract, but its effectiveness is often limited by a narrow field of view and the dynamic nature of the internal environment, especially in the esophagus, where complex and repetitive patterns make image stitching challenging. This paper introduces a novel automatic image unfolding and stitching framework tailored for esophageal videos captured during endoscopy. The method combines feature matching algorithms, including LoFTR, SIFT, and ORB, to create a feature filtering pool and employs a Density-Weighted Homography Optimization (DWHO) algorithm to enhance stitching accuracy. By merging consecutive frames, the framework generates a detailed panoramic view of the esophagus, enabling thorough and accurate visual analysis. Experimental results show the framework achieves low Root Mean Square Error (RMSE) and high Structural Similarity Index (SSIM) across extensive video sequences, demonstrating its potential for clinical use and improving the quality and continuity of endoscopic visual data. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.01148v1-abstract-full').style.display = 'none'; document.getElementById('2410.01148v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.13846</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Multi-Modality Conditioned Variational U-Net for Field-of-View Extension in Brain Diffusion MRI </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Li%2C+Z">Zhiyuan Li</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Kanakaraj%2C+P">Praitayini Kanakaraj</a>, <a href="/search/cs?searchtype=author&query=Gao%2C+C">Chenyu Gao</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+S">Shunxing Bao</a>, <a href="/search/cs?searchtype=author&query=Zuo%2C+L">Lianrui Zuo</a>, <a href="/search/cs?searchtype=author&query=Kim%2C+M+E">Michael E. Kim</a>, <a href="/search/cs?searchtype=author&query=Newlin%2C+N+R">Nancy R. Newlin</a>, <a href="/search/cs?searchtype=author&query=Rudravaram%2C+G">Gaurav Rudravaram</a>, <a href="/search/cs?searchtype=author&query=Khairi%2C+N+M">Nazirah M. 
Khairi</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a>, <a href="/search/cs?searchtype=author&query=Schilling%2C+K+G">Kurt G. Schilling</a>, <a href="/search/cs?searchtype=author&query=Kukull%2C+W+A">Walter A. Kukull</a>, <a href="/search/cs?searchtype=author&query=Toga%2C+A+W">Arthur W. Toga</a>, <a href="/search/cs?searchtype=author&query=Archer%2C+D+B">Derek B. Archer</a>, <a href="/search/cs?searchtype=author&query=Hohman%2C+T+J">Timothy J. Hohman</a>, <a href="/search/cs?searchtype=author&query=Landman%2C+B+A">Bennett A. Landman</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.13846v1-abstract-short" style="display: inline;"> An incomplete field-of-view (FOV) in diffusion magnetic resonance imaging (dMRI) can severely hinder the volumetric and bundle analyses of whole-brain white matter connectivity. Although existing works have investigated imputing the missing regions using deep generative models, it remains unclear how to specifically utilize additional information from paired multi-modality data and whether this ca… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.13846v1-abstract-full').style.display = 'inline'; document.getElementById('2409.13846v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.13846v1-abstract-full" style="display: none;"> An incomplete field-of-view (FOV) in diffusion magnetic resonance imaging (dMRI) can severely hinder the volumetric and bundle analyses of whole-brain white matter connectivity. 
Although existing works have investigated imputing the missing regions using deep generative models, it remains unclear how to specifically utilize additional information from paired multi-modality data and whether this can enhance the imputation quality and be useful for downstream tractography. To fill this gap, we propose a novel framework for imputing dMRI scans in the incomplete part of the FOV by integrating the learned diffusion features in the acquired part of the FOV to the complete brain anatomical structure. We hypothesize that by this design the proposed framework can enhance the imputation performance of the dMRI scans and therefore be useful for repairing whole-brain tractography in corrupted dMRI scans with incomplete FOV. We tested our framework on two cohorts from different sites with a total of 96 subjects and compared it with a baseline imputation method that treats the information from T1w and dMRI scans equally. The proposed framework achieved significant improvements in imputation performance, as demonstrated by angular correlation coefficient (p < 1E-5), and in downstream tractography accuracy, as demonstrated by Dice score (p < 0.01). Results suggest that the proposed framework improved imputation performance in dMRI scans by specifically utilizing additional information from paired multi-modality data, compared with the baseline method. The imputation achieved by the proposed framework enhances whole-brain tractography, and therefore reduces the uncertainty when analyzing bundles associated with neurodegeneration. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.13846v1-abstract-full').style.display = 'none'; document.getElementById('2409.13846v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">20 pages; 8 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.13561</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huang%2C+J">Junjie Huang</a>, <a href="/search/cs?searchtype=author&query=Jiang%2C+Z">Zhihan Jiang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jinyang Liu</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yintong Huo</a>, <a href="/search/cs?searchtype=author&query=Gu%2C+J">Jiazhen Gu</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Z">Zhuangbin Chen</a>, <a href="/search/cs?searchtype=author&query=Feng%2C+C">Cong Feng</a>, <a href="/search/cs?searchtype=author&query=Dong%2C+H">Hui Dong</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+Z">Zengyin Yang</a>, <a 
href="/search/cs?searchtype=author&query=Lyu%2C+M+R">Michael R. Lyu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.13561v1-abstract-short" style="display: inline;"> Logs are imperative in the maintenance of online service systems, which often encompass important information for effective failure mitigation. While existing anomaly detection methodologies facilitate the identification of anomalous logs within extensive runtime data, manual investigation of log messages by engineers remains essential to comprehend faults, which is labor-intensive and error-prone… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.13561v1-abstract-full').style.display = 'inline'; document.getElementById('2409.13561v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.13561v1-abstract-full" style="display: none;"> Logs are imperative in the maintenance of online service systems, which often encompass important information for effective failure mitigation. While existing anomaly detection methodologies facilitate the identification of anomalous logs within extensive runtime data, manual investigation of log messages by engineers remains essential to comprehend faults, which is labor-intensive and error-prone. Upon examining the log-based troubleshooting practices at CloudA, we find that engineers typically prioritize two categories of log information for diagnosis. These include fault-indicating descriptions, which record abnormal system events, and fault-indicating parameters, which specify the associated entities. Motivated by this finding, we propose an approach to automatically extract such fault-indicating information from logs for fault diagnosis, named LoFI. LoFI comprises two key stages. 
In the first stage, LoFI performs coarse-grained filtering to collect logs related to the faults based on semantic similarity. In the second stage, LoFI leverages a pre-trained language model with a novel prompt-based tuning method to extract fine-grained information of interest from the collected logs. We evaluate LoFI on logs collected from Apache Spark and an industrial dataset from CloudA. The experimental results demonstrate that LoFI outperforms all baseline methods by a significant margin, achieving an absolute improvement of 25.8~37.9 in F1 over the best baseline method, ChatGPT. This highlights the effectiveness of LoFI in recognizing fault-indicating information. Furthermore, the successful deployment of LoFI at CloudA and user studies validate the utility of our method. The code and data are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.13561v1-abstract-full').style.display = 'none'; document.getElementById('2409.13561v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">This paper has been accepted by the 35th IEEE International Symposium on Software Reliability Engineering (ISSRE'2024)</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.04851</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> AdaptiveFusion: Adaptive Multi-Modal Multi-View Fusion for 3D Human Body Reconstruction </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Chen%2C+A">Anjun Chen</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+X">Xiangyu Wang</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+Z">Zhi Xu</a>, <a href="/search/cs?searchtype=author&query=Shi%2C+K">Kun Shi</a>, <a href="/search/cs?searchtype=author&query=Qin%2C+Y">Yan Qin</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuchi Huo</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+J">Jiming Chen</a>, <a href="/search/cs?searchtype=author&query=Ye%2C+Q">Qi Ye</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.04851v1-abstract-short" style="display: inline;"> Recent advancements in sensor technology and deep learning have led to significant progress in 3D human body reconstruction. However, most existing approaches rely on data from a specific sensor, which can be unreliable due to the inherent limitations of individual sensing modalities. 
On the other hand, existing multi-modal fusion methods generally require customized designs based on the specific… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04851v1-abstract-full').style.display = 'inline'; document.getElementById('2409.04851v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.04851v1-abstract-full" style="display: none;"> Recent advancements in sensor technology and deep learning have led to significant progress in 3D human body reconstruction. However, most existing approaches rely on data from a specific sensor, which can be unreliable due to the inherent limitations of individual sensing modalities. On the other hand, existing multi-modal fusion methods generally require customized designs based on the specific sensor combinations or setups, which limits the flexibility and generality of these methods. Furthermore, conventional point-image projection-based and Transformer-based fusion networks are susceptible to the influence of noisy modalities and sensor poses. To address these limitations and achieve robust 3D human body reconstruction in various conditions, we propose AdaptiveFusion, a generic adaptive multi-modal multi-view fusion framework that can effectively incorporate arbitrary combinations of uncalibrated sensor inputs. By treating different modalities from various viewpoints as equal tokens, and our handcrafted modality sampling module by leveraging the inherent flexibility of Transformer models, AdaptiveFusion is able to cope with arbitrary numbers of inputs and accommodate noisy modalities with only a single training network. Extensive experiments on large-scale human datasets demonstrate the effectiveness of AdaptiveFusion in achieving high-quality 3D human body reconstruction in various environments. In addition, our method achieves superior accuracy compared to state-of-the-art fusion methods. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04851v1-abstract-full').style.display = 'none'; document.getElementById('2409.04851v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.04563</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Influence of Early through Late Fusion on Pancreas Segmentation from Imperfectly Registered Multimodal MRI </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Remedios%2C+L+W">Lucas W. Remedios</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+H">Han Liu</a>, <a href="/search/cs?searchtype=author&query=Remedios%2C+S+W">Samuel W. Remedios</a>, <a href="/search/cs?searchtype=author&query=Zuo%2C+L">Lianrui Zuo</a>, <a href="/search/cs?searchtype=author&query=Saunders%2C+A+M">Adam M. Saunders</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+S">Shunxing Bao</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a>, <a href="/search/cs?searchtype=author&query=Powers%2C+A+C">Alvin C. Powers</a>, <a href="/search/cs?searchtype=author&query=Virostko%2C+J">John Virostko</a>, <a href="/search/cs?searchtype=author&query=Landman%2C+B+A">Bennett A. 
Landman</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.04563v1-abstract-short" style="display: inline;"> Multimodal fusion promises better pancreas segmentation. However, where to perform fusion in models is still an open question. It is unclear if there is a best location to fuse information when analyzing pairs of imperfectly aligned images. Two main alignment challenges in this pancreas segmentation study are 1) the pancreas is deformable and 2) breathing deforms the abdomen. Even after image regi… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04563v1-abstract-full').style.display = 'inline'; document.getElementById('2409.04563v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.04563v1-abstract-full" style="display: none;"> Multimodal fusion promises better pancreas segmentation. However, where to perform fusion in models is still an open question. It is unclear if there is a best location to fuse information when analyzing pairs of imperfectly aligned images. Two main alignment challenges in this pancreas segmentation study are 1) the pancreas is deformable and 2) breathing deforms the abdomen. Even after image registration, relevant deformations are often not corrected. We examine how early through late fusion impacts pancreas segmentation. We used 353 pairs of T2-weighted (T2w) and T1-weighted (T1w) abdominal MR images from 163 subjects with accompanying pancreas labels. We used image registration (deeds) to align the image pairs. We trained a collection of basic UNets with different fusion points, spanning from early to late, to assess how early through late fusion influenced segmentation performance on imperfectly aligned images. We assessed generalization of fusion points on nnUNet. 
The single-modality T2w baseline using a basic UNet model had a Dice score of 0.73, while the same baseline on the nnUNet model achieved 0.80. For the basic UNet, the best fusion approach occurred in the middle of the encoder (early/mid fusion), which led to a statistically significant improvement of 0.0125 on Dice score compared to the baseline. For the nnUNet, the best fusion approach was naïve image concatenation before the model (early fusion), which resulted in a statistically significant Dice score increase of 0.0021 compared to baseline. Fusion in specific blocks can improve performance, but the best blocks for fusion are model specific, and the gains are small. In imperfectly registered datasets, fusion is a nuanced problem, with the art of design remaining vital for uncovering potential insights. Future innovation is needed to better address fusion in cases of imperfect alignment of abdominal image pairs. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.04563v1-abstract-full').style.display = 'none'; document.getElementById('2409.04563v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">13.5 pages of manuscript content</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.02816</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Physics and Society">physics.soc-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computational Engineering, Finance, and Science">cs.CE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Mathematical Physics">math-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Adaptation and Self-Organizing Systems">nlin.AO</span> </div> </div> <p class="title is-5 mathjax"> Simple fusion-fission quantifies Israel-Palestine violence and suggests multi-adversary solution </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huo%2C+F+Y">Frank Yingjie Huo</a>, <a href="/search/cs?searchtype=author&query=Manrique%2C+P+D">Pedro D. Manrique</a>, <a href="/search/cs?searchtype=author&query=Restrepo%2C+D+J">Dylan J. Restrepo</a>, <a href="/search/cs?searchtype=author&query=Woo%2C+G">Gordon Woo</a>, <a href="/search/cs?searchtype=author&query=Johnson%2C+N+F">Neil F. Johnson</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.02816v2-abstract-short" style="display: inline;"> Why humans fight has no easy answer. However, understanding better how humans fight could inform future interventions, hidden shifts and casualty risk. Fusion-fission describes the well-known grouping behavior of fish etc. 
fighting for survival in the face of strong opponents: they form clusters ('fusion') which provide collective benefits and a cluster scatters when it senses danger ('fission').… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02816v2-abstract-full').style.display = 'inline'; document.getElementById('2409.02816v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.02816v2-abstract-full" style="display: none;"> Why humans fight has no easy answer. However, understanding better how humans fight could inform future interventions, hidden shifts and casualty risk. Fusion-fission describes the well-known grouping behavior of fish etc. fighting for survival in the face of strong opponents: they form clusters ('fusion') which provide collective benefits and a cluster scatters when it senses danger ('fission'). Here we show how similar clustering (fusion-fission) of human fighters provides a unified quantitative explanation for complex casualty patterns across decades of Israel-Palestine region violence, as well as the October 7 surprise attack -- and uncovers a hidden post-October 7 shift. State-of-the-art data shows this fighter fusion-fission in action. It also predicts future 'super-shock' attacks that will be more lethal than October 7 and will arrive earlier. It offers a multi-adversary solution. Our results -- which include testable formulae and a plug-and-play simulation -- enable concrete risk assessments of future casualties and policy-making grounded by fighter behavior. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.02816v2-abstract-full').style.display = 'none'; document.getElementById('2409.02816v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 5 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 4 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Comments welcome. Working paper</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.01148</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> FMRFT: Fusion Mamba and DETR for Query Time Sequence Intersection Fish Tracking </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Yao%2C+M">Mingyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yukang Huo</a>, <a href="/search/cs?searchtype=author&query=Tian%2C+Q">Qingbin Tian</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+J">Jiayin Zhao</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+X">Xiao Liu</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+R">Ruifeng Wang</a>, <a href="/search/cs?searchtype=author&query=Xue%2C+L">Lin Xue</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+H">Haihua Wang</a> </p> <p class="abstract 
mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.01148v2-abstract-short" style="display: inline;"> Early detection of abnormal fish behavior caused by disease or hunger can be achieved through fish tracking using deep learning techniques, which holds significant value for industrial aquaculture. However, underwater reflections and some reasons with fish, such as the high similarity, rapid swimming caused by stimuli and mutual occlusion bring challenges to multi-target tracking of fish. To addre… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01148v2-abstract-full').style.display = 'inline'; document.getElementById('2409.01148v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.01148v2-abstract-full" style="display: none;"> Early detection of abnormal fish behavior caused by disease or hunger can be achieved through fish tracking using deep learning techniques, which holds significant value for industrial aquaculture. However, underwater reflections and some reasons with fish, such as the high similarity, rapid swimming caused by stimuli and mutual occlusion bring challenges to multi-target tracking of fish. To address these challenges, this paper establishes a complex multi-scenario sturgeon tracking dataset and introduces the FMRFT model, a real-time end-to-end fish tracking solution. The model incorporates the low video memory consumption Mamba In Mamba (MIM) architecture, which facilitates multi-frame temporal memory and feature extraction, thereby addressing the challenges to track multiple fish across frames. 
Additionally, the FMRFT model with the Query Time Sequence Intersection (QTSI) module effectively manages occluded objects and reduces redundant tracking frames using the superior feature interaction and prior frame processing capabilities of RT-DETR. This combination significantly enhances the accuracy and stability of fish tracking. Trained and tested on the dataset, the model achieves an IDF1 score of 90.3% and a MOTA accuracy of 94.3%. Experimental results show that the proposed FMRFT model effectively addresses the challenges of high similarity and mutual occlusion in fish populations, enabling accurate tracking in factory farming environments. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.01148v2-abstract-full').style.display = 'none'; document.getElementById('2409.01148v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 2 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">14 pages,14 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.00388</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> A method for detecting dead fish on large water surfaces based on improved YOLOv10 </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Tian%2C+Q">Qingbin Tian</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yukang Huo</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+M">Mingyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+H">Haihua Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.00388v1-abstract-short" style="display: inline;"> Dead fish frequently appear on the water surface due to various factors. If not promptly detected and removed, these dead fish can cause significant issues such as water quality deterioration, ecosystem damage, and disease transmission. Consequently, it is imperative to develop rapid and effective detection methods to mitigate these challenges. 
Conventional methods for detecting dead fish are ofte… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.00388v1-abstract-full').style.display = 'inline'; document.getElementById('2409.00388v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.00388v1-abstract-full" style="display: none;"> Dead fish frequently appear on the water surface due to various factors. If not promptly detected and removed, these dead fish can cause significant issues such as water quality deterioration, ecosystem damage, and disease transmission. Consequently, it is imperative to develop rapid and effective detection methods to mitigate these challenges. Conventional methods for detecting dead fish are often constrained by manpower and time limitations, struggling to effectively manage the intricacies of aquatic environments. This paper proposes an end-to-end detection model built upon an enhanced YOLOv10 framework, designed specifically to swiftly and precisely detect deceased fish across extensive water surfaces. Key enhancements include: (1) Replacing YOLOv10's backbone network with FasterNet to reduce model complexity while maintaining high detection accuracy; (2) Improving feature fusion in the Neck section through enhanced connectivity methods and replacing the original C2f module with CSPStage modules; (3) Adding a compact target detection head to enhance the detection performance of smaller objects. Experimental results demonstrate significant improvements in P (precision), R (recall), and AP (average precision) compared to the baseline model YOLOv10n. Furthermore, our model outperforms other models in the YOLO series by significantly reducing model size and parameter count, while sustaining high inference speed and achieving optimal AP performance. The model facilitates rapid and accurate detection of dead fish in large-scale aquaculture systems. 
Finally, through ablation experiments, we systematically analyze and assess the contribution of each model component to the overall system performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.00388v1-abstract-full').style.display = 'none'; document.getElementById('2409.00388v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 31 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.16313</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> FA-YOLO: Research On Efficient Feature Selection YOLO Improved Algorithm Based On FMDS and AGMF Modules </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yukang Huo</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+M">Mingyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Tian%2C+Q">Qingbin Tian</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+T">Tonghao Wang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+R">Ruifeng Wang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+H">Haihua Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.16313v1-abstract-short" style="display: inline;"> Over the past few years, the YOLO series of 
models has emerged as one of the dominant methodologies in the realm of object detection. Many studies have advanced these baseline models by modifying their architectures, enhancing data quality, and developing new loss functions. However, current models still exhibit deficiencies in processing feature maps, such as overlooking the fusion of cross-scale… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.16313v1-abstract-full').style.display = 'inline'; document.getElementById('2408.16313v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.16313v1-abstract-full" style="display: none;"> Over the past few years, the YOLO series of models has emerged as one of the dominant methodologies in the realm of object detection. Many studies have advanced these baseline models by modifying their architectures, enhancing data quality, and developing new loss functions. However, current models still exhibit deficiencies in processing feature maps, such as overlooking the fusion of cross-scale features and a static fusion approach that lacks the capability for dynamic feature adjustment. To address these issues, this paper introduces an efficient Fine-grained Multi-scale Dynamic Selection Module (FMDS Module), which applies a more effective dynamic feature selection and fusion method on fine-grained multi-scale feature maps, significantly enhancing the detection accuracy of small, medium, and large-sized targets in complex environments. Furthermore, this paper proposes an Adaptive Gated Multi-branch Focus Fusion Module (AGMF Module), which utilizes multiple parallel branches to perform complementary fusion of various features captured by the gated unit branch, FMDS Module branch, and TripletAttention branch. This approach further enhances the comprehensiveness, diversity, and integrity of feature fusion. 
This paper has integrated the FMDS Module and the AGMF Module into YOLOv9 to develop a novel object detection model named FA-YOLO. Extensive experimental results show that under identical experimental conditions, FA-YOLO achieves an outstanding 66.1% mean Average Precision (mAP) on the PASCAL VOC 2007 dataset, representing a 1.0% improvement over YOLOv9's 65.1%. Additionally, the detection accuracies of FA-YOLO for small, medium, and large targets are 44.1%, 54.6%, and 70.8%, respectively, showing improvements of 2.0%, 3.1%, and 0.9% compared to YOLOv9's 42.1%, 51.5%, and 69.9%. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.16313v1-abstract-full').style.display = 'none'; document.getElementById('2408.16313v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">11 pages and 4 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.13389</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Quantum Physics">quant-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Emerging Technologies">cs.ET</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1145/3676536.3676697 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> ReCon: Reconfiguring Analog Rydberg Atom Quantum Computers for Quantum Generative Adversarial Networks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=DiBrita%2C+N+S">Nicholas S. 
DiBrita</a>, <a href="/search/cs?searchtype=author&query=Leeds%2C+D">Daniel Leeds</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuqian Huo</a>, <a href="/search/cs?searchtype=author&query=Ludmir%2C+J">Jason Ludmir</a>, <a href="/search/cs?searchtype=author&query=Patel%2C+T">Tirthak Patel</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.13389v2-abstract-short" style="display: inline;"> Quantum computing has shown theoretical promise of speedup in several machine learning tasks, including generative tasks using generative adversarial networks (GANs). While quantum computers have been implemented with different types of technologies, recently, analog Rydberg atom quantum computers have been demonstrated to have desirable properties such as reconfigurable qubit (quantum bit) positi… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.13389v2-abstract-full').style.display = 'inline'; document.getElementById('2408.13389v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.13389v2-abstract-full" style="display: none;"> Quantum computing has shown theoretical promise of speedup in several machine learning tasks, including generative tasks using generative adversarial networks (GANs). While quantum computers have been implemented with different types of technologies, recently, analog Rydberg atom quantum computers have been demonstrated to have desirable properties such as reconfigurable qubit (quantum bit) positions and multi-qubit operations. To leverage the properties of this technology, we propose ReCon, the first work to implement quantum GANs on analog Rydberg atom quantum computers. 
Our evaluation using simulations and real-computer executions shows 33% better quality (measured using Frechet Inception Distance (FID)) in generated images than the state-of-the-art technique implemented on superconducting-qubit technology. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.13389v2-abstract-full').style.display = 'none'; document.getElementById('2408.13389v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 23 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ReCon will appear in the Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.12109</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Wang%2C+C">Chenglong Wang</a>, <a href="/search/cs?searchtype=author&query=Gan%2C+Y">Yang Gan</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yifu Huo</a>, <a href="/search/cs?searchtype=author&query=Mu%2C+Y">Yongyu 
Mu</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+M">Murun Yang</a>, <a href="/search/cs?searchtype=author&query=He%2C+Q">Qiaozhi He</a>, <a href="/search/cs?searchtype=author&query=Xiao%2C+T">Tong Xiao</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+C">Chunliang Zhang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+T">Tongran Liu</a>, <a href="/search/cs?searchtype=author&query=Du%2C+Q">Quan Du</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+D">Di Yang</a>, <a href="/search/cs?searchtype=author&query=Zhu%2C+J">Jingbo Zhu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.12109v1-abstract-short" style="display: inline;"> Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is using human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques face the difficulty arising from the sc… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.12109v1-abstract-full').style.display = 'inline'; document.getElementById('2408.12109v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.12109v1-abstract-full" style="display: none;"> Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is using human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. 
However, these techniques face the difficulty arising from the scarcity of visual preference data, which is required to train a visual reward model (VRM). In this work, we continue the line of research. We present a Robust Visual Reward Model (RoVRM) which improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through a three-phase progressive training and optimal transport-based preference data selection to effectively mitigate the scarcity of visual preference data. We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.12109v1-abstract-full').style.display = 'none'; document.getElementById('2408.12109v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.11323</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Optimizing Transmit Field Inhomogeneity of Parallel RF Transmit Design in 7T MRI using Deep Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Lu%2C+Z">Zhengyi Lu</a>, <a href="/search/cs?searchtype=author&query=Liang%2C+H">Hao Liang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+X">Xiao Wang</a>, <a href="/search/cs?searchtype=author&query=Yan%2C+X">Xinqiang Yan</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.11323v1-abstract-short" style="display: inline;"> Ultrahigh field (UHF) Magnetic Resonance Imaging (MRI) provides a higher signal-to-noise ratio and, thereby, higher spatial resolution. However, UHF MRI introduces challenges such as transmit radiofrequency (RF) field (B1+) inhomogeneities, leading to uneven flip angles and image intensity anomalies. These issues can significantly degrade imaging quality and its medical applications. 
This study ad… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.11323v1-abstract-full').style.display = 'inline'; document.getElementById('2408.11323v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.11323v1-abstract-full" style="display: none;"> Ultrahigh field (UHF) Magnetic Resonance Imaging (MRI) provides a higher signal-to-noise ratio and, thereby, higher spatial resolution. However, UHF MRI introduces challenges such as transmit radiofrequency (RF) field (B1+) inhomogeneities, leading to uneven flip angles and image intensity anomalies. These issues can significantly degrade imaging quality and its medical applications. This study addresses B1+ field homogeneity through a novel deep learning-based strategy. Traditional methods like Magnitude Least Squares (MLS) optimization have been effective but are time-consuming and dependent on the patient's presence. Recent machine learning approaches, such as RF Shim Prediction by Iteratively Projected Ridge Regression and deep learning frameworks, have shown promise but face limitations like extensive training times and oversimplified architectures. We propose a two-step deep learning strategy. First, we obtain the desired reference RF shimming weights from multi-channel B1+ fields using random-initialized Adaptive Moment Estimation. Then, we employ Residual Networks (ResNets) to train a model that maps B1+ fields to target RF shimming outputs. Our approach does not rely on pre-calculated reference optimizations for the testing process and efficiently learns residual functions. Comparative studies with traditional MLS optimization demonstrate our method's advantages in terms of speed and accuracy. The proposed strategy achieves a faster and more efficient RF shimming design, significantly improving imaging quality at UHF. 
This advancement holds potential for broader applications in medical imaging and diagnostics. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.11323v1-abstract-full').style.display = 'none'; document.getElementById('2408.11323v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.09278</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Cross-Species Data Integration for Enhanced Layer Segmentation in Kidney Pathology </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhu%2C+J">Junchao Zhu</a>, <a href="/search/cs?searchtype=author&query=Yin%2C+M">Mengmeng Yin</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Long%2C+Y">Yitian Long</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Y">Yu Wang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Y">Yaohong Wang</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+S">Shilin Zhao</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short 
has-text-grey-dark mathjax" id="2408.09278v1-abstract-short" style="display: inline;"> Accurate delineation of the boundaries between the renal cortex and medulla is crucial for subsequent functional structural analysis and disease diagnosis. Training high-quality deep-learning models for layer segmentation relies on the availability of large amounts of annotated data. However, due to patient privacy constraints on medical data and scarce clinical cases, constructing pathological datasets… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.09278v1-abstract-full').style.display = 'inline'; document.getElementById('2408.09278v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.09278v1-abstract-full" style="display: none;"> Accurate delineation of the boundaries between the renal cortex and medulla is crucial for subsequent functional structural analysis and disease diagnosis. Training high-quality deep-learning models for layer segmentation relies on the availability of large amounts of annotated data. However, due to patient privacy constraints on medical data and scarce clinical cases, constructing pathological datasets from clinical sources is relatively difficult and expensive. Moreover, using external natural image datasets introduces noise during the domain generalization process. Cross-species homologous data, such as mouse kidney data, which exhibits high structural and feature similarity to human kidneys, has the potential to enhance model performance on human datasets. In this study, we incorporated the collected private Periodic Acid-Schiff (PAS) stained mouse kidney dataset into the human kidney dataset for joint training. 
The results showed that after introducing cross-species homologous data, the semantic segmentation models based on CNN and Transformer architectures achieved an average increase of 1.77% and 1.24% in mIoU, and 1.76% and 0.89% in Dice score for the human renal cortex and medulla datasets, respectively. This approach is also capable of enhancing the model's generalization ability. This indicates that cross-species homologous data, as a low-noise trainable data source, can help improve model performance under conditions of limited clinical samples. Code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.09278v1-abstract-full').style.display = 'none'; document.getElementById('2408.09278v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.07905</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Persistence Image from 3D Medical Image: Superpixel and Optimized Gaussian Coefficient </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhu%2C+Y">Yanfan Zhu</a>, <a href="/search/cs?searchtype=author&query=Singh%2C+Y">Yash Singh</a>, <a href="/search/cs?searchtype=author&query=Younis%2C+K">Khaled Younis</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+S">Shunxing Bao</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.07905v1-abstract-short" style="display: inline;"> Topological data analysis (TDA) uncovers crucial properties of objects in medical imaging. Methods based on persistent homology have demonstrated their advantages in capturing topological features that traditional deep learning methods cannot detect in both radiology and pathology. However, previous research primarily focused on 2D image analysis, neglecting the comprehensive 3D context. In this p… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.07905v1-abstract-full').style.display = 'inline'; document.getElementById('2408.07905v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.07905v1-abstract-full" style="display: none;"> Topological data analysis (TDA) uncovers crucial properties of objects in medical imaging. 
Methods based on persistent homology have demonstrated their advantages in capturing topological features that traditional deep learning methods cannot detect in both radiology and pathology. However, previous research primarily focused on 2D image analysis, neglecting the comprehensive 3D context. In this paper, we propose an innovative 3D TDA approach that incorporates the concept of superpixels to transform 3D medical image features into point cloud data. By utilizing an Optimized Gaussian Coefficient, the proposed 3D TDA method, for the first time, efficiently generates holistic Persistence Images for 3D volumetric data. Our 3D TDA method exhibits superior performance on the MedMNist3D dataset when compared to other traditional methods, showcasing its potential effectiveness in modeling 3D persistent homology-based topological analysis when it comes to classification tasks. The source code is publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.07905v1-abstract-full').style.display = 'none'; document.getElementById('2408.07905v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.06381</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Assessment of Cell Nuclei AI Foundation Models in Kidney Pathology </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Guo%2C+J">Junlin Guo</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+S">Siqi Lu</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Tao%2C+Z">Zhewen Tao</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+Y">Yizhe Lin</a>, <a href="/search/cs?searchtype=author&query=Lionts%2C+M">Marilyn Lionts</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Xiong%2C+J">Juming Xiong</a>, <a href="/search/cs?searchtype=author&query=Chang%2C+C">Catie Chang</a>, <a href="/search/cs?searchtype=author&query=Wilkes%2C+M">Mitchell Wilkes</a>, <a href="/search/cs?searchtype=author&query=Yin%2C+M">Mengmeng Yin</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2408.06381v1-abstract-short" style="display: inline;"> Cell nuclei instance segmentation is a crucial task in digital kidney pathology. Traditional automatic segmentation methods often lack generalizability when applied to unseen datasets. Recently, the success of foundation models (FMs) has provided a more generalizable solution, potentially enabling the segmentation of any cell type. In this study, we perform a large-scale evaluation of three widely… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.06381v1-abstract-full').style.display = 'inline'; document.getElementById('2408.06381v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.06381v1-abstract-full" style="display: none;"> Cell nuclei instance segmentation is a crucial task in digital kidney pathology. Traditional automatic segmentation methods often lack generalizability when applied to unseen datasets. Recently, the success of foundation models (FMs) has provided a more generalizable solution, potentially enabling the segmentation of any cell type. In this study, we perform a large-scale evaluation of three widely used state-of-the-art (SOTA) cell nuclei foundation models (Cellpose, StarDist, and CellViT). Specifically, we created a highly diverse evaluation dataset consisting of 2,542 kidney whole slide images (WSIs) collected from both human and rodent sources, encompassing various tissue types, sizes, and staining methods. To our knowledge, this is the largest-scale evaluation of its kind to date. Our quantitative analysis of the prediction distribution reveals a persistent performance gap in kidney pathology. Among the evaluated models, CellViT demonstrated superior performance in segmenting nuclei in kidney pathology. However, none of the foundation models are perfect; a performance gap remains in general nuclei segmentation for kidney pathology. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.06381v1-abstract-full').style.display = 'none'; document.getElementById('2408.06381v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.03464</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> AI Foundation Models in Remote Sensing: A Survey </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Lu%2C+S">Siqi Lu</a>, <a href="/search/cs?searchtype=author&query=Guo%2C+J">Junlin Guo</a>, <a href="/search/cs?searchtype=author&query=Zimmer-Dauphinee%2C+J+R">James R Zimmer-Dauphinee</a>, <a href="/search/cs?searchtype=author&query=Nieusma%2C+J+M">Jordan M Nieusma</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+X">Xiao Wang</a>, <a href="/search/cs?searchtype=author&query=VanValkenburgh%2C+P">Parker VanValkenburgh</a>, <a href="/search/cs?searchtype=author&query=Wernke%2C+S+A">Steven A Wernke</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.03464v1-abstract-short" style="display: inline;"> Artificial Intelligence (AI) technologies have profoundly transformed 
the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing has been significantly enhanced by the advent of foundation models--large-scale, pre-trained AI models capable of performing a wide array of tasks with un… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.03464v1-abstract-full').style.display = 'inline'; document.getElementById('2408.03464v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.03464v1-abstract-full" style="display: none;"> Artificial Intelligence (AI) technologies have profoundly transformed the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing has been significantly enhanced by the advent of foundation models--large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency. This paper provides a comprehensive survey of foundation models in the remote sensing domain, covering models released between June 2021 and June 2024. We categorize these models based on their applications in computer vision and domain-specific tasks, offering insights into their architectures, pre-training datasets, and methodologies. Through detailed performance comparisons, we highlight emerging trends and the significant advancements achieved by these foundation models. Additionally, we discuss the technical challenges, practical implications, and future research directions, addressing the need for high-quality data, computational resources, and improved model generalization. 
Our research also finds that pre-training methods, particularly self-supervised learning techniques like contrastive learning and masked autoencoders, significantly enhance the performance and robustness of foundation models in remote sensing tasks such as scene classification, object detection, and other applications. This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.03464v1-abstract-full').style.display = 'none'; document.getElementById('2408.03464v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2408.03101</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> </div> </div> <p class="title is-5 mathjax"> Automated Defects Detection and Fix in Logging Statement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhong%2C+R">Renyi Zhong</a>, <a href="/search/cs?searchtype=author&query=Li%2C+Y">Yichen Li</a>, <a href="/search/cs?searchtype=author&query=Kuang%2C+J">Jinxi Kuang</a>, <a href="/search/cs?searchtype=author&query=Gu%2C+W">Wenwei Gu</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yintong Huo</a>, <a href="/search/cs?searchtype=author&query=Lyu%2C+M+R">Michael R. 
Lyu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2408.03101v1-abstract-short" style="display: inline;"> Developers use logging statements to monitor software, but misleading logs can complicate maintenance by obscuring actual activities. Existing research on logging quality issues is limited, mainly focusing on single defects and manual fixes. To address this, we conducted a study identifying four defect types in logging statements through real-world log changes analysis. We propose LogFixer, a two-… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.03101v1-abstract-full').style.display = 'inline'; document.getElementById('2408.03101v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2408.03101v1-abstract-full" style="display: none;"> Developers use logging statements to monitor software, but misleading logs can complicate maintenance by obscuring actual activities. Existing research on logging quality issues is limited, mainly focusing on single defects and manual fixes. To address this, we conducted a study identifying four defect types in logging statements through real-world log changes analysis. We propose LogFixer, a two-stage framework for automatic detection and updating of logging statements. In the offline stage, LogFixer uses a similarity-based classifier on synthetic defective logs to identify defects. During the online phase, this classifier evaluates logs in code snippets to determine necessary improvements, and an LLM-based recommendation framework suggests updates based on historical log changes. We evaluated LogFixer on real-world and synthetic datasets, and new real-world projects, achieving an F1 score of 0.625. 
LogFixer significantly improved static text and dynamic variables suggestions by 48.12% and 24.90%, respectively, and achieved a 61.49% success rate in recommending correct updates for new projects. We reported 40 problematic logs to GitHub, resulting in 25 confirmed and merged changes across 11 projects. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2408.03101v1-abstract-full').style.display = 'none'; document.getElementById('2408.03101v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 August, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.18390</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Adapting Mouse Pathological Model to Human Glomerular Lesion Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Yu%2C+L">Lining Yu</a>, <a href="/search/cs?searchtype=author&query=Yin%2C+M">Mengmeng Yin</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Y">Yu Wang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Y">Yaohong Wang</a>, <a 
href="/search/cs?searchtype=author&query=Zhao%2C+S">Shilin Zhao</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.18390v1-abstract-short" style="display: inline;"> Moving from animal models to human applications in preclinical research encompasses a broad spectrum of disciplines in medical science. A fundamental element in the development of new drugs, treatments, diagnostic methods, and in deepening our understanding of disease processes is the accurate measurement of kidney tissues. Past studies have demonstrated the viability of translating glomeruli segm… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.18390v1-abstract-full').style.display = 'inline'; document.getElementById('2407.18390v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.18390v1-abstract-full" style="display: none;"> Moving from animal models to human applications in preclinical research encompasses a broad spectrum of disciplines in medical science. A fundamental element in the development of new drugs, treatments, diagnostic methods, and in deepening our understanding of disease processes is the accurate measurement of kidney tissues. Past studies have demonstrated the viability of translating glomeruli segmentation techniques from mouse models to human applications. Yet, these investigations tend to neglect the complexities involved in segmenting pathological glomeruli affected by different lesions. Such lesions present a wider range of morphological variations compared to healthy glomerular tissue, which are arguably more valuable than normal glomeruli in clinical practice. 
Furthermore, data on lesions from animal models can be more readily scaled up from disease models and whole kidney biopsies. This brings up a question: "<em>Can a pathological segmentation model trained on mouse models be effectively applied to human patients?</em>" To answer this question, we introduced GLAM, a deep learning study for fine-grained segmentation of human kidney lesions using a mouse model, addressing mouse-to-human transfer learning, by evaluating different learning strategies for segmenting human pathological lesions using zero-shot transfer learning and hybrid learning by leveraging mouse samples. From the results, the hybrid learning model achieved superior performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.18390v1-abstract-full').style.display = 'none'; document.getElementById('2407.18390v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.14429</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Dataset Distillation in Medical Imaging: A Feasibility Study </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Li%2C+M">Muyang Li</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Lionts%2C+M">Marilyn Lionts</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.14429v1-abstract-short" style="display: inline;"> Data sharing in the medical image analysis field has potential yet remains underappreciated. The aim is often to share datasets efficiently with other sites to train models effectively. One possible solution is to avoid transferring the entire dataset while still achieving similar model performance. 
Recent progress in data distillation within computer science offers promising prospects for sharing… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.14429v1-abstract-full').style.display = 'inline'; document.getElementById('2407.14429v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.14429v1-abstract-full" style="display: none;"> Data sharing in the medical image analysis field has potential yet remains underappreciated. The aim is often to share datasets efficiently with other sites to train models effectively. One possible solution is to avoid transferring the entire dataset while still achieving similar model performance. Recent progress in data distillation within computer science offers promising prospects for sharing medical data efficiently without significantly compromising model effectiveness. However, it remains uncertain whether these methods would be applicable to medical imaging, since medical and natural images are distinct fields. Moreover, it is intriguing to consider what level of performance could be achieved with these methods. To answer these questions, we conduct investigations on a variety of leading data distillation methods, in different contexts of medical imaging. We evaluate the feasibility of these methods with extensive experiments in two aspects: 1) Assess the impact of data distillation across multiple datasets characterized by minor or great variations. 2) Explore the indicator to predict the distillation performance. Our extensive experiments across multiple medical datasets reveal that data distillation can significantly reduce dataset size while maintaining comparable model performance to that achieved with the full dataset, suggesting that a small, representative sample of images can serve as a reliable indicator of distillation success. 
This study demonstrates that data distillation is a viable method for efficient and secure medical data sharing, with the potential to facilitate enhanced collaborative research and clinical applications. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.14429v1-abstract-full').style.display = 'none'; document.getElementById('2407.14429v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 19 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.09979</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> PFPs: Prompt-guided Flexible Pathological Segmentation for Diverse Potential Outcomes Using Large Vision and Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Guo%2C+J">Junlin Guo</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.09979v1-abstract-short" style="display: inline;"> The Vision Foundation Model has 
recently gained attention in medical image analysis. Its zero-shot learning capabilities accelerate AI deployment and enhance the generalizability of clinical applications. However, segmenting pathological images presents a special focus on the flexibility of segmentation targets. For instance, a single click on a Whole Slide Image (WSI) could signify a cell, a func… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.09979v1-abstract-full').style.display = 'inline'; document.getElementById('2407.09979v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.09979v1-abstract-full" style="display: none;"> The Vision Foundation Model has recently gained attention in medical image analysis. Its zero-shot learning capabilities accelerate AI deployment and enhance the generalizability of clinical applications. However, segmenting pathological images presents a special focus on the flexibility of segmentation targets. For instance, a single click on a Whole Slide Image (WSI) could signify a cell, a functional unit, or layers, adding layers of complexity to the segmentation tasks. Current models primarily predict potential outcomes but lack the flexibility needed for physician input. In this paper, we explore the potential of enhancing segmentation model flexibility by introducing various task prompts through a Large Language Model (LLM) alongside traditional task tokens. Our contribution is four-fold: (1) we construct a computationally efficient pipeline that uses fine-tuned language prompts to guide flexible multi-class segmentation; (2) we compare segmentation performance with fixed prompts against free-text prompts; (3) we design a multi-task kidney pathology segmentation dataset and various corresponding free-text prompts; and (4) we evaluate our approach on the kidney pathology dataset, assessing its capacity to handle new cases during inference. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.09979v1-abstract-full').style.display = 'none'; document.getElementById('2407.09979v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.06116</a> <span> [<a href="">pdf</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Data-driven Nucleus Subclassification on Colon H&E using Style-transferred Digital Pathology </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Remedios%2C+L+W">Lucas W. Remedios</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+S">Shunxing Bao</a>, <a href="/search/cs?searchtype=author&query=Remedios%2C+S+W">Samuel W. Remedios</a>, <a href="/search/cs?searchtype=author&query=Lee%2C+H+H">Ho Hin Lee</a>, <a href="/search/cs?searchtype=author&query=Cai%2C+L+Y">Leon Y. Cai</a>, <a href="/search/cs?searchtype=author&query=Li%2C+T">Thomas Li</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Newlin%2C+N+R">Nancy R. Newlin</a>, <a href="/search/cs?searchtype=author&query=Saunders%2C+A+M">Adam M. 
Saunders</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Li%2C+J">Jia Li</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Qi Liu</a>, <a href="/search/cs?searchtype=author&query=Lau%2C+K+S">Ken S. Lau</a>, <a href="/search/cs?searchtype=author&query=Roland%2C+J+T">Joseph T. Roland</a>, <a href="/search/cs?searchtype=author&query=Washington%2C+M+K">Mary K Washington</a>, <a href="/search/cs?searchtype=author&query=Coburn%2C+L+A">Lori A. Coburn</a>, <a href="/search/cs?searchtype=author&query=Wilson%2C+K+T">Keith T. Wilson</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a>, <a href="/search/cs?searchtype=author&query=Landman%2C+B+A">Bennett A. Landman</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.06116v1-abstract-short" style="display: inline;"> Understanding the way cells communicate, co-locate, and interrelate is essential to furthering our understanding of how the body functions. H&E is widely available, however, cell subtyping often requires expert knowledge and the use of specialized stains. To reduce the annotation burden, AI has been proposed for the classification of cells on H&E. For example, the recent Colon Nucleus Identificati… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.06116v1-abstract-full').style.display = 'inline'; document.getElementById('2407.06116v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.06116v1-abstract-full" style="display: none;"> Understanding the way cells communicate, co-locate, and interrelate is essential to furthering our understanding of how the body functions. H&E is widely available, however, cell subtyping often requires expert knowledge and the use of specialized stains. 
To reduce the annotation burden, AI has been proposed for the classification of cells on H&E. For example, the recent Colon Nucleus Identification and Classification (CoNIC) Challenge focused on labeling 6 cell types on H&E of the colon. However, the CoNIC Challenge was unable to classify epithelial subtypes (progenitor, enteroendocrine, goblet), lymphocyte subtypes (B, helper T, cytotoxic T), and connective subtypes (fibroblasts). We use inter-modality learning to label previously un-labelable cell types on H&E. We take advantage of multiplexed immunofluorescence (MxIF) histology to label 14 cell subclasses. We performed style transfer on the same MxIF tissues to synthesize realistic virtual H&E which we paired with the MxIF-derived cell subclassification labels. We evaluated the efficacy of using a supervised learning scheme where the input was realistic-quality virtual H&E and the labels were MxIF-derived cell subclasses. We assessed our model on private virtual H&E and public real H&E. On virtual H&E, we were able to classify helper T cells and epithelial progenitors with positive predictive values of $0.34 \pm 0.15$ (prevalence $0.03 \pm 0.01$) and $0.47 \pm 0.1$ (prevalence $0.07 \pm 0.02$) respectively, when using ground truth centroid information. On real H&E we could classify helper T cells and epithelial progenitors with upper bound positive predictive values of $0.43 \pm 0.03$ (parent class prevalence 0.21) and $0.94 \pm 0.02$ (parent class prevalence 0.49) when using ground truth centroid information. This is the first work to provide cell type classification for helper T and epithelial progenitor nuclei on H&E. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.06116v1-abstract-full').style.display = 'none'; document.getElementById('2407.06116v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">arXiv admin note: text overlap with arXiv:2401.05602</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.05017</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Robotics">cs.RO</span> </div> </div> <p class="title is-5 mathjax"> VIPS-Odom: Visual-Inertial Odometry Tightly-coupled with Parking Slots for Autonomous Parking </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Jiang%2C+X">Xuefeng Jiang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+F">Fangyuan Wang</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+R">Rongzhang Zheng</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+H">Han Liu</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yixiong Huo</a>, <a href="/search/cs?searchtype=author&query=Peng%2C+J">Jinzhang Peng</a>, <a href="/search/cs?searchtype=author&query=Tian%2C+L">Lu Tian</a>, <a href="/search/cs?searchtype=author&query=Barsoum%2C+E">Emad Barsoum</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.05017v1-abstract-short" style="display: inline;"> 
Precise localization is of great importance for the autonomous parking task since it provides service for the downstream planning and control modules, which significantly affects the system performance. For parking scenarios, dynamic lighting, sparse textures, and the instability of global positioning system (GPS) signals pose challenges for most traditional localization methods. To address these diff… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.05017v1-abstract-full').style.display = 'inline'; document.getElementById('2407.05017v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.05017v1-abstract-full" style="display: none;"> Precise localization is of great importance for the autonomous parking task since it provides service for the downstream planning and control modules, which significantly affects the system performance. For parking scenarios, dynamic lighting, sparse textures, and the instability of global positioning system (GPS) signals pose challenges for most traditional localization methods. To address these difficulties, we propose VIPS-Odom, a novel semantic visual-inertial odometry framework for underground autonomous parking, which adopts tightly-coupled optimization to fuse measurements from multi-modal sensors and solves odometry. Our VIPS-Odom integrates parking slots detected from the synthesized bird's-eye-view (BEV) image with traditional feature points in the frontend, and conducts tightly-coupled optimization with joint constraints introduced by measurements from the inertial measurement unit, wheel speed sensor and parking slots in the backend. We develop a multi-object tracking framework to robustly track parking slots' states. To prove the superiority of our method, we equip an electronic vehicle with related sensors and build an experimental platform based on the ROS2 system. 
Extensive experiments demonstrate the efficacy and advantages of our method compared with other baselines for parking scenarios. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.05017v1-abstract-full').style.display = 'none'; document.getElementById('2407.05017v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">A SLAM Method for Autonomous Parking</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.03307</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> HoloHisto: End-to-end Gigapixel WSI Segmentation with 4K Resolution Sequential Tokenization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Tang%2C+Y">Yucheng Tang</a>, <a href="/search/cs?searchtype=author&query=He%2C+Y">Yufan He</a>, <a href="/search/cs?searchtype=author&query=Nath%2C+V">Vishwesh Nath</a>, <a href="/search/cs?searchtype=author&query=Guo%2C+P">Pengfeig Guo</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a 
href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Yin%2C+M">Mengmeng Yin</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+Z">Ziyue Xu</a>, <a href="/search/cs?searchtype=author&query=Roth%2C+H">Holger Roth</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+D">Daguang Xu</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.03307v1-abstract-short" style="display: inline;"> In digital pathology, the traditional method for deep learning-based image segmentation typically involves a two-stage process: initially segmenting high-resolution whole slide images (WSI) into smaller patches (e.g., 256x256, 512x512, 1024x1024) and subsequently reconstructing them to their original scale. This method often struggles to capture the complex details and vast scope of WSIs. In this… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.03307v1-abstract-full').style.display = 'inline'; document.getElementById('2407.03307v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.03307v1-abstract-full" style="display: none;"> In digital pathology, the traditional method for deep learning-based image segmentation typically involves a two-stage process: initially segmenting high-resolution whole slide images (WSI) into smaller patches (e.g., 256x256, 512x512, 1024x1024) and subsequently reconstructing them to their original scale. This method often struggles to capture the complex details and vast scope of WSIs. 
In this paper, we propose the holistic histopathology (HoloHisto) segmentation method to achieve end-to-end segmentation on gigapixel WSIs, whose maximum resolution is above 80,000$\times$70,000 pixels. HoloHisto fundamentally shifts the paradigm of WSI segmentation to an end-to-end learning fashion with 1) a large (4K) resolution base patch for elevated visual information inclusion and efficient processing, and 2) a novel sequential tokenization mechanism to properly model the contextual relationships and efficiently model the rich information from the 4K input. To the best of our knowledge, HoloHisto presents the first holistic approach for gigapixel resolution WSI segmentation, supporting direct I/O of complete WSIs and their corresponding gigapixel masks. Under the HoloHisto platform, we unveil a random 4K sampler that transcends ultra-high resolution, delivering 31 and 10 times more pixels than standard 2D and 3D patches, respectively, for advancing computational capabilities. To facilitate efficient 4K resolution dense prediction, we leverage sequential tokenization, utilizing a pre-trained image tokenizer to group image features into a discrete token grid. To assess the performance, our team curated a new kidney pathology image segmentation (KPIs) dataset with WSI-level glomeruli segmentation from whole mouse kidneys. From the results, HoloHisto-4K delivers remarkable performance gains over previous state-of-the-art models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.03307v1-abstract-full').style.display = 'none'; document.getElementById('2407.03307v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 3 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.02386</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> OpenSlot: Mixed Open-set Recognition with Object-centric Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Yin%2C+X">Xu Yin</a>, <a href="/search/cs?searchtype=author&query=Pan%2C+F">Fei Pan</a>, <a href="/search/cs?searchtype=author&query=An%2C+G">Guoyuan An</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuchi Huo</a>, <a href="/search/cs?searchtype=author&query=Xie%2C+Z">Zixuan Xie</a>, <a href="/search/cs?searchtype=author&query=Yoon%2C+S">Sung-Eui Yoon</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.02386v1-abstract-short" style="display: inline;"> Existing open-set recognition (OSR) studies typically assume that each image contains only one class label, and the unknown test set (negative) has a disjoint label space from the known test set (positive), a scenario termed full-label shift. 
This paper introduces the mixed OSR problem, where test images contain multiple class semantics, with known and unknown classes co-occurring in negatives, le… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.02386v1-abstract-full').style.display = 'inline'; document.getElementById('2407.02386v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.02386v1-abstract-full" style="display: none;"> Existing open-set recognition (OSR) studies typically assume that each image contains only one class label, and the unknown test set (negative) has a disjoint label space from the known test set (positive), a scenario termed full-label shift. This paper introduces the mixed OSR problem, where test images contain multiple class semantics, with known and unknown classes co-occurring in negatives, leading to a more challenging super-label shift. Addressing the mixed OSR requires classification models to accurately distinguish different class semantics within images and measure their "knowness". In this study, we propose the OpenSlot framework, built upon object-centric learning. OpenSlot utilizes slot features to represent diverse class semantics and produce class predictions. Through our proposed anti-noise-slot (ANS) technique, we mitigate the impact of noise (invalid and background) slots during classification training, effectively addressing the semantic misalignment between class predictions and the ground truth. We conduct extensive experiments with OpenSlot on mixed & conventional OSR benchmarks. Without elaborate designs, OpenSlot not only exceeds existing OSR studies in detecting super-label shifts across single & multi-label mixed OSR tasks but also achieves state-of-the-art performance on conventional benchmarks. Remarkably, our method can localize class objects without using bounding boxes during training. 
The competitive performance in open-set object detection demonstrates OpenSlot's ability to explicitly explain label shifts and benefits in computational efficiency and generalization. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.02386v1-abstract-full').style.display = 'none'; document.getElementById('2407.02386v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">This study is under IEEE TMM review</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.00596</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> HATs: Hierarchical Adaptive Taxonomy Segmentation for Panoramic Pathology Image Analysis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Xiong%2C+J">Juming Xiong</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+S">Shunxing Bao</a>, <a href="/search/cs?searchtype=author&query=Li%2C+H">Hao Li</a>, <a 
href="/search/cs?searchtype=author&query=Yin%2C+M">Mengmeng Yin</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Y">Yu Wang</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+S">Shilin Zhao</a>, <a href="/search/cs?searchtype=author&query=Tang%2C+Y">Yucheng Tang</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.00596v1-abstract-short" style="display: inline;"> Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.00596v1-abstract-full').style.display = 'inline'; document.getElementById('2407.00596v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.00596v1-abstract-full" style="display: none;"> Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel Hierarchical Adaptive Taxonomy Segmentation (HATs) method, which is designed to thoroughly segment panoramic views of kidney structures by leveraging detailed anatomical insights. 
Our approach entails (1) the innovative HATs technique, which translates spatial relationships among 15 distinct object classes into a versatile "plug-and-play" loss function that spans regions, functional units, and cells, (2) the incorporation of anatomical hierarchies and scale considerations into a unified simple matrix representation for all panoramic entities, and (3) the adoption of the latest AI foundation model (EfficientSAM) as a feature extraction tool to boost the model's adaptability while eliminating the need for the manual prompt generation required by the conventional Segment Anything Model (SAM). Experimental findings demonstrate that the HATs method offers an efficient and effective strategy for integrating clinical insights and imaging precedents into a unified segmentation model across more than 15 categories. The official implementation is publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.00596v1-abstract-full').style.display = 'none'; document.getElementById('2407.00596v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">arXiv admin note: text overlap with arXiv:2402.19286</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.19540</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Weighted Circle Fusion: Ensembling Circle Representation from Different Object Detection Results </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Yue%2C+J">Jialin Yue</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Xiong%2C+J">Juming Xiong</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+H">Haichun Yang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.19540v1-abstract-short" style="display: inline;"> Recently, the use of circle representation has emerged as a method to improve the identification of spherical objects (such as glomeruli, cells, and nuclei) in medical imaging studies. In traditional bounding box-based object detection, combining results from multiple models improves accuracy, especially when real-time processing isn't crucial. 
Unfortunately, this widely adopted strategy is not re… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.19540v1-abstract-full').style.display = 'inline'; document.getElementById('2406.19540v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.19540v1-abstract-full" style="display: none;"> Recently, the use of circle representation has emerged as a method to improve the identification of spherical objects (such as glomeruli, cells, and nuclei) in medical imaging studies. In traditional bounding box-based object detection, combining results from multiple models improves accuracy, especially when real-time processing isn't crucial. Unfortunately, this widely adopted strategy is not readily available for combining circle representations. In this paper, we propose Weighted Circle Fusion (WCF), a simple approach for merging predictions from various circle detection models. Our method leverages confidence scores associated with each proposed bounding circle to generate averaged circles. Our method undergoes thorough evaluation on a proprietary dataset for glomerular detection within whole slide imaging (WSI). The findings reveal a performance gain of 5% compared to existing ensemble methods. Furthermore, the Weighted Circle Fusion technique not only improves the precision of object detection in medical images but also notably decreases false detections, presenting a promising direction for future research and application in pathological image analysis. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.19540v1-abstract-full').style.display = 'none'; document.getElementById('2406.19540v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.16386</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Software Engineering">cs.SE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Wan%2C+Y">Yuxuan Wan</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+C">Chaozheng Wang</a>, <a href="/search/cs?searchtype=author&query=Dong%2C+Y">Yi Dong</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+W">Wenxuan Wang</a>, <a href="/search/cs?searchtype=author&query=Li%2C+S">Shuqing Li</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yintong Huo</a>, <a href="/search/cs?searchtype=author&query=Lyu%2C+M+R">Michael R. Lyu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.16386v2-abstract-short" style="display: inline;"> Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. 
Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.16386v2-abstract-full').style.display = 'inline'; document.getElementById('2406.16386v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.16386v2-abstract-full" style="display: none;"> Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprised of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods. 
To the best of our knowledge, DCGen is the first segment-aware prompt-based approach for generating UI code directly from screenshots. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.16386v2-abstract-full').style.display = 'none'; document.getElementById('2406.16386v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.16360</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Graphics">cs.GR</span> </div> </div> <p class="title is-5 mathjax"> MIRReS: Multi-bounce Inverse Rendering using Reservoir Sampling </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Dai%2C+Y">Yuxin Dai</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+Q">Qi Wang</a>, <a href="/search/cs?searchtype=author&query=Zhu%2C+J">Jingsen Zhu</a>, <a href="/search/cs?searchtype=author&query=Xi%2C+D">Dianbing Xi</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuchi Huo</a>, <a href="/search/cs?searchtype=author&query=Qian%2C+C">Chen Qian</a>, <a href="/search/cs?searchtype=author&query=He%2C+Y">Ying He</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.16360v2-abstract-short" 
style="display: inline;"> We present MIRReS, a novel two-stage inverse rendering framework that jointly reconstructs and optimizes the explicit geometry, material, and lighting from multi-view images. Unlike previous methods that rely on implicit irradiance fields or simplified path tracing algorithms, our method extracts an explicit geometry (triangular mesh) in stage one, and introduces a more realistic physically-based… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.16360v2-abstract-full').style.display = 'inline'; document.getElementById('2406.16360v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.16360v2-abstract-full" style="display: none;"> We present MIRReS, a novel two-stage inverse rendering framework that jointly reconstructs and optimizes the explicit geometry, material, and lighting from multi-view images. Unlike previous methods that rely on implicit irradiance fields or simplified path tracing algorithms, our method extracts an explicit geometry (triangular mesh) in stage one, and introduces a more realistic physically-based inverse rendering model that utilizes multi-bounce path tracing and Monte Carlo integration. By leveraging multi-bounce path tracing, our method effectively estimates indirect illumination, including self-shadowing and internal reflections, which improves the intrinsic decomposition of shape, material, and lighting. Moreover, we incorporate reservoir sampling into our framework to address the noise in Monte Carlo integration, enhancing convergence and facilitating gradient-based optimization with low sample counts. Through qualitative and quantitative evaluation of several scenarios, especially in challenging scenarios with complex shadows, we demonstrate that our method achieves state-of-the-art performance on decomposition results. 
Additionally, our optimized explicit geometry enables applications such as scene editing, relighting, and material editing with modern graphics engines or CAD software. The source code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.16360v2-abstract-full').style.display = 'none'; document.getElementById('2406.16360v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">16 pages, 14 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.15755</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Fine-grained Background Representation for Weakly Supervised Semantic Segmentation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Yin%2C+X">Xu Yin</a>, <a href="/search/cs?searchtype=author&query=Im%2C+W">Woobin Im</a>, <a href="/search/cs?searchtype=author&query=Min%2C+D">Dongbo Min</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuchi Huo</a>, <a href="/search/cs?searchtype=author&query=Pan%2C+F">Fei Pan</a>, <a href="/search/cs?searchtype=author&query=Yoon%2C+S">Sung-Eui 
Yoon</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.15755v1-abstract-short" style="display: inline;"> Generating reliable pseudo masks from image-level labels is challenging in the weakly supervised semantic segmentation (WSSS) task due to the lack of spatial information. Prevalent class activation map (CAM)-based solutions are challenged to discriminate the foreground (FG) objects from the suspicious background (BG) pixels (a.k.a. co-occurring) and learn the integral object regions. This paper pr… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.15755v1-abstract-full').style.display = 'inline'; document.getElementById('2406.15755v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.15755v1-abstract-full" style="display: none;"> Generating reliable pseudo masks from image-level labels is challenging in the weakly supervised semantic segmentation (WSSS) task due to the lack of spatial information. Prevalent class activation map (CAM)-based solutions are challenged to discriminate the foreground (FG) objects from the suspicious background (BG) pixels (a.k.a. co-occurring) and learn the integral object regions. This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse BG semantics and address the co-occurring problems. We abandon using the class prototype or pixel-level features for BG representation. Instead, we develop a novel primitive, negative region of interest (NROI), to capture the fine-grained BG semantic information and conduct the pixel-to-NROI contrast to distinguish the confusing BG pixels. 
We also present an active sampling strategy to mine the FG negatives on-the-fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning to activate the entire object region. Thanks to the simplicity of design and convenience in use, our proposed method can be seamlessly plugged into various models, yielding new state-of-the-art results under various WSSS settings across benchmarks. Leveraging solely image-level (I) labels as supervision, our method achieves 73.2 mIoU and 45.6 mIoU segmentation results on Pascal VOC and MS COCO test sets, respectively. Furthermore, by incorporating saliency maps as an additional supervision signal (I+S), we attain 74.9 mIoU on Pascal VOC test set. Concurrently, our FBR approach demonstrates meaningful performance gains in weakly-supervised instance segmentation (WSIS) tasks, showcasing its robustness and strong generalization capabilities across diverse domains. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.15755v1-abstract-full').style.display = 'none'; document.getElementById('2406.15755v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 22 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.14129</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Multimedia">cs.MM</span> </div> </div> <p class="title is-5 mathjax"> Towards Event-oriented Long Video Understanding </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Du%2C+Y">Yifan Du</a>, <a href="/search/cs?searchtype=author&query=Zhou%2C+K">Kun Zhou</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuqi Huo</a>, <a href="/search/cs?searchtype=author&query=Li%2C+Y">Yifan Li</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+W+X">Wayne Xin Zhao</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+H">Haoyu Lu</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+Z">Zijia Zhao</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+B">Bingning Wang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+W">Weipeng Chen</a>, <a href="/search/cs?searchtype=author&query=Wen%2C+J">Ji-Rong Wen</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.14129v1-abstract-short" style="display: inline;"> With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. 
However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14129v1-abstract-full').style.display = 'inline'; document.getElementById('2406.14129v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.14129v1-abstract-full" style="display: none;"> With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging (VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. 
All code, data, and models are publicly available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.14129v1-abstract-full').style.display = 'none'; document.getElementById('2406.14129v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Work in progress</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.12254</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Enhancing Single-Slice Segmentation with 3D-to-2D Unpaired Scan Distillation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Yu%2C+X">Xin Yu</a>, <a href="/search/cs?searchtype=author&query=Yang%2C+Q">Qi Yang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+H">Han Liu</a>, <a href="/search/cs?searchtype=author&query=Lee%2C+H+H">Ho Hin Lee</a>, <a href="/search/cs?searchtype=author&query=Tang%2C+Y">Yucheng Tang</a>, <a href="/search/cs?searchtype=author&query=Remedios%2C+L+W">Lucas W. Remedios</a>, <a href="/search/cs?searchtype=author&query=Kim%2C+M+E">Michael E. 
Kim</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+R">Rendong Zhang</a>, <a href="/search/cs?searchtype=author&query=Bao%2C+S">Shunxing Bao</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a>, <a href="/search/cs?searchtype=author&query=Moore%2C+A+Z">Ann Zenobia Moore</a>, <a href="/search/cs?searchtype=author&query=Ferrucci%2C+L">Luigi Ferrucci</a>, <a href="/search/cs?searchtype=author&query=Landman%2C+B+A">Bennett A. Landman</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.12254v2-abstract-short" style="display: inline;"> 2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmenta… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.12254v2-abstract-full').style.display = 'inline'; document.getElementById('2406.12254v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.12254v2-abstract-full" style="display: none;"> 2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmentation results. In this work, we propose a novel 3D-to-2D distillation framework, leveraging pre-trained 3D models to enhance 2D single-slice segmentation. 
Specifically, we extract the prediction distribution centroid from the 3D representations, to guide the 2D student by learning intra- and inter-class correlation. Unlike traditional knowledge distillation methods that require the same data input, our approach employs unpaired 3D CT scans with any contrast to guide the 2D student model. Experiments conducted on 707 subjects from the single-slice Baltimore Longitudinal Study of Aging (BLSA) dataset demonstrate that state-of-the-art 2D multi-organ segmentation methods can benefit from the 3D teacher model, achieving enhanced performance in single-slice multi-organ segmentation. Notably, our approach demonstrates considerable efficacy in low-data regimes, outperforming the model trained with all available training subjects even when utilizing only 200 training subjects. Thus, this work underscores the potential to alleviate manual annotation burdens. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.12254v2-abstract-full').style.display = 'none'; document.getElementById('2406.12254v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 18 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.11317</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Human-Computer Interaction">cs.HC</span> </div> </div> <p class="title is-5 mathjax"> GUICourse: From General Vision Language Models to Versatile GUI Agents </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Chen%2C+W">Wentong Chen</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+J">Junbo Cui</a>, <a href="/search/cs?searchtype=author&query=Hu%2C+J">Jinyi Hu</a>, <a href="/search/cs?searchtype=author&query=Qin%2C+Y">Yujia Qin</a>, <a href="/search/cs?searchtype=author&query=Fang%2C+J">Junjie Fang</a>, <a href="/search/cs?searchtype=author&query=Zhao%2C+Y">Yue Zhao</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+C">Chongyi Wang</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jun Liu</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+G">Guirong Chen</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yupeng Huo</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+Y">Yuan Yao</a>, <a href="/search/cs?searchtype=author&query=Lin%2C+Y">Yankai Lin</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+Z">Zhiyuan Liu</a>, <a href="/search/cs?searchtype=author&query=Sun%2C+M">Maosong Sun</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short 
has-text-grey-dark mathjax" id="2406.11317v1-abstract-short" style="display: inline;"> Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (th… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.11317v1-abstract-full').style.display = 'inline'; document.getElementById('2406.11317v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.11317v1-abstract-full" style="display: none;"> Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. 
Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.11317v1-abstract-full').style.display = 'none'; document.getElementById('2406.11317v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.11242</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Accurate and Fast Pixel Retrieval with Spatial and Uncertainty Aware Hypergraph Diffusion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=An%2C+G">Guoyuan An</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuchi Huo</a>, <a href="/search/cs?searchtype=author&query=Yoon%2C+S">Sung-Eui Yoon</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.11242v1-abstract-short" style="display: inline;"> This paper presents a novel method designed to enhance the efficiency and accuracy of both image retrieval and pixel retrieval. Traditional diffusion methods struggle to propagate spatial information effectively in conventional graphs due to their reliance on scalar edge weights. 
To overcome this limitation, we introduce a hypergraph-based framework, uniquely capable of efficiently propagating spa… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.11242v1-abstract-full').style.display = 'inline'; document.getElementById('2406.11242v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.11242v1-abstract-full" style="display: none;"> This paper presents a novel method designed to enhance the efficiency and accuracy of both image retrieval and pixel retrieval. Traditional diffusion methods struggle to propagate spatial information effectively in conventional graphs due to their reliance on scalar edge weights. To overcome this limitation, we introduce a hypergraph-based framework, uniquely capable of efficiently propagating spatial information using local features during query time, thereby accurately retrieving and localizing objects within a database. Additionally, we innovatively utilize the structural information of the image graph through a technique we term "community selection". This approach allows for the assessment of the initial search result's uncertainty and facilitates an optimal balance between accuracy and speed. This is particularly crucial in real-world applications where such trade-offs are often necessary. Our experimental results, conducted on the (P)ROxford and (P)RParis datasets, demonstrate the significant superiority of our method over existing diffusion techniques. We achieve state-of-the-art (SOTA) accuracy in both image-level and pixel-level retrieval, while also maintaining impressive processing speed. This dual achievement underscores the effectiveness of our hypergraph-based framework and community selection technique, marking a notable advancement in the field of content-based image retrieval. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.11242v1-abstract-full').style.display = 'none'; document.getElementById('2406.11242v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.09367</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Zhao%2C+Z">Zijia Zhao</a>, <a href="/search/cs?searchtype=author&query=Lu%2C+H">Haoyu Lu</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuqi Huo</a>, <a href="/search/cs?searchtype=author&query=Du%2C+Y">Yifan Du</a>, <a href="/search/cs?searchtype=author&query=Yue%2C+T">Tongtian Yue</a>, <a href="/search/cs?searchtype=author&query=Guo%2C+L">Longteng Guo</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+B">Bingning Wang</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+W">Weipeng Chen</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+J">Jing Liu</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.09367v2-abstract-short" style="display: inline;"> Video understanding is a crucial next step for multimodal large language models (MLLMs). 
Various benchmarks are introduced for better evaluating the MLLMs. Nevertheless, current video benchmarks are still inefficient for evaluating video models during iterative development due to the high cost of constructing datasets and the difficulty in isolating specific skills. In this paper, we propose Video… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.09367v2-abstract-full').style.display = 'inline'; document.getElementById('2406.09367v2-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.09367v2-abstract-full" style="display: none;"> Video understanding is a crucial next step for multimodal large language models (MLLMs). Various benchmarks are introduced for better evaluating the MLLMs. Nevertheless, current video benchmarks are still inefficient for evaluating video models during iterative development due to the high cost of constructing datasets and the difficulty in isolating specific skills. In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. VideoNIAH decouples video content from their query-responses by inserting unrelated visual 'needles' into original videos. The framework automates the generation of query-response pairs using predefined rules, minimizing manual labor. The queries focus on specific aspects of video understanding, enabling more skill-specific evaluations. The separation between video content and the queries also allows for increased video variety and evaluations across different lengths. Utilizing VideoNIAH, we compile a video benchmark VNBench, which includes tasks such as retrieval, ordering, and counting to evaluate three key aspects of video understanding: temporal perception, chronological ordering, and spatio-temporal coherence. 
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities across various tasks. Additionally, we perform an in-depth analysis of the test results and model configurations. Based on these findings, we provide some advice for improving video MLLM training, offering valuable insights to guide future research and model development. The code and data are available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.09367v2-abstract-full').style.display = 'none'; document.getElementById('2406.09367v2-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 13 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.02430</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Audio and Speech Processing">eess.AS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Sound">cs.SD</span> </div> </div> <p class="title is-5 mathjax"> Seed-TTS: A Family of High-Quality Versatile Speech Generation Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Anastassiou%2C+P">Philip Anastassiou</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+J">Jiawei Chen</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+J">Jitong Chen</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Y">Yuanzhe Chen</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Z">Zhuo Chen</a>, <a href="/search/cs?searchtype=author&query=Chen%2C+Z">Ziyi Chen</a>, <a href="/search/cs?searchtype=author&query=Cong%2C+J">Jian Cong</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+L">Lelai Deng</a>, <a href="/search/cs?searchtype=author&query=Ding%2C+C">Chuang Ding</a>, <a href="/search/cs?searchtype=author&query=Gao%2C+L">Lu Gao</a>, <a href="/search/cs?searchtype=author&query=Gong%2C+M">Mingqing Gong</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+P">Peisong Huang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+Q">Qingqing Huang</a>, <a href="/search/cs?searchtype=author&query=Huang%2C+Z">Zhiying Huang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuanyuan Huo</a>, <a href="/search/cs?searchtype=author&query=Jia%2C+D">Dongya Jia</a>, <a href="/search/cs?searchtype=author&query=Li%2C+C">Chumin Li</a>, <a href="/search/cs?searchtype=author&query=Li%2C+F">Feiya Li</a>, <a href="/search/cs?searchtype=author&query=Li%2C+H">Hui Li</a>, <a 
href="/search/cs?searchtype=author&query=Li%2C+J">Jiaxin Li</a>, <a href="/search/cs?searchtype=author&query=Li%2C+X">Xiaoyang Li</a>, <a href="/search/cs?searchtype=author&query=Li%2C+X">Xingxing Li</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+L">Lin Liu</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+S">Shouda Liu</a>, <a href="/search/cs?searchtype=author&query=Liu%2C+S">Sichao Liu</a> , et al. (21 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.02430v1-abstract-short" style="display: inline;"> We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.02430v1-abstract-full').style.display = 'inline'; document.getElementById('2406.02430v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.02430v1-abstract-full" style="display: none;"> We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. 
Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at \url{}. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.02430v1-abstract-full').style.display = 'none'; document.getElementById('2406.02430v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.17824</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> mTREE: Multi-Level Text-Guided Representation End-to-End Learning for Whole Slide Image Analysis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Deng%2C+R">Ruining Deng</a>, <a href="/search/cs?searchtype=author&query=Cui%2C+C">Can Cui</a>, <a href="/search/cs?searchtype=author&query=Yao%2C+T">Tianyuan Yao</a>, <a href="/search/cs?searchtype=author&query=Nath%2C+V">Vishwesh Nath</a>, <a href="/search/cs?searchtype=author&query=Tang%2C+Y">Yucheng Tang</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.17824v1-abstract-short" style="display: inline;"> Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging, particularly with large, high-resolution images like gigapixel Whole Slide Images (WSIs). 
Current methods typically rely on manual region labeling or multi-stage learning to assemble local representations (e.g., patch-level) into global features (e.g.,… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.17824v1-abstract-full').style.display = 'inline'; document.getElementById('2405.17824v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.17824v1-abstract-full" style="display: none;"> Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging, particularly with large, high-resolution images like gigapixel Whole Slide Images (WSIs). Current methods typically rely on manual region labeling or multi-stage learning to assemble local representations (e.g., patch-level) into global features (e.g., slide-level). However, there is no effective way to integrate multi-scale image representations with text data in a seamless end-to-end process. In this study, we introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE). This novel text-guided approach effectively captures multi-scale WSI representations by utilizing information from accompanying textual pathology information. mTREE innovatively combines - the localization of key areas (global-to-local) and the development of a WSI-level image-text representation (local-to-global) - into a unified, end-to-end learning framework. In this model, textual information serves a dual purpose: firstly, functioning as an attention map to accurately identify key areas, and secondly, acting as a conduit for integrating textual features into the comprehensive representation of the image. Our study demonstrates the effectiveness of mTREE through quantitative analyses in two image-related tasks: classification and survival prediction, showcasing its remarkable superiority over baselines. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.17824v1-abstract-full').style.display = 'none'; document.getElementById('2405.17824v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.17568</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> </div> </div> <p class="title is-5 mathjax"> ExtremeMETA: High-speed Lightweight Image Segmentation Model by Remodeling Multi-channel Metamaterial Imagers </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Liu%2C+Q">Quan Liu</a>, <a href="/search/cs?searchtype=author&query=Swartz%2C+B+T">Brandon T. Swartz</a>, <a href="/search/cs?searchtype=author&query=Kravchenko%2C+I">Ivan Kravchenko</a>, <a href="/search/cs?searchtype=author&query=Valentine%2C+J+G">Jason G. Valentine</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yuankai Huo</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.17568v1-abstract-short" style="display: inline;"> Deep neural networks (DNNs) have heavily relied on traditional computational units like CPUs and GPUs. However, this conventional approach brings significant computational burdens, latency issues, and high power consumption, limiting their effectiveness. This has sparked the need for lightweight networks like ExtremeC3Net. 
On the other hand, there have been notable advancements in optical computat… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.17568v1-abstract-full').style.display = 'inline'; document.getElementById('2405.17568v1-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.17568v1-abstract-full" style="display: none;"> Deep neural networks (DNNs) have heavily relied on traditional computational units like CPUs and GPUs. However, this conventional approach brings significant computational burdens, latency issues, and high power consumption, limiting their effectiveness. This has sparked the need for lightweight networks like ExtremeC3Net. On the other hand, there have been notable advancements in optical computational units, particularly with metamaterials, offering the exciting prospect of energy-efficient neural networks operating at the speed of light. Yet, the digital design of metamaterial neural networks (MNNs) faces challenges such as precision, noise, and bandwidth, limiting their application to intuitive tasks and low-resolution images. In this paper, we propose a large kernel lightweight segmentation model, ExtremeMETA. Based on the ExtremeC3Net, the ExtremeMETA maximizes the ability of the first convolution layer by exploring a larger convolution kernel and multiple processing paths. With the proposed large kernel convolution model, we extend the optic neural network application boundary to the segmentation task. To further lighten the computation burden of the digital processing part, a set of model compression methods is applied to improve model efficiency in the inference stage. The experimental results on three publicly available datasets demonstrate that the optimized efficient design improved segmentation performance from 92.45 to 95.97 on mIoU while reducing computational FLOPs from 461.07 MMacs to 166.03 MMacs. 
The proposed large kernel lightweight model ExtremeMETA showcases the hybrid design's ability on complex tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.17568v1-abstract-full').style.display = 'none'; document.getElementById('2405.17568v1-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.16141</a> <span> [<a href="">pdf</a>, <a href="">other</a>] </span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computational Engineering, Finance, and Science">cs.CE</span> </div> </div> <p class="title is-5 mathjax"> AIGB: Generative Auto-bidding via Conditional Diffusion Modeling </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&query=Guo%2C+J">Jiayan Guo</a>, <a href="/search/cs?searchtype=author&query=Huo%2C+Y">Yusen Huo</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Z">Zhilin Zhang</a>, <a href="/search/cs?searchtype=author&query=Wang%2C+T">Tianyu Wang</a>, <a href="/search/cs?searchtype=author&query=Yu%2C+C">Chuan Yu</a>, <a href="/search/cs?searchtype=author&query=Xu%2C+J">Jian Xu</a>, <a href="/search/cs?searchtype=author&query=Zhang%2C+Y">Yan Zhang</a>, <a href="/search/cs?searchtype=author&query=Zheng%2C+B">Bo Zheng</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span 
class="abstract-short has-text-grey-dark mathjax" id="2405.16141v4-abstract-short" style="display: inline;"> Auto-bidding plays a crucial role in facilitating online advertising by automatically providing bids for advertisers. Reinforcement learning (RL) has gained popularity for auto-bidding. However, most current RL auto-bidding methods are modeled through the Markovian Decision Process (MDP), which assumes the Markovian state transition. This assumption restricts the ability to perform in long horizon… <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.16141v4-abstract-full').style.display = 'inline'; document.getElementById('2405.16141v4-abstract-short').style.display = 'none';">▽ More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.16141v4-abstract-full" style="display: none;"> Auto-bidding plays a crucial role in facilitating online advertising by automatically providing bids for advertisers. Reinforcement learning (RL) has gained popularity for auto-bidding. However, most current RL auto-bidding methods are modeled through the Markovian Decision Process (MDP), which assumes the Markovian state transition. This assumption restricts the ability to perform in long horizon scenarios and makes the model unstable when dealing with highly random online advertising environments. To tackle this issue, this paper introduces AI-Generated Bidding (AIGB), a novel paradigm for auto-bidding through generative modeling. In this paradigm, we propose DiffBid, a conditional diffusion modeling approach for bid generation. DiffBid directly models the correlation between the return and the entire trajectory, effectively avoiding error propagation across time steps in long horizons. Additionally, DiffBid offers a versatile approach for generating trajectories that maximize given targets while adhering to specific constraints. 
Extensive experiments conducted on the real-world dataset and an online A/B test on the Alibaba advertising platform demonstrate the effectiveness of DiffBid, achieving a 2.81% increase in GMV and a 3.36% increase in ROI. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.16141v4-abstract-full').style.display = 'none'; document.getElementById('2405.16141v4-abstract-short').style.display = 'inline';">△ Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 8 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 25 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted by KDD 2024</span> </p> </li> </ol> <nav class="pagination is-small is-centered 