" aria-label="Page 5" aria-current="page">5 </a> </li> </ul> </nav> <ol class="breathe-horizontal" start="1"> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.12136</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Visualizing Loss Functions as Topological Landscape Profiles </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Geniesse%2C+C">Caleb Geniesse</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jiaqing Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Xie%2C+T">Tiankai Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Shi%2C+G">Ge Shi</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Morozov%2C+D">Dmitriy Morozov</a>, <a href="/search/cs?searchtype=author&amp;query=Perciano%2C+T">Talita Perciano</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Maciejewski%2C+R">Ross Maciejewski</a>, <a href="/search/cs?searchtype=author&amp;query=Weber%2C+G+H">Gunther H. Weber</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.12136v1-abstract-short" style="display: inline;"> In machine learning, a loss function measures the difference between model predictions and ground-truth (or target) values. 
For neural network models, visualizing how this loss changes as model parameters are varied can provide insights into the local structure of the so-called loss landscape (e.g., smoothness) as well as global properties of the underlying model (e.g., generalization performance)&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12136v1-abstract-full').style.display = 'inline'; document.getElementById('2411.12136v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.12136v1-abstract-full" style="display: none;"> In machine learning, a loss function measures the difference between model predictions and ground-truth (or target) values. For neural network models, visualizing how this loss changes as model parameters are varied can provide insights into the local structure of the so-called loss landscape (e.g., smoothness) as well as global properties of the underlying model (e.g., generalization performance). While various methods for visualizing the loss landscape have been proposed, many approaches limit sampling to just one or two directions, ignoring potentially relevant information in this extremely high-dimensional space. This paper introduces a new representation based on topological data analysis that enables the visualization of higher-dimensional loss landscapes. After describing this new topological landscape profile representation, we show how the shape of loss landscapes can reveal new details about model performance and learning dynamics, highlighting several use cases, including image segmentation (e.g., UNet) and scientific machine learning (e.g., physics-informed neural networks). 
Through these examples, we provide new insights into how loss landscapes vary across distinct hyperparameter spaces: we find that the topology of the loss landscape is simpler for better-performing models; and we observe greater variation in the shape of loss landscapes near transitions from low to high model performance. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.12136v1-abstract-full').style.display = 'none'; document.getElementById('2411.12136v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.09807</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Evaluating Loss Landscapes from a Topology Perspective </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Xie%2C+T">Tiankai Xie</a>, <a href="/search/cs?searchtype=author&amp;query=Geniesse%2C+C">Caleb Geniesse</a>, <a href="/search/cs?searchtype=author&amp;query=Chen%2C+J">Jiaqing Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Morozov%2C+D">Dmitriy Morozov</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Maciejewski%2C+R">Ross Maciejewski</a>, <a href="/search/cs?searchtype=author&amp;query=Weber%2C+G+H">Gunther H. Weber</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.09807v1-abstract-short" style="display: inline;"> Characterizing the loss of a neural network with respect to model parameters, i.e., the loss landscape, can provide valuable insights into properties of that model. Various methods for visualizing loss landscapes have been proposed, but less emphasis has been placed on quantifying and extracting actionable and reproducible insights from these complex representations. Inspired by powerful tools fro&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09807v1-abstract-full').style.display = 'inline'; document.getElementById('2411.09807v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.09807v1-abstract-full" style="display: none;"> Characterizing the loss of a neural network with respect to model parameters, i.e., the loss landscape, can provide valuable insights into properties of that model. Various methods for visualizing loss landscapes have been proposed, but less emphasis has been placed on quantifying and extracting actionable and reproducible insights from these complex representations. Inspired by powerful tools from topological data analysis (TDA) for summarizing the structure of high-dimensional data, here we characterize the underlying shape (or topology) of loss landscapes, quantifying the topology to reveal new insights about neural networks. 
To relate our findings to the machine learning (ML) literature, we compute simple performance metrics (e.g., accuracy, error), and we characterize the local structure of loss landscapes using Hessian-based metrics (e.g., largest eigenvalue, trace, eigenvalue spectral density). Following this approach, we study established models from image pattern recognition (e.g., ResNets) and scientific ML (e.g., physics-informed neural networks), and we show how quantifying the shape of loss landscapes can provide new insights into model performance and learning dynamics. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09807v1-abstract-full').style.display = 'none'; document.getElementById('2411.09807v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.09688</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Squeezed Attention: Accelerating Long Context Length LLM Inference </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Hooper%2C+C">Coleman Hooper</a>, <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Mohammadzadeh%2C+H">Hiva Mohammadzadeh</a>, <a href="/search/cs?searchtype=author&amp;query=Maheswaran%2C+M">Monishwaran Maheswaran</a>, <a href="/search/cs?searchtype=author&amp;query=Paik%2C+J">June Paik</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.09688v1-abstract-short" style="display: inline;"> Emerging Large Language Model (LLM) applications require long input prompts to perform complex downstream tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. 
However, for many of these applications,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09688v1-abstract-full').style.display = 'inline'; document.getElementById('2411.09688v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.09688v1-abstract-full" style="display: none;"> Emerging Large Language Model (LLM) applications require long input prompts to perform complex downstream tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations to process user inputs quickly, as they are received. In this work, we propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which of the keys from the fixed context are semantically relevant and need to be loaded during inference. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs. We also extend our method to use a hierarchical centroid lookup to identify important keys, which can reduce the complexity of attention from linear to logarithmic with respect to the context length. 
We implement optimized Triton kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4x speedups during both the prefill and generation phases for long-context inference. Furthermore, we have extensively evaluated our method on various long-context benchmarks including LongBench, where it achieves a 3x reduction in KV cache budget without accuracy loss and up to an 8x reduction with &lt;0.5 point accuracy gap for various models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.09688v1-abstract-full').style.display = 'none'; document.getElementById('2411.09688v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.05852</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> $\spadesuit$ SPADE $\spadesuit$ Split Peak Attention DEcomposition </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Wolff%2C+M">Malcolm Wolff</a>, <a href="/search/cs?searchtype=author&amp;query=Olivares%2C+K+G">Kin G. 
Olivares</a>, <a href="/search/cs?searchtype=author&amp;query=Oreshkin%2C+B">Boris Oreshkin</a>, <a href="/search/cs?searchtype=author&amp;query=Ruan%2C+S">Sunny Ruan</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+S">Sitan Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Katoch%2C+A">Abhinav Katoch</a>, <a href="/search/cs?searchtype=author&amp;query=Ramasubramanian%2C+S">Shankar Ramasubramanian</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+Y">Youxin Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Efimov%2C+D">Dmitry Efimov</a>, <a href="/search/cs?searchtype=author&amp;query=Quenneville-B%C3%A9lair%2C+V">Vincent Quenneville-Bélair</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.05852v1-abstract-short" style="display: inline;"> Demand forecasting faces challenges induced by Peak Events (PEs) corresponding to special periods such as promotions and holidays. Peak events create significant spikes in demand followed by demand ramp down periods. Neural networks like MQCNN and MQT overreact to demand peaks by carrying over the elevated PE demand into subsequent Post-Peak-Event (PPE) periods, resulting in significantly over-bia&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05852v1-abstract-full').style.display = 'inline'; document.getElementById('2411.05852v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.05852v1-abstract-full" style="display: none;"> Demand forecasting faces challenges induced by Peak Events (PEs) corresponding to special periods such as promotions and holidays. Peak events create significant spikes in demand followed by demand ramp down periods. 
Neural networks like MQCNN and MQT overreact to demand peaks by carrying over the elevated PE demand into subsequent Post-Peak-Event (PPE) periods, resulting in significantly over-biased forecasts. To tackle this challenge, we introduce a neural forecasting model called Split Peak Attention DEcomposition, SPADE. This model reduces the impact of PEs on subsequent forecasts by modeling forecasting as consisting of two separate tasks: one for PEs; and the other for the rest. Its architecture then uses masked convolution filters and a specialized Peak Attention module. We show SPADE&#39;s performance on a worldwide retail dataset with hundreds of millions of products. Our results reveal a reduction in PPE degradation by 4.5% and an improvement in PE accuracy by 3.9%, relative to current production models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.05852v1-abstract-full').style.display = 'none'; document.getElementById('2411.05852v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> In 38th Conference on Neural Information Processing Systems, Time Series in the Age of Large Models Workshop, 2024 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2411.00328</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> How many classifiers do we need? </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kim%2C+H">Hyunsuk Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Hodgkinson%2C+L">Liam Hodgkinson</a>, <a href="/search/cs?searchtype=author&amp;query=Theisen%2C+R">Ryan Theisen</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2411.00328v1-abstract-short" style="display: inline;"> As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined to improve accuracy. 
In this paper, we provide a detailed analysis of how the disagreement and the polarization (a notion we introduce and define in this paper) among classifiers relate to the pe&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00328v1-abstract-full').style.display = 'inline'; document.getElementById('2411.00328v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2411.00328v1-abstract-full" style="display: none;"> As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined to improve accuracy. In this paper, we provide a detailed analysis of how the disagreement and the polarization (a notion we introduce and define in this paper) among classifiers relate to the performance gain achieved by aggregating individual classifiers, for majority vote strategies in classification tasks. We address these questions in the following ways. (1) An upper bound for polarization is derived, and we propose what we call a neural polarization law: most interpolating neural network models are 4/3-polarized. Our empirical results not only support this conjecture but also show that polarization is nearly constant for a dataset, regardless of hyperparameters or architectures of classifiers. (2) The error of the majority vote classifier is considered under restricted entropy conditions, and we present a tight upper bound that indicates that the disagreement is linearly correlated with the target, and that the slope is linear in the polarization. (3) We prove results for the asymptotic behavior of the disagreement in terms of the number of classifiers, which we show can help in predicting the performance for a larger number of classifiers from that of a smaller number. 
Our theories and claims are supported by empirical results on several image classification tasks with various types of neural networks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2411.00328v1-abstract-full').style.display = 'none'; document.getElementById('2411.00328v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 31 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.10912</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lu%2C+H">Haiquan Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yefan Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+S">Shiwei Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Z">Zhangyang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.10912v1-abstract-short" style="display: inline;"> Recent work on pruning large language models (LLMs) has shown that one can eliminate a large number of parameters without compromising performance, making pruning a promising strategy to reduce LLM model size. Existing LLM pruning strategies typically assign uniform pruning ratios across layers, limiting overall pruning ability; and recent work on layerwise pruning of LLMs is often based on heuris&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10912v1-abstract-full').style.display = 'inline'; document.getElementById('2410.10912v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.10912v1-abstract-full" style="display: none;"> Recent work on pruning large language models (LLMs) has shown that one can eliminate a large number of parameters without compromising performance, making pruning a promising strategy to reduce LLM model size. Existing LLM pruning strategies typically assign uniform pruning ratios across layers, limiting overall pruning ability; and recent work on layerwise pruning of LLMs is often based on heuristics that can easily lead to suboptimal performance. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory, in particular the shape of empirical spectral densities (ESDs) of weight matrices, to design improved layerwise pruning ratios for LLMs. Our analysis reveals a wide variability in how well-trained, and thus relatedly how prunable, different layers of an LLM are. 
Based on this, we propose AlphaPruning, which uses shape metrics to allocate layerwise sparsity ratios in a more theoretically principled manner. AlphaPruning can be used in conjunction with multiple existing LLM pruning methods. Our empirical results show that AlphaPruning prunes LLaMA-7B to 80% sparsity while maintaining reasonable perplexity, marking a first in the literature on LLMs. We have open-sourced our code at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.10912v1-abstract-full').style.display = 'none'; document.getElementById('2410.10912v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2024, first two authors contributed equally</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.03229</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Elucidating the Design Choice of Probability Paths in Flow Matching for Forecasting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lim%2C+S+H">Soon Hoe Lim</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yijin Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Yu%2C+A">Annan Yu</a>, <a 
href="/search/cs?searchtype=author&amp;query=Hart%2C+E">Emma Hart</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+X+S">Xiaoye S. Li</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. Benjamin Erichson</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.03229v1-abstract-short" style="display: inline;"> Flow matching has recently emerged as a powerful paradigm for generative modeling and has been extended to probabilistic time series forecasting in latent spaces. However, the impact of the specific choice of probability path model on forecasting performance remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the sele&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.03229v1-abstract-full').style.display = 'inline'; document.getElementById('2410.03229v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.03229v1-abstract-full" style="display: none;"> Flow matching has recently emerged as a powerful paradigm for generative modeling and has been extended to probabilistic time series forecasting in latent spaces. However, the impact of the specific choice of probability path model on forecasting performance remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the selection of the probability path model. Motivated by this insight, we propose a novel probability path model designed to improve forecasting performance. 
Our empirical results across various dynamical system benchmarks show that our model achieves faster convergence during training and improved predictive performance compared to existing probability path models. Importantly, our approach is efficient during inference, requiring only a few sampling steps. This makes our proposed model practical for real-world applications and opens new avenues for probabilistic forecasting. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.03229v1-abstract-full').style.display = 'none'; document.getElementById('2410.03229v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">30 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.02159</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Mitigating Memorization In Language Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Sakarvadia%2C+M">Mansi Sakarvadia</a>, <a href="/search/cs?searchtype=author&amp;query=Ajith%2C+A">Aswathy Ajith</a>, <a 
href="/search/cs?searchtype=author&amp;query=Khan%2C+A">Arham Khan</a>, <a href="/search/cs?searchtype=author&amp;query=Hudson%2C+N">Nathaniel Hudson</a>, <a href="/search/cs?searchtype=author&amp;query=Geniesse%2C+C">Caleb Geniesse</a>, <a href="/search/cs?searchtype=author&amp;query=Chard%2C+K">Kyle Chard</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Foster%2C+I">Ian Foster</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.02159v1-abstract-short" style="display: inline;"> Language models (LMs) can &#34;memorize&#34; information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three finetuning-bas&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02159v1-abstract-full').style.display = 'inline'; document.getElementById('2410.02159v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.02159v1-abstract-full" style="display: none;"> Language models (LMs) can &#34;memorize&#34; information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. 
In this work, we investigate methods to mitigate memorization: three regularizer-based, three finetuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally-efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we determine via experiment that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02159v1-abstract-full').style.display = 'none'; document.getElementById('2410.02159v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2410.02035</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Tuning Frequency Bias of State Space Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yu%2C+A">Annan Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Lyu%2C+D">Dongwei Lyu</a>, <a href="/search/cs?searchtype=author&amp;query=Lim%2C+S+H">Soon Hoe Lim</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. Benjamin Erichson</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2410.02035v1-abstract-short" style="display: inline;"> State space models (SSMs) leverage linear, time-invariant (LTI) systems to effectively learn sequences with long-range dependencies. By analyzing the transfer functions of LTI systems, we find that SSMs exhibit an implicit bias toward capturing low-frequency components more effectively than high-frequency ones. 
This behavior aligns with the broader notion of frequency bias in deep learning model t&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02035v1-abstract-full').style.display = 'inline'; document.getElementById('2410.02035v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2410.02035v1-abstract-full" style="display: none;"> State space models (SSMs) leverage linear, time-invariant (LTI) systems to effectively learn sequences with long-range dependencies. By analyzing the transfer functions of LTI systems, we find that SSMs exhibit an implicit bias toward capturing low-frequency components more effectively than high-frequency ones. This behavior aligns with the broader notion of frequency bias in deep learning model training. We show that the initialization of an SSM assigns it an innate frequency bias and that training the model in a conventional way does not alter this bias. Based on our theory, we propose two mechanisms to tune frequency bias: either by scaling the initialization to tune the inborn frequency bias; or by applying a Sobolev-norm-based filter to adjust the sensitivity of the gradients to high-frequency inputs, which allows us to change the frequency bias via training. Using an image-denoising task, we empirically show that we can strengthen, weaken, or even reverse the frequency bias using both mechanisms. By tuning the frequency bias, we can also improve SSMs&#39; performance on learning long-range sequences, averaging an 88.26% accuracy on the Long-Range Arena (LRA) benchmark tasks. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2410.02035v1-abstract-full').style.display = 'none'; document.getElementById('2410.02035v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2409.15734</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Optimization and Control">math.OC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computation">stat.CO</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Trust-Region Sequential Quadratic Programming for Stochastic Optimization with Random Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Fang%2C+Y">Yuchen Fang</a>, <a href="/search/cs?searchtype=author&amp;query=Na%2C+S">Sen Na</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Kolar%2C+M">Mladen Kolar</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2409.15734v2-abstract-short" style="display: inline;"> In this work, we consider solving optimization problems with a stochastic objective and deterministic equality constraints. We propose a Trust-Region Sequential Quadratic Programming method to find both first- and second-order stationary points. Our method utilizes a random model to represent the objective function, which is constructed from stochastic observations of the objective and is designed&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.15734v2-abstract-full').style.display = 'inline'; document.getElementById('2409.15734v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2409.15734v2-abstract-full" style="display: none;"> In this work, we consider solving optimization problems with a stochastic objective and deterministic equality constraints. We propose a Trust-Region Sequential Quadratic Programming method to find both first- and second-order stationary points. Our method utilizes a random model to represent the objective function, which is constructed from stochastic observations of the objective and is designed to satisfy proper adaptive accuracy conditions with a high but fixed probability. To converge to first-order stationary points, our method computes a gradient step in each iteration defined by minimizing a quadratic approximation of the objective subject to a (relaxed) linear approximation of the problem constraints and a trust-region constraint. 
To converge to second-order stationary points, our method additionally computes an eigen step to explore the negative curvature of the reduced Hessian matrix, as well as a second-order correction step to address the potential Maratos effect, which arises due to the nonlinearity of the problem constraints. Such an effect may impede the method from moving away from saddle points. Both gradient and eigen step computations leverage a novel parameter-free decomposition of the step and the trust-region radius, accounting for the proportions among the feasibility residual, optimality residual, and negative curvature. We establish global almost sure first- and second-order convergence guarantees for our method, and present computational results on CUTEst problems, regression problems, and saddle-point problems to demonstrate its superiority over existing line-search-based stochastic methods. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2409.15734v2-abstract-full').style.display = 'none'; document.getElementById('2409.15734v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 26 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 September, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">41 pages, 3 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.15089</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Geophysics">physics.geo-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Learning Physics for Unveiling Hidden Earthquake Ground Motions via Conditional Generative Modeling </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Pu Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Nakata%2C+R">Rie Nakata</a>, <a href="/search/cs?searchtype=author&amp;query=Lacour%2C+M">Maxime Lacour</a>, <a href="/search/cs?searchtype=author&amp;query=Naiman%2C+I">Ilan Naiman</a>, <a href="/search/cs?searchtype=author&amp;query=Nakata%2C+N">Nori Nakata</a>, <a href="/search/cs?searchtype=author&amp;query=Song%2C+J">Jialin Song</a>, <a href="/search/cs?searchtype=author&amp;query=Bi%2C+Z">Zhengfa Bi</a>, <a href="/search/cs?searchtype=author&amp;query=Malik%2C+O+A">Osman Asif Malik</a>, <a href="/search/cs?searchtype=author&amp;query=Morozov%2C+D">Dmitriy Morozov</a>, <a href="/search/cs?searchtype=author&amp;query=Azencot%2C+O">Omri Azencot</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. Benjamin Erichson</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.15089v1-abstract-short" style="display: inline;"> Predicting high-fidelity ground motions for future earthquakes is crucial for seismic hazard assessment and infrastructure resilience. Conventional empirical simulations suffer from sparse sensor distribution and geographically localized earthquake locations, while physics-based methods are computationally intensive and require accurate representations of Earth structures and earthquake sources. W&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.15089v1-abstract-full').style.display = 'inline'; document.getElementById('2407.15089v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.15089v1-abstract-full" style="display: none;"> Predicting high-fidelity ground motions for future earthquakes is crucial for seismic hazard assessment and infrastructure resilience. Conventional empirical simulations suffer from sparse sensor distribution and geographically localized earthquake locations, while physics-based methods are computationally intensive and require accurate representations of Earth structures and earthquake sources. We propose a novel artificial intelligence (AI) simulator, Conditional Generative Modeling for Ground Motion (CGM-GM), to synthesize high-frequency and spatially continuous earthquake ground motion waveforms. CGM-GM leverages earthquake magnitudes and geographic coordinates of earthquakes and sensors as inputs, learning complex wave physics and Earth heterogeneities, without explicit physics constraints. This is achieved through a probabilistic autoencoder that captures latent distributions in the time-frequency domain and variational sequential models for prior and posterior distributions. 
We evaluate the performance of CGM-GM using small-magnitude earthquake records from the San Francisco Bay Area, a region with high seismic risks. CGM-GM demonstrates a strong potential for outperforming a state-of-the-art non-ergodic empirical ground motion model and shows great promise in seismology and beyond. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.15089v1-abstract-full').style.display = 'none'; document.getElementById('2407.15089v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.14129</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Comparing and Contrasting Deep Learning Weather Prediction Backbones on Navier-Stokes and Atmospheric Dynamics </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Karlbauer%2C+M">Matthias Karlbauer</a>, <a href="/search/cs?searchtype=author&amp;query=Maddix%2C+D+C">Danielle C. Maddix</a>, <a href="/search/cs?searchtype=author&amp;query=Ansari%2C+A+F">Abdul Fatir Ansari</a>, <a href="/search/cs?searchtype=author&amp;query=Han%2C+B">Boran Han</a>, <a href="/search/cs?searchtype=author&amp;query=Gupta%2C+G">Gaurav Gupta</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuyang Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Stuart%2C+A">Andrew Stuart</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.14129v2-abstract-short" style="display: inline;"> Remarkable progress in the development of Deep Learning Weather Prediction (DLWP) models positions them to become competitive with traditional numerical weather prediction (NWP) models. Indeed, a wide number of DLWP architectures -- based on various backbones, including U-Net, Transformer, Graph Neural Network (GNN), and Fourier Neural Operator (FNO) -- have demonstrated their potential at forecas&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.14129v2-abstract-full').style.display = 'inline'; document.getElementById('2407.14129v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.14129v2-abstract-full" style="display: none;"> Remarkable progress in the development of Deep Learning Weather Prediction (DLWP) models positions them to become competitive with traditional numerical weather prediction (NWP) models. Indeed, a wide number of DLWP architectures -- based on various backbones, including U-Net, Transformer, Graph Neural Network (GNN), and Fourier Neural Operator (FNO) -- have demonstrated their potential at forecasting atmospheric states. However, due to differences in training protocols, forecast horizons, and data choices, it remains unclear which (if any) of these methods and architectures are most suitable for weather forecasting and for future model development. Here, we step back and provide a detailed empirical analysis, under controlled conditions, comparing and contrasting the most prominent DLWP models, along with their backbones. We accomplish this by predicting synthetic two-dimensional incompressible Navier-Stokes and real-world global weather dynamics. 
In terms of accuracy, memory consumption, and runtime, our results illustrate various tradeoffs. For example, on synthetic data, we observe favorable performance of FNO; and on the real-world WeatherBench dataset, our results demonstrate the suitability of ConvLSTM and SwinTransformer for short-to-mid-ranged forecasts. For long-ranged weather rollouts of up to 365 days, we observe superior stability and physical soundness in architectures that formulate a spherical data representation, i.e., GraphCast and Spherical FNO. In addition, we observe that all of these model backbones &#34;saturate,&#34; i.e., none of them exhibit so-called neural scaling, which highlights an important direction for future work on these and related models. The code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.14129v2-abstract-full').style.display = 'none'; document.getElementById('2407.14129v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 19 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2407.12996</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Sharpness-diversity tradeoff: improving flat ensembles with SharpBalance </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lu%2C+H">Haiquan Lu</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+X">Xiaotian Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yefan Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+Q">Qunli Li</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+Y">Yujun Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+H">Huanrui Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2407.12996v1-abstract-short" style="display: inline;"> Recent studies on deep ensembles have identified the sharpness of the local minima of individual learners and the diversity of the ensemble members as key factors in improving test-time performance. 
Building on this, our study investigates the interplay between sharpness and diversity within deep ensembles, illustrating their crucial role in robust generalization to both in-distribution (ID) and o&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.12996v1-abstract-full').style.display = 'inline'; document.getElementById('2407.12996v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2407.12996v1-abstract-full" style="display: none;"> Recent studies on deep ensembles have identified the sharpness of the local minima of individual learners and the diversity of the ensemble members as key factors in improving test-time performance. Building on this, our study investigates the interplay between sharpness and diversity within deep ensembles, illustrating their crucial role in robust generalization to both in-distribution (ID) and out-of-distribution (OOD) data. We discover a trade-off between sharpness and diversity: minimizing the sharpness in the loss landscape tends to diminish the diversity of individual members within the ensemble, adversely affecting the ensemble&#39;s improvement. The trade-off is justified through our theoretical analysis and verified empirically through extensive experiments. To address the issue of reduced diversity, we introduce SharpBalance, a novel training approach that balances sharpness and diversity within ensembles. Theoretically, we show that our training strategy achieves a better sharpness-diversity trade-off. Empirically, we conducted comprehensive evaluations in various data sets (CIFAR-10, CIFAR-100, TinyImageNet) and showed that SharpBalance not only effectively improves the sharpness-diversity trade-off, but also significantly improves ensemble performance in ID and OOD scenarios. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2407.12996v1-abstract-full').style.display = 'none'; document.getElementById('2407.12996v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 17 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.19522</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Reliable edge machine learning hardware for scientific applications </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Baldi%2C+T">Tommaso Baldi</a>, <a href="/search/cs?searchtype=author&amp;query=Campos%2C+J">Javier Campos</a>, <a href="/search/cs?searchtype=author&amp;query=Hawks%2C+B">Ben Hawks</a>, <a href="/search/cs?searchtype=author&amp;query=Ngadiuba%2C+J">Jennifer Ngadiuba</a>, <a href="/search/cs?searchtype=author&amp;query=Tran%2C+N">Nhan Tran</a>, <a href="/search/cs?searchtype=author&amp;query=Diaz%2C+D">Daniel Diaz</a>, <a href="/search/cs?searchtype=author&amp;query=Duarte%2C+J">Javier Duarte</a>, <a href="/search/cs?searchtype=author&amp;query=Kastner%2C+R">Ryan Kastner</a>, <a href="/search/cs?searchtype=author&amp;query=Meza%2C+A">Andres Meza</a>, <a href="/search/cs?searchtype=author&amp;query=Quinnan%2C+M">Melissa Quinnan</a>, <a href="/search/cs?searchtype=author&amp;query=Weng%2C+O">Olivia Weng</a>, <a href="/search/cs?searchtype=author&amp;query=Geniesse%2C+C">Caleb Geniesse</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir 
Gholami</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Loncar%2C+V">Vladimir Loncar</a>, <a href="/search/cs?searchtype=author&amp;query=Harris%2C+P">Philip Harris</a>, <a href="/search/cs?searchtype=author&amp;query=Agar%2C+J">Joshua Agar</a>, <a href="/search/cs?searchtype=author&amp;query=Qin%2C+S">Shuyu Qin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.19522v1-abstract-short" style="display: inline;"> Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.19522v1-abstract-full').style.display = 'inline'; document.getElementById('2406.19522v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.19522v1-abstract-full" style="display: none;"> Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling ultra-fine-grained model inspection for efficient fault tolerance. 
We discuss approaches to developing and validating reliable algorithms at the scientific edge under such strict latency, resource, power, and area requirements in extreme experimental environments. We study metrics for developing robust algorithms, present preliminary results and mitigation strategies, and conclude with an outlook of these and future directions of research towards the longer-term goal of developing autonomous scientific experimentation methods for accelerated scientific discovery. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.19522v1-abstract-full').style.display = 'none'; document.getElementById('2406.19522v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">IEEE VLSI Test Symposium 2024 (VTS)</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Report number:</span> FERMILAB-CONF-24-0116-CSAID </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.11151</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Recent and Upcoming Developments in Randomized Numerical Linear Algebra for Machine Learning </p> 
<p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Derezi%C5%84ski%2C+M">Michał Dereziński</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.11151v2-abstract-short" style="display: inline;"> Large matrices arise in many machine learning and data analysis applications, including as representations of datasets, graphs, model weights, and first and second-order derivatives. Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness to develop improved algorithms for ubiquitous matrix problems. The area has reached a certain level of maturity; but recent hardware trend&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.11151v2-abstract-full').style.display = 'inline'; document.getElementById('2406.11151v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.11151v2-abstract-full" style="display: none;"> Large matrices arise in many machine learning and data analysis applications, including as representations of datasets, graphs, model weights, and first and second-order derivatives. Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness to develop improved algorithms for ubiquitous matrix problems. The area has reached a certain level of maturity; but recent hardware trends, efforts to incorporate RandNLA algorithms into core numerical libraries, and advances in machine learning, statistics, and random matrix theory, have led to new theoretical and practical challenges. This article provides a self-contained overview of RandNLA, in light of these developments. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.11151v2-abstract-full').style.display = 'none'; document.getElementById('2406.11151v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 18 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 16 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2406.09997</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Towards Scalable and Versatile Weight Space Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Sch%C3%BCrholt%2C+K">Konstantin Schürholt</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Borth%2C+D">Damian Borth</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2406.09997v1-abstract-short" style="display: inline;"> Learning representations of well-trained neural network models holds the promise to provide an understanding of the inner workings of those models. However, previous work has either faced limitations when processing larger networks or was task-specific to either discriminative or generative tasks. This paper introduces the SANE approach to weight-space learning. 
SANE overcomes previous limitations&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.09997v1-abstract-full').style.display = 'inline'; document.getElementById('2406.09997v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2406.09997v1-abstract-full" style="display: none;"> Learning representations of well-trained neural network models holds the promise to provide an understanding of the inner workings of those models. However, previous work has either faced limitations when processing larger networks or was task-specific to either discriminative or generative tasks. This paper introduces the SANE approach to weight-space learning. SANE overcomes previous limitations by learning task-agnostic representations of neural networks that are scalable to larger models of varying architectures and that show capabilities beyond a single task. Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights, thus allowing one to embed larger neural networks as a set of tokens into the learned representation space. SANE reveals global model information from layer-wise embeddings, and it can sequentially generate unseen neural network models, which was unattainable with previous hyper-representation learning methods. Extensive empirical evaluation demonstrates that SANE matches or exceeds state-of-the-art performance on several weight representation learning benchmarks, particularly in initialization for new tasks and larger ResNet architectures. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2406.09997v1-abstract-full').style.display = 'none'; document.getElementById('2406.09997v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 14 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted at ICML 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.20516</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Geophysics">physics.geo-ph</span> </div> </div> <p class="title is-5 mathjax"> WaveCastNet: An AI-enabled Wavefield Forecasting Framework for Earthquake Early Warning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lyu%2C+D">Dongwei Lyu</a>, <a href="/search/cs?searchtype=author&amp;query=Nakata%2C+R">Rie Nakata</a>, <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Pu Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Pitarka%2C+A">Arben Pitarka</a>, <a href="/search/cs?searchtype=author&amp;query=Nakata%2C+N">Nori Nakata</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. 
Benjamin Erichson</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.20516v1-abstract-short" style="display: inline;"> Large earthquakes can be destructive and quickly wreak havoc on a landscape. To mitigate immediate threats, early warning systems have been developed to alert residents, emergency responders, and critical infrastructure operators seconds to a minute before seismic waves arrive. These warnings provide time to take precautions and prevent damage. The success of these systems relies on fast, accurate&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.20516v1-abstract-full').style.display = 'inline'; document.getElementById('2405.20516v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.20516v1-abstract-full" style="display: none;"> Large earthquakes can be destructive and quickly wreak havoc on a landscape. To mitigate immediate threats, early warning systems have been developed to alert residents, emergency responders, and critical infrastructure operators seconds to a minute before seismic waves arrive. These warnings provide time to take precautions and prevent damage. The success of these systems relies on fast, accurate predictions of ground motion intensities, which is challenging due to the complex physics of earthquakes, wave propagation, and their intricate spatial and temporal interactions. To improve early warning, we propose a novel AI-enabled framework, WaveCastNet, for forecasting ground motions from large earthquakes. WaveCastNet integrates a novel convolutional Long Expressive Memory (ConvLEM) model into a sequence to sequence (seq2seq) forecasting framework to model long-term dependencies and multi-scale patterns in both space and time. 
WaveCastNet, which shares weights across spatial and temporal dimensions, requires fewer parameters compared to more resource-intensive models like transformers and thus, in turn, reduces inference times. Importantly, WaveCastNet also generalizes better than transformer-based models to different seismic scenarios, including to more rare and critical situations with higher magnitude earthquakes. Our results using simulated data from the San Francisco Bay Area demonstrate the capability to rapidly predict the intensity and timing of destructive ground motions. Importantly, our proposed approach does not require estimating earthquake magnitudes and epicenters, which are prone to errors using conventional approaches; nor does it require empirical ground motion models, which fail to capture strongly heterogeneous wave propagation effects. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.20516v1-abstract-full').style.display = 'none'; document.getElementById('2405.20516v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. 
</p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2405.13975</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> HOPE for a Robust Parameterization of Long-memory State Space Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yu%2C+A">Annan Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. Benjamin Erichson</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2405.13975v2-abstract-short" style="display: inline;"> State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences. To achieve state-of-the-art performance, an SSM often needs a specifically designed initialization, and the training of state matrices is on a logarithmic scale with a very small learning rate. 
To understand these choices from a unified perspective, we view SSMs&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.13975v2-abstract-full').style.display = 'inline'; document.getElementById('2405.13975v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2405.13975v2-abstract-full" style="display: none;"> State-space models (SSMs) that utilize linear, time-invariant (LTI) systems are known for their effectiveness in learning long sequences. To achieve state-of-the-art performance, an SSM often needs a specifically designed initialization, and the training of state matrices is on a logarithmic scale with a very small learning rate. To understand these choices from a unified perspective, we view SSMs through the lens of Hankel operator theory. Building upon it, we develop a new parameterization scheme, called HOPE, for LTI systems that utilizes Markov parameters within Hankel operators. Our approach helps improve the initialization and training stability, leading to a more robust parameterization. We efficiently implement these innovations by nonuniformly sampling the transfer functions of LTI systems, and they require fewer parameters compared to canonical SSMs. When benchmarked against HiPPO-initialized models such as S4 and S4D, an SSM parameterized by Hankel operators demonstrates improved performance on Long-Range Arena (LRA) tasks. Moreover, our new parameterization endows the SSM with non-decaying memory within a fixed time window, which is empirically corroborated by a sequential CIFAR-10 task with padded noise. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2405.13975v2-abstract-full').style.display = 'none'; document.getElementById('2405.13975v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 22 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.15042</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Lee%2C+N">Nicholas Lee</a>, <a href="/search/cs?searchtype=author&amp;query=Wattanawong%2C+T">Thanakul Wattanawong</a>, <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Mangalam%2C+K">Karttikeya Mangalam</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+S">Sheng Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Anumanchipalli%2C+G">Gopala Anumanchipalli</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.15042v2-abstract-short" style="display: inline;"> Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation st&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.15042v2-abstract-full').style.display = 'inline'; document.getElementById('2403.15042v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.15042v2-abstract-full" style="display: none;"> Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. 
This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a Llama-2-7B student model. Our code is available at . <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.15042v2-abstract-full').style.display = 'none'; document.getElementById('2403.15042v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 July, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 22 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ACL 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.14123</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Hardware Architecture">cs.AR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> </div> </div> <p class="title is-5 mathjax"> AI and Memory Wall </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+Z">Zhewei Yao</a>, <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Hooper%2C+C">Coleman Hooper</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.14123v1-abstract-short" style="display: inline;"> The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. 
Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.14123v1-abstract-full').style.display = 'inline'; document.getElementById('2403.14123v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.14123v1-abstract-full" style="display: none;"> The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/2yrs, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.14123v1-abstract-full').style.display = 'none'; document.getElementById('2403.14123v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Published in IEEE Micro Journal</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.10642</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> </div> </div> <p class="title is-5 mathjax"> Using Uncertainty Quantification to Characterize and Improve Out-of-Domain Learning for PDEs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Mouli%2C+S+C">S. Chandra Mouli</a>, <a href="/search/cs?searchtype=author&amp;query=Maddix%2C+D+C">Danielle C. Maddix</a>, <a href="/search/cs?searchtype=author&amp;query=Alizadeh%2C+S">Shima Alizadeh</a>, <a href="/search/cs?searchtype=author&amp;query=Gupta%2C+G">Gaurav Gupta</a>, <a href="/search/cs?searchtype=author&amp;query=Stuart%2C+A">Andrew Stuart</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuyang Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.10642v2-abstract-short" style="display: inline;"> Existing work in scientific machine learning (SciML) has shown that data-driven learning of solution operators can provide a fast approximate alternative to classical numerical partial differential equation (PDE) solvers. Of these, Neural Operators (NOs) have emerged as particularly promising. 
We observe that several uncertainty quantification (UQ) methods for NOs fail for test inputs that are eve&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.10642v2-abstract-full').style.display = 'inline'; document.getElementById('2403.10642v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.10642v2-abstract-full" style="display: none;"> Existing work in scientific machine learning (SciML) has shown that data-driven learning of solution operators can provide a fast approximate alternative to classical numerical partial differential equation (PDE) solvers. Of these, Neural Operators (NOs) have emerged as particularly promising. We observe that several uncertainty quantification (UQ) methods for NOs fail for test inputs that are even moderately out-of-domain (OOD), even when the model approximates the solution well for in-domain tasks. To address this limitation, we show that ensembling several NOs can identify high-error regions and provide good uncertainty estimates that are well-correlated with prediction errors. Based on this, we propose a cost-effective alternative, DiverseNO, that mimics the properties of the ensemble by encouraging diverse predictions from its multiple heads in the last feed-forward layer. We then introduce Operator-ProbConserv, a method that uses these well-calibrated UQ estimates within the ProbConserv framework to update the model. Our empirical results show that Operator-ProbConserv enhances OOD model performance for a variety of challenging PDE problems and satisfies physical constraints such as conservation laws. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.10642v2-abstract-full').style.display = 'none'; document.getElementById('2403.10642v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ICML 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2403.07815</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Chronos: Learning the Language of Time Series </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ansari%2C+A+F">Abdul Fatir Ansari</a>, <a href="/search/cs?searchtype=author&amp;query=Stella%2C+L">Lorenzo Stella</a>, <a href="/search/cs?searchtype=author&amp;query=Turkmen%2C+C">Caner Turkmen</a>, <a href="/search/cs?searchtype=author&amp;query=Zhang%2C+X">Xiyuan Zhang</a>, <a href="/search/cs?searchtype=author&amp;query=Mercado%2C+P">Pedro Mercado</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+H">Huibin Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Shchur%2C+O">Oleksandr Shchur</a>, <a href="/search/cs?searchtype=author&amp;query=Rangapuram%2C+S+S">Syama Sundar Rangapuram</a>, <a 
href="/search/cs?searchtype=author&amp;query=Arango%2C+S+P">Sebastian Pineda Arango</a>, <a href="/search/cs?searchtype=author&amp;query=Kapoor%2C+S">Shubham Kapoor</a>, <a href="/search/cs?searchtype=author&amp;query=Zschiegner%2C+J">Jasper Zschiegner</a>, <a href="/search/cs?searchtype=author&amp;query=Maddix%2C+D+C">Danielle C. Maddix</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+H">Hao Wang</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Torkkola%2C+K">Kari Torkkola</a>, <a href="/search/cs?searchtype=author&amp;query=Wilson%2C+A+G">Andrew Gordon Wilson</a>, <a href="/search/cs?searchtype=author&amp;query=Bohlke-Schneider%2C+M">Michael Bohlke-Schneider</a>, <a href="/search/cs?searchtype=author&amp;query=Wang%2C+Y">Yuyang Wang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2403.07815v3-abstract-short" style="display: inline;"> We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.07815v3-abstract-full').style.display = 'inline'; document.getElementById('2403.07815v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2403.07815v3-abstract-full" style="display: none;"> We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. 
Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2403.07815v3-abstract-full').style.display = 'none'; document.getElementById('2403.07815v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 12 March, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> March 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Code and model checkpoints available at</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2402.15734</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Data-Efficient Operator Learning via Unsupervised Pretraining and In-Context Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Chen%2C+W">Wuyang Chen</a>, <a href="/search/cs?searchtype=author&amp;query=Song%2C+J">Jialin Song</a>, <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Pu Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Subramanian%2C+S">Shashank Subramanian</a>, <a href="/search/cs?searchtype=author&amp;query=Morozov%2C+D">Dmitriy Morozov</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2402.15734v3-abstract-short" style="display: inline;"> Recent years have witnessed the promise of coupling machine learning methods and physical domain-specific insights for solving scientific problems based on partial differential equations (PDEs). However, being data-intensive, these methods still require a large amount of PDE data. 
This reintroduces the need for expensive numerical PDE solutions, partially undermining the original goal of avoiding&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.15734v3-abstract-full').style.display = 'inline'; document.getElementById('2402.15734v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2402.15734v3-abstract-full" style="display: none;"> Recent years have witnessed the promise of coupling machine learning methods and physical domain-specific insights for solving scientific problems based on partial differential equations (PDEs). However, being data-intensive, these methods still require a large amount of PDE data. This reintroduces the need for expensive numerical PDE solutions, partially undermining the original goal of avoiding these expensive simulations. In this work, seeking data efficiency, we design unsupervised pretraining for PDE operator learning. To reduce the need for training data with heavy simulation costs, we mine unlabeled PDE data without simulated solutions, and we pretrain neural operators with physics-inspired reconstruction-based proxy tasks. To improve out-of-distribution performance, we further assist neural operators in flexibly leveraging a similarity-based method that learns in-context examples, without incurring extra training costs or designs. Extensive empirical evaluations on a diverse set of PDEs demonstrate that our method is highly data-efficient, more generalizable, and even outperforms conventional vision-pretrained models. 
We provide our code at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2402.15734v3-abstract-full').style.display = 'none'; document.getElementById('2402.15734v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 24 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2024. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2401.18079</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Hooper%2C+C">Coleman Hooper</a>, <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Mohammadzadeh%2C+H">Hiva Mohammadzadeh</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Shao%2C+Y+S">Yakun Sophia Shao</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2401.18079v5-abstract-short" style="display: inline;"> LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuan&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.18079v5-abstract-full').style.display = 'inline'; document.getElementById('2401.18079v5-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2401.18079v5-abstract-full" style="display: none;"> LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. 
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; and (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges. By applying our method to the LLaMA, Llama-2, Llama-3, and Mistral models, we achieve &lt; 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system. We develop custom CUDA kernels for KVQuant, showing that we can achieve up to ~1.7x speedups, compared to baseline fp16 matrix-vector multiplications, for the LLaMA-7B model. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.18079v5-abstract-full').style.display = 'none'; document.getElementById('2401.18079v5-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 25 October, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 31 January, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2401.00122</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> SALSA: Sequential Approximate Leverage-Score Algorithm with Application in Analyzing Big Time Series Data </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Eshragh%2C+A">Ali Eshragh</a>, <a href="/search/cs?searchtype=author&amp;query=Yerbury%2C+L">Luke Yerbury</a>, <a href="/search/cs?searchtype=author&amp;query=Nazari%2C+A">Asef Nazari</a>, <a href="/search/cs?searchtype=author&amp;query=Roosta%2C+F">Fred Roosta</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2401.00122v1-abstract-short" style="display: inline;"> We develop a new efficient sequential approximate leverage score algorithm, SALSA, using methods from randomized numerical linear algebra (RandNLA) for large matrices. We demonstrate that, with high probability, the accuracy of SALSA&#39;s approximations is within $(1 + O({\varepsilon}))$ of the true leverage scores. 
In addition, we show that the theoretical computational complexity and numerical accu&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.00122v1-abstract-full').style.display = 'inline'; document.getElementById('2401.00122v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2401.00122v1-abstract-full" style="display: none;"> We develop a new efficient sequential approximate leverage score algorithm, SALSA, using methods from randomized numerical linear algebra (RandNLA) for large matrices. We demonstrate that, with high probability, the accuracy of SALSA&#39;s approximations is within $(1 + O({\varepsilon}))$ of the true leverage scores. In addition, we show that the theoretical computational complexity and numerical accuracy of SALSA surpass existing approximations. These theoretical results are subsequently utilized to develop an efficient algorithm, named LSARMA, for fitting an appropriate ARMA model to large-scale time series data. Our proposed algorithm is, with high probability, guaranteed to find the maximum likelihood estimates of the parameters for the true underlying ARMA model. Furthermore, it has a worst-case running time that significantly improves those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale data strongly support these theoretical results and underscore the efficacy of our new approach. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.00122v1-abstract-full').style.display = 'none'; document.getElementById('2401.00122v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 December, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2024. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">42 pages, 7 figures</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">MSC Class:</span> 62M10 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2312.17351</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Social and Information Networks">cs.SI</span> </div> </div> <p class="title is-5 mathjax"> Multi-scale Local Network Structure Critically Impacts Epidemic Spread and Interventions </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Eldaghar%2C+O">Omar Eldaghar</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Gleich%2C+D+F">David F. Gleich</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2312.17351v1-abstract-short" style="display: inline;"> Network epidemic simulation holds the promise of enabling fine-grained understanding of epidemic behavior, beyond that which is possible with coarse-grained compartmental models. Key inputs to these epidemic simulations are the networks themselves. 
However, empirical measurements and samples of realistic interaction networks typically display properties that are challenging to capture with popular&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.17351v1-abstract-full').style.display = 'inline'; document.getElementById('2312.17351v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2312.17351v1-abstract-full" style="display: none;"> Network epidemic simulation holds the promise of enabling fine-grained understanding of epidemic behavior, beyond that which is possible with coarse-grained compartmental models. Key inputs to these epidemic simulations are the networks themselves. However, empirical measurements and samples of realistic interaction networks typically display properties that are challenging to capture with popular synthetic models of networks. Our empirical results show that epidemic spread behavior is very sensitive to a subtle but ubiquitous form of multi-scale local structure that is not present in common baseline models, including (but not limited to) uniform random graph models (Erdos-Renyi), random configuration models (Chung-Lu), etc. Such structure is not necessary to reproduce very simple network statistics, such as degree distributions or triangle closing probabilities. However, we show that this multi-scale local structure impacts, critically, the behavior of more complex network properties, in particular the effect of interventions such as quarantining; and it enables epidemic spread to be halted in realistic interaction networks, even when it cannot be halted in simple synthetic models. Key insights from our analysis include how epidemics on networks with widespread multi-scale local structure are easier to mitigate, as well as characterizing which nodes are ultimately not likely to be infected. 
We demonstrate that this structure results from more than just local triangle structure in the network, and we illustrate processes based on homophily or social influence and random walks that suggest how this multi-scale local structure arises. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.17351v1-abstract-full').style.display = 'none'; document.getElementById('2312.17351v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 December, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2312.04511</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> An LLM Compiler for Parallel Function Calling </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Moon%2C+S">Suhong Moon</a>, <a href="/search/cs?searchtype=author&amp;query=Tabrizi%2C+R">Ryan Tabrizi</a>, <a href="/search/cs?searchtype=author&amp;query=Lee%2C+N">Nicholas Lee</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2312.04511v3-abstract-short" style="display: inline;"> The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling oft&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.04511v3-abstract-full').style.display = 'inline'; document.getElementById('2312.04511v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2312.04511v3-abstract-full" style="display: none;"> The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling often require sequential reasoning and acting for each function which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls. 
Drawing inspiration from the principles of classical compilers, LLMCompiler enables parallel function calling with three components: (i) a Function Calling Planner, formulating execution plans for function calling; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically generates an optimized orchestration for the function calls and can be used with both open-source and closed-source models. We have benchmarked LLMCompiler on a range of tasks with different patterns of function calling. We observe consistent latency speedup of up to 3.7x, cost savings of up to 6.7x, and accuracy improvement of up to ~9% compared to ReAct. Our code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.04511v3-abstract-full').style.display = 'none'; document.getElementById('2312.04511v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 7 December, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ICML 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2312.00359</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yefan Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Pang%2C+T">Tianyu Pang</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+K">Keqin Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Martin%2C+C+H">Charles H. Martin</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2312.00359v1-abstract-short" style="display: inline;"> Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. 
Indeed, many widely ad&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.00359v1-abstract-full').style.display = 'inline'; document.getElementById('2312.00359v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2312.00359v1-abstract-full" style="display: none;"> Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2312.00359v1-abstract-full').style.display = 'none'; document.getElementById('2312.00359v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 December, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2023 Spotlight, first two authors contributed equally</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.13028</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Signal Processing">eess.SP</span> </div> </div> <p class="title is-5 mathjax"> DMLR: Data-centric Machine Learning Research -- Past, Present and Future </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Oala%2C+L">Luis Oala</a>, <a href="/search/cs?searchtype=author&amp;query=Maskey%2C+M">Manil Maskey</a>, <a href="/search/cs?searchtype=author&amp;query=Bat-Leah%2C+L">Lilith Bat-Leah</a>, <a href="/search/cs?searchtype=author&amp;query=Parrish%2C+A">Alicia Parrish</a>, <a href="/search/cs?searchtype=author&amp;query=G%C3%BCrel%2C+N+M">Nezihe Merve Gürel</a>, <a
href="/search/cs?searchtype=author&amp;query=Kuo%2C+T">Tzu-Sheng Kuo</a>, <a href="/search/cs?searchtype=author&amp;query=Liu%2C+Y">Yang Liu</a>, <a href="/search/cs?searchtype=author&amp;query=Dror%2C+R">Rotem Dror</a>, <a href="/search/cs?searchtype=author&amp;query=Brajovic%2C+D">Danilo Brajovic</a>, <a href="/search/cs?searchtype=author&amp;query=Yao%2C+X">Xiaozhe Yao</a>, <a href="/search/cs?searchtype=author&amp;query=Bartolo%2C+M">Max Bartolo</a>, <a href="/search/cs?searchtype=author&amp;query=Rojas%2C+W+A+G">William A Gaviria Rojas</a>, <a href="/search/cs?searchtype=author&amp;query=Hileman%2C+R">Ryan Hileman</a>, <a href="/search/cs?searchtype=author&amp;query=Aliment%2C+R">Rainier Aliment</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Risdal%2C+M">Meg Risdal</a>, <a href="/search/cs?searchtype=author&amp;query=Lease%2C+M">Matthew Lease</a>, <a href="/search/cs?searchtype=author&amp;query=Samek%2C+W">Wojciech Samek</a>, <a href="/search/cs?searchtype=author&amp;query=Dutta%2C+D">Debojyoti Dutta</a>, <a href="/search/cs?searchtype=author&amp;query=Northcutt%2C+C+G">Curtis G Northcutt</a>, <a href="/search/cs?searchtype=author&amp;query=Coleman%2C+C">Cody Coleman</a>, <a href="/search/cs?searchtype=author&amp;query=Hancock%2C+B">Braden Hancock</a>, <a href="/search/cs?searchtype=author&amp;query=Koch%2C+B">Bernard Koch</a>, <a href="/search/cs?searchtype=author&amp;query=Tadesse%2C+G+A">Girmaw Abebe Tadesse</a>, <a href="/search/cs?searchtype=author&amp;query=Karla%C5%A1%2C+B">Bojan Karlaš</a> , et al. 
(13 additional authors not shown) </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.13028v2-abstract-short" style="display: inline;"> Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods tow&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.13028v2-abstract-full').style.display = 'inline'; document.getElementById('2311.13028v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.13028v2-abstract-full" style="display: none;"> Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.13028v2-abstract-full').style.display = 'none'; document.getElementById('2311.13028v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 1 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 21 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Published in the Journal of Data-centric Machine Learning Research (DMLR) at</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2311.07013</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">ps</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> A PAC-Bayesian Perspective on the Interpolating Information Criterion </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Hodgkinson%2C+L">Liam Hodgkinson</a>, <a href="/search/cs?searchtype=author&amp;query=van+der+Heide%2C+C">Chris van der Heide</a>, <a href="/search/cs?searchtype=author&amp;query=Salomone%2C+R">Robert Salomone</a>, <a href="/search/cs?searchtype=author&amp;query=Roosta%2C+F">Fred Roosta</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2311.07013v1-abstract-short" style="display: inline;"> Deep learning is renowned for its theory-practice gap, whereby principled theory typically fails to provide much beneficial guidance for implementation in practice. This has been highlighted recently by the benign overfitting phenomenon: when neural networks become sufficiently large to interpolate the dataset perfectly, model performance appears to improve with increasing model size, in apparent&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.07013v1-abstract-full').style.display = 'inline'; document.getElementById('2311.07013v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2311.07013v1-abstract-full" style="display: none;"> Deep learning is renowned for its theory-practice gap, whereby principled theory typically fails to provide much beneficial guidance for implementation in practice. This has been highlighted recently by the benign overfitting phenomenon: when neural networks become sufficiently large to interpolate the dataset perfectly, model performance appears to improve with increasing model size, in apparent contradiction with the well-known bias-variance tradeoff. While such phenomena have proven challenging to theoretically study for general models, the recently proposed Interpolating Information Criterion (IIC) provides a valuable theoretical framework to examine performance for overparameterized models. Using the IIC, a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence generalization performance in the interpolating regime. 
From the provided bound, we quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, optimizer, and parameter-initialization scheme; the spectrum of the empirical neural tangent kernel; curvature of the loss landscape; and noise present in the data. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2311.07013v1-abstract-full').style.display = 'none'; document.getElementById('2311.07013v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 November, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> November 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">9 pages</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.05387</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Equation Discovery with Bayesian Spike-and-Slab Priors and Efficient Kernels </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Long%2C+D">Da Long</a>, <a href="/search/cs?searchtype=author&amp;query=Xing%2C+W+W">Wei W. Xing</a>, <a href="/search/cs?searchtype=author&amp;query=Krishnapriyan%2C+A+S">Aditi S. Krishnapriyan</a>, <a href="/search/cs?searchtype=author&amp;query=Kirby%2C+R+M">Robert M. 
Kirby</a>, <a href="/search/cs?searchtype=author&amp;query=Zhe%2C+S">Shandian Zhe</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.05387v2-abstract-short" style="display: inline;"> Discovering governing equations from data is important to many scientific and engineering applications. Despite promising successes, existing methods are still challenged by data sparsity and noise issues, both of which are ubiquitous in practice. Moreover, state-of-the-art methods lack uncertainty quantification and/or are costly in training. To overcome these limitations, we propose a novel equa&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.05387v2-abstract-full').style.display = 'inline'; document.getElementById('2310.05387v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.05387v2-abstract-full" style="display: none;"> Discovering governing equations from data is important to many scientific and engineering applications. Despite promising successes, existing methods are still challenged by data sparsity and noise issues, both of which are ubiquitous in practice. Moreover, state-of-the-art methods lack uncertainty quantification and/or are costly in training. To overcome these limitations, we propose a novel equation discovery method based on Kernel learning and BAyesian Spike-and-Slab priors (KBASS). We use kernel regression to estimate the target function, which is flexible, expressive, and more robust to data sparsity and noises. We combine it with a Bayesian spike-and-slab prior -- an ideal Bayesian sparse distribution -- for effective operator selection and uncertainty quantification. 
We develop an expectation-propagation expectation-maximization (EP-EM) algorithm for efficient posterior inference and function estimation. To overcome the computational challenge of kernel regression, we place the function values on a mesh and induce a Kronecker product construction, and we use tensor algebra to enable efficient computation and optimization. We show the advantages of KBASS on a list of benchmark ODE and PDE discovery tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.05387v2-abstract-full').style.display = 'none'; document.getElementById('2310.05387v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 21 April, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 8 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.02926</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Distributed, Parallel, and Cluster Computing">cs.DC</span> </div> </div> <p class="title is-5 mathjax"> Extensions to the SENSEI In situ Framework for Heterogeneous Architectures </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Loring%2C+B">Burlen Loring</a>, <a href="/search/cs?searchtype=author&amp;query=Bethel%2C+E+W">E. Wes Bethel</a>, <a href="/search/cs?searchtype=author&amp;query=Weber%2C+G+H">Gunther H. Weber</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.02926v1-abstract-short" style="display: inline;"> The proliferation of GPUs and accelerators in recent supercomputing systems, so called heterogeneous architectures, has led to increased complexity in execution environments and programming models as well as to deeper memory hierarchies on these systems. In this work, we discuss challenges that arise in in situ code coupling on these heterogeneous architectures. In particular, we present data and&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.02926v1-abstract-full').style.display = 'inline'; document.getElementById('2310.02926v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.02926v1-abstract-full" style="display: none;"> The proliferation of GPUs and accelerators in recent supercomputing systems, so called heterogeneous architectures, has led to increased complexity in execution environments and programming models as well as to deeper memory hierarchies on these systems. In this work, we discuss challenges that arise in in situ code coupling on these heterogeneous architectures. In particular, we present data and execution model extensions to the SENSEI in situ framework that are targeted at the effective use of systems with heterogeneous architectures. We then use these new data and execution model extensions to investigate several in situ placement and execution configurations and to analyze the impact these choices have on overall performance. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.02926v1-abstract-full').style.display = 'none'; document.getElementById('2310.02926v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">To appear in: ISAV 2023: In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization, November 13 2023</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">ACM Class:</span> I.6.6; E.1 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.02619</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Generative Modeling of Regular and Irregular Time Series Data via Koopman VAEs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Naiman%2C+I">Ilan Naiman</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. Benjamin Erichson</a>, <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Pu Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Azencot%2C+O">Omri Azencot</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.02619v2-abstract-short" style="display: inline;"> Generating realistic time series data is important for many engineering and scientific applications. Existing work tackles this problem using generative adversarial networks (GANs). However, GANs are unstable during training, and they can suffer from mode collapse. While variational autoencoders (VAEs) are known to be more robust to these issues, they are (surprisingly) less considered for tim&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.02619v2-abstract-full').style.display = 'inline'; document.getElementById('2310.02619v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.02619v2-abstract-full" style="display: none;"> Generating realistic time series data is important for many engineering and scientific applications. Existing work tackles this problem using generative adversarial networks (GANs). However, GANs are unstable during training, and they can suffer from mode collapse. While variational autoencoders (VAEs) are known to be more robust to these issues, they are (surprisingly) less considered for time series generation. In this work, we introduce Koopman VAE (KoVAE), a new generative framework that is based on a novel design for the model prior, and that can be optimized for either regular or irregular training data. Inspired by Koopman theory, we represent the latent conditional prior dynamics using a linear map. 
Our approach enhances generative modeling with two desired features: (i) incorporating domain knowledge can be achieved by leveraging spectral tools that prescribe constraints on the eigenvalues of the linear map; and (ii) studying the qualitative behavior and stability of the system can be performed using tools from dynamical systems theory. Our results show that KoVAE outperforms state-of-the-art GAN and VAE methods across several challenging synthetic and real-world time series generation benchmarks. Whether trained on regular or irregular data, KoVAE generates time series that improve both discriminative and predictive metrics. We also present visual evidence suggesting that KoVAE learns probability density functions that better approximate the empirical ground truth distribution. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.02619v2-abstract-full').style.display = 'none'; document.getElementById('2310.02619v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 May, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 4 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Accepted to The Twelfth International Conference on Learning Representations, ICLR 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2310.01698</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Robustifying State-space Models for Long Sequences via Approximate Diagonalization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yu%2C+A">Annan Yu</a>, <a href="/search/cs?searchtype=author&amp;query=Nigmetov%2C+A">Arnur Nigmetov</a>, <a href="/search/cs?searchtype=author&amp;query=Morozov%2C+D">Dmitriy Morozov</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. Benjamin Erichson</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2310.01698v1-abstract-short" style="display: inline;"> State-space models (SSMs) have recently emerged as a framework for learning long-range sequence tasks. An example is the structured state-space sequence (S4) layer, which uses the diagonal-plus-low-rank structure of the HiPPO initialization framework. 
However, the complicated structure of the S4 layer poses challenges; and, in an effort to address these challenges, models such as S4D and S5 have c&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.01698v1-abstract-full').style.display = 'inline'; document.getElementById('2310.01698v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2310.01698v1-abstract-full" style="display: none;"> State-space models (SSMs) have recently emerged as a framework for learning long-range sequence tasks. An example is the structured state-space sequence (S4) layer, which uses the diagonal-plus-low-rank structure of the HiPPO initialization framework. However, the complicated structure of the S4 layer poses challenges; and, in an effort to address these challenges, models such as S4D and S5 have considered a purely diagonal structure. This choice simplifies the implementation, improves computational efficiency, and allows channel communication. However, diagonalizing the HiPPO framework is itself an ill-posed problem. In this paper, we propose a general solution for this and related ill-posed diagonalization problems in machine learning. We introduce a generic, backward-stable &#34;perturb-then-diagonalize&#34; (PTD) methodology, which is based on the pseudospectral theory of non-normal operators, and which may be interpreted as the approximate diagonalization of the non-normal matrices defining SSMs. Based on this, we introduce the S4-PTD and S5-PTD models. Through theoretical analysis of the transfer functions of different initialization schemes, we demonstrate that the S4-PTD/S5-PTD initialization strongly converges to the HiPPO framework, while the S4D/S5 initialization only achieves weak convergences. As a result, our new models show resilience to Fourier-mode noise-perturbed inputs, a crucial property not achieved by the S4D/S5 models. 
In addition to improved robustness, our S5-PTD model averages 87.6% accuracy on the Long-Range Arena benchmark, demonstrating that the PTD methodology helps to improve the accuracy of deep learning models. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2310.01698v1-abstract-full').style.display = 'none'; document.getElementById('2310.01698v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2308.15720</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Surrogate-based Autotuning for Randomized Sketching Algorithms in Regression Problems </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Cho%2C+Y">Younghyun Cho</a>, <a href="/search/cs?searchtype=author&amp;query=Demmel%2C+J+W">James W. Demmel</a>, <a href="/search/cs?searchtype=author&amp;query=Derezi%C5%84ski%2C+M">Michał Dereziński</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+H">Haoyun Li</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+H">Hengrui Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Murray%2C+R+J">Riley J. 
Murray</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2308.15720v1-abstract-short" style="display: inline;"> Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be effective in handling high-dimensional computational problems, providing high-quality empirical performance as well as strong probabilistic guarantees. However, their practical application is complicated by the fact that the user needs to set various algorithm-specific tuning parameters which are different than those use&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2308.15720v1-abstract-full').style.display = 'inline'; document.getElementById('2308.15720v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2308.15720v1-abstract-full" style="display: none;"> Algorithms from Randomized Numerical Linear Algebra (RandNLA) are known to be effective in handling high-dimensional computational problems, providing high-quality empirical performance as well as strong probabilistic guarantees. However, their practical application is complicated by the fact that the user needs to set various algorithm-specific tuning parameters which are different than those used in traditional NLA. This paper demonstrates how a surrogate-based autotuning approach can be used to address fundamental problems of parameter selection in RandNLA algorithms. In particular, we provide a detailed investigation of surrogate-based autotuning for sketch-and-precondition (SAP) based randomized least squares methods, which have been one of the great success stories in modern RandNLA. 
Empirical results show that our surrogate-based autotuning approach can achieve near-optimal performance with much less tuning cost than a random search (up to about 4x fewer trials of different parameter configurations). Moreover, while our experiments focus on least squares, our results demonstrate a general-purpose autotuning pipeline applicable to any kind of RandNLA algorithm. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2308.15720v1-abstract-full').style.display = 'none'; document.getElementById('2308.15720v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 29 August, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">MSC Class:</span> 68W20; 65F20; 65Y20 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2307.09797</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> Probabilistic Forecasting with Coherent Aggregation </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Olivares%2C+K+G">Kin G. Olivares</a>, <a href="/search/cs?searchtype=author&amp;query=N%C3%A9giar%2C+G">Geoffrey Négiar</a>, <a href="/search/cs?searchtype=author&amp;query=Ma%2C+R">Ruijun Ma</a>, <a href="/search/cs?searchtype=author&amp;query=Meetei%2C+O+N">O. 
Nangba Meetei</a>, <a href="/search/cs?searchtype=author&amp;query=Cao%2C+M">Mengfei Cao</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2307.09797v3-abstract-short" style="display: inline;"> Obtaining accurate probabilistic forecasts is an important operational challenge in many applications, like energy management, climate forecast, supply chain planning, and resource allocation. In many of these applications, there is a natural hierarchical structure over the forecasted quantities; and forecasting systems that adhere to this hierarchical structure are said to be coherent. Furthermor&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.09797v3-abstract-full').style.display = 'inline'; document.getElementById('2307.09797v3-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2307.09797v3-abstract-full" style="display: none;"> Obtaining accurate probabilistic forecasts is an important operational challenge in many applications, like energy management, climate forecast, supply chain planning, and resource allocation. In many of these applications, there is a natural hierarchical structure over the forecasted quantities; and forecasting systems that adhere to this hierarchical structure are said to be coherent. Furthermore, operational planning benefits from accuracy at all levels of the aggregation hierarchy. Building accurate and coherent forecasting systems, however, is challenging: classic multivariate time series tools and neural network methods are still being adapted for this purpose. 
In this paper, we augment an MQForecaster neural network architecture with a novel deep Gaussian factor forecasting model that achieves coherence by construction, yielding a method we call the Deep Coherent Factor Model Neural Network (DeepCoFactor) model. DeepCoFactor generates samples that can be differentiated with respect to the model parameters, allowing optimization on various sample-based learning objectives that align with the forecasting system&#39;s goals, including quantile loss and the scaled Continuous Ranked Probability Score (CRPS). In a comparison to state-of-the-art coherent forecasting methods, DeepCoFactor achieves significant improvements in scaled CRPS forecast accuracy, with average gains of 15%, as measured on six publicly-available forecasting datasets. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.09797v3-abstract-full').style.display = 'none'; document.getElementById('2307.09797v3-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 November, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 19 July, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">10 pages of main text. 
Updated method and results</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2307.07785</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> The Interpolating Information Criterion for Overparameterized Models </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Hodgkinson%2C+L">Liam Hodgkinson</a>, <a href="/search/cs?searchtype=author&amp;query=van+der+Heide%2C+C">Chris van der Heide</a>, <a href="/search/cs?searchtype=author&amp;query=Salomone%2C+R">Robert Salomone</a>, <a href="/search/cs?searchtype=author&amp;query=Roosta%2C+F">Fred Roosta</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2307.07785v1-abstract-short" style="display: inline;"> The problem of model selection is considered for the setting of interpolating estimators, where the number of model parameters exceeds the size of the dataset. Classical information criteria typically consider the large-data limit, penalizing model size. However, these criteria are not appropriate in modern settings where overparameterized models tend to perform well. 
For any overparameterized mod&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.07785v1-abstract-full').style.display = 'inline'; document.getElementById('2307.07785v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2307.07785v1-abstract-full" style="display: none;"> The problem of model selection is considered for the setting of interpolating estimators, where the number of model parameters exceeds the size of the dataset. Classical information criteria typically consider the large-data limit, penalizing model size. However, these criteria are not appropriate in modern settings where overparameterized models tend to perform well. For any overparameterized model, we show that there exists a dual underparameterized model that possesses the same marginal likelihood, thus establishing a form of Bayesian duality. This enables more classical methods to be used in the overparameterized setting, revealing the Interpolating Information Criterion, a measure of model quality that naturally incorporates the choice of prior into the model selection. Our new information criterion accounts for prior misspecification, geometric and spectral properties of the model, and is numerically consistent with known empirical and theoretical behavior in this regime. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.07785v1-abstract-full').style.display = 'none'; document.getElementById('2307.07785v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 July, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">23 pages, 2 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2307.03595</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Artificial Intelligence">cs.AI</span> </div> </div> <p class="title is-5 mathjax"> GEANN: Scalable Graph Augmentations for Multi-Horizon Time Series Forecasting </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Yang%2C+S">Sitan Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Wolff%2C+M">Malcolm Wolff</a>, <a href="/search/cs?searchtype=author&amp;query=Ramasubramanian%2C+S">Shankar Ramasubramanian</a>, <a href="/search/cs?searchtype=author&amp;query=Quenneville-Belair%2C+V">Vincent Quenneville-Belair</a>, <a href="/search/cs?searchtype=author&amp;query=Metha%2C+R">Ronak Metha</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2307.03595v1-abstract-short" style="display: inline;"> Encoder-decoder deep neural networks have been increasingly studied for multi-horizon time series forecasting, especially in real-world applications. However, to forecast accurately, these sophisticated models typically rely on a large number of time series examples with substantial history. 
A rapidly growing topic of interest is forecasting time series which lack sufficient historical data -- oft&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.03595v1-abstract-full').style.display = 'inline'; document.getElementById('2307.03595v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2307.03595v1-abstract-full" style="display: none;"> Encoder-decoder deep neural networks have been increasingly studied for multi-horizon time series forecasting, especially in real-world applications. However, to forecast accurately, these sophisticated models typically rely on a large number of time series examples with substantial history. A rapidly growing topic of interest is forecasting time series which lack sufficient historical data -- often referred to as the ``cold start&#39;&#39; problem. In this paper, we introduce a novel yet simple method to address this problem by leveraging graph neural networks (GNNs) as a data augmentation for enhancing the encoder used by such forecasters. These GNN-based features can capture complex inter-series relationships, and their generation process can be optimized end-to-end with the forecasting task. We show that our architecture can use either data-driven or domain knowledge-defined graphs, scaling to incorporate information from multiple very large graphs with millions of nodes. In our target application of demand forecasting for a large e-commerce retailer, we demonstrate on both a small dataset of 100K products and a large dataset with over 2 million products that our method improves overall performance over competitive baseline models. More importantly, we show that it brings substantially more gains to ``cold start&#39;&#39; products such as those newly launched or recently out-of-stock. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2307.03595v1-abstract-full').style.display = 'none'; document.getElementById('2307.03595v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 July, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> July 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2306.14070</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computer Vision and Pattern Recognition">cs.CV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Image and Video Processing">eess.IV</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Computational Physics">physics.comp-ph</span> </div> </div> <p class="title is-5 mathjax"> SuperBench: A Super-Resolution Benchmark Dataset for Scientific Machine Learning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Ren%2C+P">Pu Ren</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. Benjamin Erichson</a>, <a href="/search/cs?searchtype=author&amp;query=Subramanian%2C+S">Shashank Subramanian</a>, <a href="/search/cs?searchtype=author&amp;query=San%2C+O">Omer San</a>, <a href="/search/cs?searchtype=author&amp;query=Lukic%2C+Z">Zarija Lukic</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2306.14070v1-abstract-short" style="display: inline;"> Super-Resolution (SR) techniques aim to enhance data resolution, enabling the retrieval of finer details, and improving the overall quality and fidelity of the data representation. There is growing interest in applying SR methods to complex spatiotemporal systems within the Scientific Machine Learning (SciML) community, with the hope of accelerating numerical simulations and/or improving forecasts&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.14070v1-abstract-full').style.display = 'inline'; document.getElementById('2306.14070v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2306.14070v1-abstract-full" style="display: none;"> Super-Resolution (SR) techniques aim to enhance data resolution, enabling the retrieval of finer details, and improving the overall quality and fidelity of the data representation. There is growing interest in applying SR methods to complex spatiotemporal systems within the Scientific Machine Learning (SciML) community, with the hope of accelerating numerical simulations and/or improving forecasts in weather, climate, and related areas. However, the lack of standardized benchmark datasets for comparing and validating SR methods hinders progress and adoption in SciML. To address this, we introduce SuperBench, the first benchmark dataset featuring high-resolution datasets (up to $2048\times2048$ dimensions), including data from fluid flows, cosmology, and weather. Here, we focus on validating spatial SR performance from data-centric and physics-preserved perspectives, as well as assessing robustness to data degradation tasks. 
While deep learning-based SR methods (developed in the computer vision community) excel on certain tasks, despite relatively limited prior physics information, we identify limitations of these methods in accurately capturing intricate fine-scale features and preserving fundamental physical properties and constraints in scientific data. These shortcomings highlight the importance and subtlety of incorporating domain knowledge into ML models. We anticipate that SuperBench will significantly advance SR methods for scientific tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.14070v1-abstract-full').style.display = 'none'; document.getElementById('2306.14070v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 24 June, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2306.09262</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Programming Languages">cs.PL</span> </div> </div> <p class="title is-5 mathjax"> A Heavy-Tailed Algebra for Probabilistic Programming </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Liang%2C+F">Feynman Liang</a>, <a href="/search/cs?searchtype=author&amp;query=Hodgkinson%2C+L">Liam Hodgkinson</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2306.09262v1-abstract-short" style="display: inline;"> Despite the successes of probabilistic models based on passing noise through neural networks, recent work has identified that such methods often fail to capture tail behavior accurately, unless the tails of the base distribution are appropriately calibrated. To overcome this deficiency, we propose a systematic approach for analyzing the tails of random variables, and we illustrate how this approac&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.09262v1-abstract-full').style.display = 'inline'; document.getElementById('2306.09262v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2306.09262v1-abstract-full" style="display: none;"> Despite the successes of probabilistic models based on passing noise through neural networks, recent work has identified that such methods often fail to capture tail behavior accurately, unless the tails of the base distribution are appropriately calibrated. To overcome this deficiency, we propose a systematic approach for analyzing the tails of random variables, and we illustrate how this approach can be used during the static analysis (before drawing samples) pass of a probabilistic programming language compiler. To characterize how the tails change under various operations, we develop an algebra which acts on a three-parameter family of tail asymptotics and which is based on the generalized Gamma distribution. 
Our algebraic operations are closed under addition and multiplication; they are capable of distinguishing sub-Gaussians with differing scales; and they handle ratios sufficiently well to reproduce the tails of most important statistical distributions directly from their definitions. Our empirical results confirm that inference algorithms that leverage our heavy-tailed algebra attain superior performance across a number of density modeling and variational inference tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.09262v1-abstract-full').style.display = 'none'; document.getElementById('2306.09262v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 15 June, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">21 pages, 6 figures</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2306.07629</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> SqueezeLLM: Dense-and-Sparse Quantization </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Hooper%2C+C">Coleman Hooper</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a>, <a 
href="/search/cs?searchtype=author&amp;query=Dong%2C+Z">Zhen Dong</a>, <a href="/search/cs?searchtype=author&amp;query=Li%2C+X">Xiuyu Li</a>, <a href="/search/cs?searchtype=author&amp;query=Shen%2C+S">Sheng Shen</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2306.07629v4-abstract-short" style="display: inline;"> Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models.&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.07629v4-abstract-full').style.display = 'inline'; document.getElementById('2306.07629v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2306.07629v4-abstract-full" style="display: none;"> Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. 
While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2306.07629v4-abstract-full').style.display = 'none'; document.getElementById('2306.07629v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 4 June, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 13 June, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> June 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ICML 2024</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.18383</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> A Three-regime Model of Network Pruning </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Zhou%2C+Y">Yefan Zhou</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Chang%2C+A">Arin Chang</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.18383v1-abstract-short" style="display: inline;"> Recent work has highlighted the complex influence training hyperparameters, e.g., the number of training epochs, can have on the prunability of machine learning models. Perhaps surprisingly, a systematic approach to predict precisely how adjusting a specific hyperparameter will affect prunability remains elusive. 
To address this gap, we introduce a phenomenological model grounded in the statistica&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.18383v1-abstract-full').style.display = 'inline'; document.getElementById('2305.18383v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.18383v1-abstract-full" style="display: none;"> Recent work has highlighted the complex influence training hyperparameters, e.g., the number of training epochs, can have on the prunability of machine learning models. Perhaps surprisingly, a systematic approach to predict precisely how adjusting a specific hyperparameter will affect prunability remains elusive. To address this gap, we introduce a phenomenological model grounded in the statistical mechanics of learning. Our approach uses temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance. A key empirical result we identify is a sharp transition phenomenon: depending on the value of a load-like parameter in the pruned model, increasing the value of a temperature-like parameter in the pre-pruned model may either enhance or impair subsequent pruning performance. Based on this transition, we build a three-regime model by taxonomizing the global structure of the pruned NN loss landscape. Our model reveals that the dichotomous effect of high temperature is associated with transitions between distinct types of global structures in the post-pruned model. Based on our results, we present three case-studies: 1) determining whether to increase or decrease a hyperparameter for improved pruning; 2) selecting the best model to prune from a family of models; and 3) tuning the hyperparameter of the Sharpness Aware Minimization method for better pruning performance. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.18383v1-abstract-full').style.display = 'none'; document.getElementById('2305.18383v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ICML 2023</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Proceedings of the 40th International Conference on Machine Learning, PMLR 202:42790-42809, 2023 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.18379</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Optimization and Control">math.OC</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Constrained Optimization via Exact Augmented Lagrangian and Randomized Iterative Sketching </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Hong%2C+I">Ilgee Hong</a>, <a href="/search/cs?searchtype=author&amp;query=Na%2C+S">Sen Na</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Kolar%2C+M">Mladen Kolar</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.18379v1-abstract-short" style="display: inline;"> We consider solving equality-constrained nonlinear, nonconvex optimization problems. This class of problems appears widely in a variety of applications in machine learning and engineering, ranging from constrained deep neural networks, to optimal control, to PDE-constrained optimization. We develop an adaptive inexact Newton method for this problem class. In each iteration, we solve the Lagrangian&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.18379v1-abstract-full').style.display = 'inline'; document.getElementById('2305.18379v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.18379v1-abstract-full" style="display: none;"> We consider solving equality-constrained nonlinear, nonconvex optimization problems. This class of problems appears widely in a variety of applications in machine learning and engineering, ranging from constrained deep neural networks, to optimal control, to PDE-constrained optimization. We develop an adaptive inexact Newton method for this problem class. In each iteration, we solve the Lagrangian Newton system inexactly via a randomized iterative sketching solver, and select a suitable stepsize by performing line search on an exact augmented Lagrangian merit function. The randomized solvers have advantages over deterministic linear system solvers by significantly reducing per-iteration flops complexity and storage cost, when equipped with suitable sketching matrices. 
Our method adaptively controls the accuracy of the randomized solver and the penalty parameters of the exact augmented Lagrangian, to ensure that the inexact Newton direction is a descent direction of the exact augmented Lagrangian. This allows us to establish a global almost sure convergence. We also show that a unit stepsize is admissible locally, so that our method exhibits a local linear convergence. Furthermore, we prove that the linear convergence can be strengthened to superlinear convergence if we gradually sharpen the adaptive accuracy condition on the randomized solver. We demonstrate the superior performance of our method on benchmark nonlinear problems in CUTEst test set, constrained logistic regression with data from LIBSVM, and a PDE-constrained problem. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.18379v1-abstract-full').style.display = 'none'; document.getElementById('2305.18379v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 28 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">25 pages, 4 figures</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> ICML 2023 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2305.12313</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> When are ensembles really effective? </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Theisen%2C+R">Ryan Theisen</a>, <a href="/search/cs?searchtype=author&amp;query=Kim%2C+H">Hyunsuk Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Yang%2C+Y">Yaoqing Yang</a>, <a href="/search/cs?searchtype=author&amp;query=Hodgkinson%2C+L">Liam Hodgkinson</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2305.12313v1-abstract-short" style="display: inline;"> Ensembling has a long history in statistical data analysis, with many impactful applications. However, in many modern machine learning settings, the benefits of ensembling are less ubiquitous and less obvious. We study, both theoretically and empirically, the fundamental question of when ensembling yields significant performance improvements in classification tasks. 
Theoretically, we prove new res&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.12313v1-abstract-full').style.display = 'inline'; document.getElementById('2305.12313v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2305.12313v1-abstract-full" style="display: none;"> Ensembling has a long history in statistical data analysis, with many impactful applications. However, in many modern machine learning settings, the benefits of ensembling are less ubiquitous and less obvious. We study, both theoretically and empirically, the fundamental question of when ensembling yields significant performance improvements in classification tasks. Theoretically, we prove new results relating the \emph{ensemble improvement rate} (a measure of how much ensembling decreases the error rate versus a single model, on a relative scale) to the \emph{disagreement-error ratio}. We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate; and that, conversely, one classifier is often enough whenever the disagreement rate is low relative to the average error rate. On the way to proving these results, we derive, under a mild condition called \emph{competence}, improved upper and lower bounds on the average test error rate of the majority vote classifier. To complement this theory, we study ensembling empirically in a variety of settings, verifying the predictions made by our theory, and identifying practical scenarios where ensembling does and does not result in large performance improvements. Perhaps most notably, we demonstrate a distinct difference in behavior between interpolating models (popular in current practice) and non-interpolating models (such as tree-based methods, where ensembling is popular), demonstrating that ensembling helps considerably more in the latter case than in the former. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2305.12313v1-abstract-full').style.display = 'none'; document.getElementById('2305.12313v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 20 May, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> May 2023. </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2304.06745</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Hardware Architecture">cs.AR</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="High Energy Physics - Experiment">hep-ex</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Instrumentation and Detectors">physics.ins-det</span> </div> </div> <p class="title is-5 mathjax"> End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Campos%2C+J">Javier Campos</a>, <a href="/search/cs?searchtype=author&amp;query=Dong%2C+Z">Zhen Dong</a>, <a href="/search/cs?searchtype=author&amp;query=Duarte%2C+J">Javier Duarte</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. 
Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Mitrevski%2C+J">Jovan Mitrevski</a>, <a href="/search/cs?searchtype=author&amp;query=Tran%2C+N">Nhan Tran</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2304.06745v1-abstract-short" style="display: inline;"> We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpi&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.06745v1-abstract-full').style.display = 'inline'; document.getElementById('2304.06745v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2304.06745v1-abstract-full" style="display: none;"> We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs) for efficient field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) hardware. Our approach leverages Hessian-aware quantization (HAWQ) of NNs, the Quantized Open Neural Network Exchange (QONNX) intermediate representation, and the hls4ml tool flow for transpiling NNs into FPGA and ASIC firmware. This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow that can be deployed for real-time machine learning applications in a wide range of scientific and industrial settings. 
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the CERN Large Hadron Collider (LHC). Given the high collision rate, all data processing must be implemented on custom ASIC and FPGA hardware within a strict area and latency. Based on these constraints, we implement an optimized mixed-precision NN classifier for high-momentum particle jets in simulated LHC proton-proton collisions. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2304.06745v1-abstract-full').style.display = 'none'; document.getElementById('2304.06745v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 13 April, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">19 pages, 6 figures, 2 tables</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Report number:</span> FERMILAB-PUB-23-150-CSAID-ETD </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2302.14017</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> </div> </div> <p class="title is-5 mathjax"> Full Stack Optimization of Transformer Inference: a Survey </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a 
href="/search/cs?searchtype=author&amp;query=Hooper%2C+C">Coleman Hooper</a>, <a href="/search/cs?searchtype=author&amp;query=Wattanawong%2C+T">Thanakul Wattanawong</a>, <a href="/search/cs?searchtype=author&amp;query=Kang%2C+M">Minwoo Kang</a>, <a href="/search/cs?searchtype=author&amp;query=Yan%2C+R">Ruohan Yan</a>, <a href="/search/cs?searchtype=author&amp;query=Genc%2C+H">Hasan Genc</a>, <a href="/search/cs?searchtype=author&amp;query=Dinh%2C+G">Grace Dinh</a>, <a href="/search/cs?searchtype=author&amp;query=Huang%2C+Q">Qijing Huang</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Shao%2C+Y+S">Yakun Sophia Shao</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2302.14017v1-abstract-short" style="display: inline;"> Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.14017v1-abstract-full').style.display = 'inline'; document.getElementById('2302.14017v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2302.14017v1-abstract-full" style="display: none;"> Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. 
These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements, compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.14017v1-abstract-full').style.display = 'none'; document.getElementById('2302.14017v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 February, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Presented in Workshop on Architecture and System Support for Transformer Models (ASSYST) at ISCA 2023 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2302.11474</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Mathematical Software">cs.MS</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Optimization and Control">math.OC</span> </div> </div> <p class="title is-5 mathjax"> Randomized Numerical Linear Algebra : A Perspective on the Field With an Eye to Software </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Murray%2C+R">Riley Murray</a>, <a href="/search/cs?searchtype=author&amp;query=Demmel%2C+J">James Demmel</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. 
Benjamin Erichson</a>, <a href="/search/cs?searchtype=author&amp;query=Melnichenko%2C+M">Maksim Melnichenko</a>, <a href="/search/cs?searchtype=author&amp;query=Malik%2C+O+A">Osman Asif Malik</a>, <a href="/search/cs?searchtype=author&amp;query=Grigori%2C+L">Laura Grigori</a>, <a href="/search/cs?searchtype=author&amp;query=Luszczek%2C+P">Piotr Luszczek</a>, <a href="/search/cs?searchtype=author&amp;query=Derezi%C5%84ski%2C+M">Michał Dereziński</a>, <a href="/search/cs?searchtype=author&amp;query=Lopes%2C+M+E">Miles E. Lopes</a>, <a href="/search/cs?searchtype=author&amp;query=Liang%2C+T">Tianyu Liang</a>, <a href="/search/cs?searchtype=author&amp;query=Luo%2C+H">Hengrui Luo</a>, <a href="/search/cs?searchtype=author&amp;query=Dongarra%2C+J">Jack Dongarra</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2302.11474v2-abstract-short" style="display: inline;"> Randomized numerical linear algebra - RandNLA, for short - concerns the use of randomization as a resource to develop improved algorithms for large-scale linear algebra computations. The origins of contemporary RandNLA lay in theoretical computer science, where it blossomed from a simple idea: randomization provides an avenue for computing approximate solutions to linear algebra problems more ef&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.11474v2-abstract-full').style.display = 'inline'; document.getElementById('2302.11474v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2302.11474v2-abstract-full" style="display: none;"> Randomized numerical linear algebra - RandNLA, for short - concerns the use of randomization as a resource to develop improved algorithms for large-scale linear algebra computations. 
The origins of contemporary RandNLA lay in theoretical computer science, where it blossomed from a simple idea: randomization provides an avenue for computing approximate solutions to linear algebra problems more efficiently than deterministic algorithms. This idea proved fruitful in the development of scalable algorithms for machine learning and statistical data analysis applications. However, RandNLA&#39;s true potential only came into focus upon integration with the fields of numerical analysis and &#34;classical&#34; numerical linear algebra. Through the efforts of many individuals, randomized algorithms have been developed that provide full control over the accuracy of their solutions and that can be every bit as reliable as algorithms that might be found in libraries such as LAPACK. Recent years have even seen the incorporation of certain RandNLA methods into MATLAB, the NAG Library, NVIDIA&#39;s cuSOLVER, and SciKit-Learn. For all its success, we believe that RandNLA has yet to realize its full potential. In particular, we believe the scientific community stands to benefit significantly from suitably defined &#34;RandBLAS&#34; and &#34;RandLAPACK&#34; libraries, to serve as standards conceptually analogous to BLAS and LAPACK. This 200-page monograph represents a step toward defining such standards. In it, we cover topics spanning basic sketching, least squares and optimization, low-rank approximation, full matrix decompositions, leverage score sampling, and sketching data with tensor product structures (among others). Much of the provided pseudo-code has been tested via publicly available MATLAB and Python implementations. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.11474v2-abstract-full').style.display = 'none'; document.getElementById('2302.11474v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 April, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 22 February, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">v1: this is the first arXiv release of LAPACK Working Note 299. v2: complete rewrite of the subsection on trace estimation, among other changes. See frontmatter page ii (pdf page 5) for revision history</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2302.11002</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Analysis of PDEs">math.AP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Numerical Analysis">math.NA</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1016/j.physd.2023.133952 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Learning Physical Models that Can Respect Conservation Laws </p> <p class="authors"> <span class="search-hit">Authors:</span> <a 
href="/search/cs?searchtype=author&amp;query=Hansen%2C+D">Derek Hansen</a>, <a href="/search/cs?searchtype=author&amp;query=Maddix%2C+D+C">Danielle C. Maddix</a>, <a href="/search/cs?searchtype=author&amp;query=Alizadeh%2C+S">Shima Alizadeh</a>, <a href="/search/cs?searchtype=author&amp;query=Gupta%2C+G">Gaurav Gupta</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2302.11002v4-abstract-short" style="display: inline;"> Recent work in scientific machine learning (SciML) has focused on incorporating partial differential equation (PDE) information into the learning process. Much of this work has focused on relatively &#34;easy&#34; PDE operators (e.g., elliptic and parabolic), with less emphasis on relatively &#34;hard&#34; PDE operators (e.g., hyperbolic). Within numerical PDEs, the latter problem class requires control of a type&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.11002v4-abstract-full').style.display = 'inline'; document.getElementById('2302.11002v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2302.11002v4-abstract-full" style="display: none;"> Recent work in scientific machine learning (SciML) has focused on incorporating partial differential equation (PDE) information into the learning process. Much of this work has focused on relatively &#34;easy&#34; PDE operators (e.g., elliptic and parabolic), with less emphasis on relatively &#34;hard&#34; PDE operators (e.g., hyperbolic). Within numerical PDEs, the latter problem class requires control of a type of volume element or conservation constraint, which is known to be challenging. 
Delivering on the promise of SciML requires seamlessly incorporating both types of problems into the learning process. To address this issue, we propose ProbConserv, a framework for incorporating conservation constraints into a generic SciML architecture. To do so, ProbConserv combines the integral form of a conservation law with a Bayesian update. We provide a detailed analysis of ProbConserv on learning with the Generalized Porous Medium Equation (GPME), a widely-applicable parameterized family of PDEs that illustrates the qualitative properties of both easier and harder PDEs. ProbConserv is effective for easy GPME variants, performing well with state-of-the-art competitors; and for harder GPME variants it outperforms other approaches that do not guarantee volume conservation. ProbConserv seamlessly enforces physical conservation constraints, maintains probabilistic uncertainty quantification (UQ), and deals well with shocks and heteroscedasticities. In each case, it achieves superior predictive performance on downstream tasks. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.11002v4-abstract-full').style.display = 'none'; document.getElementById('2302.11002v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 10 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 21 February, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2023. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">ICML 2023, Physica D: Nonlinear Phenomena, Accepted</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Physica D: Nonlinear Phenomena, 457 (2024) 133952 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2302.07863</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> </div> </div> <p class="title is-5 mathjax"> Speculative Decoding with Big Little Decoder </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Kim%2C+S">Sehoon Kim</a>, <a href="/search/cs?searchtype=author&amp;query=Mangalam%2C+K">Karttikeya Mangalam</a>, <a href="/search/cs?searchtype=author&amp;query=Moon%2C+S">Suhong Moon</a>, <a href="/search/cs?searchtype=author&amp;query=Malik%2C+J">Jitendra Malik</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a>, <a href="/search/cs?searchtype=author&amp;query=Gholami%2C+A">Amir Gholami</a>, <a href="/search/cs?searchtype=author&amp;query=Keutzer%2C+K">Kurt Keutzer</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2302.07863v4-abstract-short" style="display: inline;"> The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. 
The inference latency is further exacerbated by autoregressive generative tasks,&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.07863v4-abstract-full').style.display = 'inline'; document.getElementById('2302.07863v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2302.07863v4-abstract-full" style="display: none;"> The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model&#39;s inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model&#39;s inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. 
On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2302.07863v4-abstract-full').style.display = 'none'; document.getElementById('2302.07863v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 12 October, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 15 February, 2023; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> February 2023. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">NeurIPS 2023</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2212.00228</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Neural and Evolutionary Computing">cs.NE</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Gated Recurrent Neural Networks with Weighted Time-Delay Feedback </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/cs?searchtype=author&amp;query=Erichson%2C+N+B">N. 
Benjamin Erichson</a>, <a href="/search/cs?searchtype=author&amp;query=Lim%2C+S+H">Soon Hoe Lim</a>, <a href="/search/cs?searchtype=author&amp;query=Mahoney%2C+M+W">Michael W. Mahoney</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2212.00228v1-abstract-short" style="display: inline;"> We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism in order to improve the modeling of long-term dependencies in sequential data. This model is a discretized version of a continuous-time formulation of a recurrent unit, where the dynamics are governed by delay differential equations (DDEs). By considering a suitable time-discretization scheme, we propose&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.00228v1-abstract-full').style.display = 'inline'; document.getElementById('2212.00228v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2212.00228v1-abstract-full" style="display: none;"> We introduce a novel gated recurrent unit (GRU) with a weighted time-delay feedback mechanism in order to improve the modeling of long-term dependencies in sequential data. This model is a discretized version of a continuous-time formulation of a recurrent unit, where the dynamics are governed by delay differential equations (DDEs). By considering a suitable time-discretization scheme, we propose $τ$-GRU, a discrete-time gated recurrent unit with delay. We prove the existence and uniqueness of solutions for the continuous-time model, and we demonstrate that the proposed feedback mechanism can help improve the modeling of long-term dependencies. 
Our empirical results show that $τ$-GRU can converge faster and generalize better than state-of-the-art recurrent units and gated recurrent architectures on a range of tasks, including time-series classification, human activity recognition, and speech recognition. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2212.00228v1-abstract-full').style.display = 'none'; document.getElementById('2212.00228v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 30 November, 2022; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> December 2022. </p> </li> </ol> <nav class="pagination is-small is-centered breathe-horizontal" role="navigation" aria-label="pagination"> <a href="" class="pagination-previous is-invisible">Previous </a> <a href="/search/?searchtype=author&amp;query=Mahoney%2C+M+W&amp;start=50" class="pagination-next" >Next </a> <ul class="pagination-list"> <li> <a href="/search/?searchtype=author&amp;query=Mahoney%2C+M+W&amp;start=0" class="pagination-link is-current" aria-label="Goto page 1">1 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Mahoney%2C+M+W&amp;start=50" class="pagination-link " aria-label="Page 2" aria-current="page">2 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Mahoney%2C+M+W&amp;start=100" class="pagination-link " aria-label="Page 3" aria-current="page">3 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Mahoney%2C+M+W&amp;start=150" class="pagination-link " aria-label="Page 4" aria-current="page">4 </a> </li> <li> <a href="/search/?searchtype=author&amp;query=Mahoney%2C+M+W&amp;start=200" class="pagination-link " aria-label="Page 5" aria-current="page">5 </a> </li> </ul> </nav> 