CINXE.COM
JMLR
<?xml version="1.0" encoding="utf-8"?> <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"> <channel> <atom:link href="http://jmlr.org/jmlr.xml" rel="self" type="application/rss+xml" /> <link>http://www.jmlr.org</link> <title>JMLR</title> <description>Journal of Machine Learning Research</description> <item> <title> Random ReLU Neural Networks as Non-Gaussian Processes </title> <link> http://jmlr.org/papers/v26/24-0737.html </link> <pdf> http://jmlr.org/papers/volume26/24-0737/24-0737.pdf </pdf> <pubDate>2025</pubDate> <author>Rahul Parhi, Pakshal Bohra, Ayoub El Biari, Mehrsa Pourya, Michael Unser</author> <description> We consider a large class of shallow neural networks with randomly initialized parameters and rectified linear unit activation functions. We prove that these random neural networks are well-defined non-Gaussian processes. As a by-product, we demonstrate that these networks are solutions to stochastic differential equations driven by impulsive white noise (combinations of random Dirac measures). These processes are parameterized by the law of the weights and biases as well as the density of activation thresholds in each bounded region of the input domain. We prove that these processes are isotropic and wide-sense self-similar with Hurst exponent 3/2. We also derive a remarkably simple closed-form expression for their autocovariance function. Our results are fundamentally different from prior work in that we consider a non-asymptotic viewpoint: The number of neurons in each bounded region of the input domain (i.e., the width) is itself a random variable with a Poisson law with mean proportional to the density parameter. Finally, we show that, under suitable hypotheses, as the expected width tends to infinity, these processes can converge in law not only to Gaussian processes, but also to non-Gaussian processes depending on the law of the weights. Our asymptotic results provide a new take on several classical results (wide networks converge to Gaussian processes) as well as some new ones (wide networks can converge to non-Gaussian processes). </description> </item> <item> <title> Riemannian Bilevel Optimization </title> <link> http://jmlr.org/papers/v26/24-0397.html </link> <pdf> http://jmlr.org/papers/volume26/24-0397/24-0397.pdf </pdf> <pubDate>2025</pubDate> <author>Jiaxiang Li, Shiqian Ma</author> <description> In this work, we consider the bilevel optimization problem on Riemannian manifolds. We inspect the calculation of the hypergradient of such problems on general manifolds and thus enable the utilization of gradient-based algorithms to solve such problems. The calculation of the hypergradient requires utilizing the notion of Riemannian cross-derivative and we inspect the properties and the numerical calculations of Riemannian cross-derivatives. Algorithms in both deterministic and stochastic settings, named respectively RieBO and RieSBO, are proposed that include the existing Euclidean bilevel optimization algorithms as special cases. Numerical experiments on robust optimization on Riemannian manifolds are presented to show the applicability and efficiency of the proposed methods. </description> </item> <item> <title> Supervised Learning with Evolving Tasks and Performance Guarantees </title> <link> http://jmlr.org/papers/v26/24-0343.html </link> <pdf> http://jmlr.org/papers/volume26/24-0343/24-0343.pdf </pdf> <pubDate>2025</pubDate> <author>Verónica Álvarez, Santiago Mazuelas, Jose A. Lozano</author> <description> Multiple supervised learning scenarios are composed by a sequence of classification tasks. For instance, multi-task learning and continual learning aim to learn a sequence of tasks that is either fixed or grows over time. Existing techniques for learning tasks that are in a sequence are tailored to specific scenarios, lacking adaptability to others. In addition, most of existing techniques consider situations in which the order of the tasks in the sequence is not relevant. However, it is common that tasks in a sequence are evolving in the sense that consecutive tasks often have a higher similarity. This paper presents a learning methodology that is applicable to multiple supervised learning scenarios and adapts to evolving tasks. Differently from existing techniques, we provide computable tight performance guarantees and analytically characterize the increase in the effective sample size. Experiments on benchmark datasets show the performance improvement of the proposed methodology in multiple scenarios and the reliability of the presented performance guarantees. </description> </item> <item> <title> Error estimation and adaptive tuning for unregularized robust M-estimator </title> <link> http://jmlr.org/papers/v26/24-0060.html </link> <pdf> http://jmlr.org/papers/volume26/24-0060/24-0060.pdf </pdf> <pubDate>2025</pubDate> <author>Pierre C. Bellec, Takuya Koriyama</author> <description> We consider unregularized robust M-estimators for linear models under Gaussian design and heavy-tailed noise, in the proportional asymptotics regime where the sample size n and the number of features p are both increasing such that $p/n \to \gamma\in (0,1)$. An estimator of the out-of-sample error of a robust M-estimator is analyzed and proved to be consistent for a large family of loss functions that includes the Huber loss. As an application of this result, we propose an adaptive tuning procedure of the scale parameter $\lambda>0$ of a given loss function $\rho$: choosing $\hat \lambda$ in a given interval $I$ that minimizes the out-of-sample error estimate of the M-estimator constructed with loss $\rho_\lambda(\cdot) = \lambda^2 \rho(\cdot/\lambda)$ leads to the optimal out-of-sample error over $I$. The proof relies on a smoothing argument: the unregularized M-estimation objective function is perturbed, or smoothed, with a Ridge penalty that vanishes as $n\to+\infty$, and shows that the unregularized M-estimator of interest inherits properties of its smoothed version. </description> </item> <item> <title> From Sparse to Dense Functional Data in High Dimensions: Revisiting Phase Transitions from a Non-Asymptotic Perspective </title> <link> http://jmlr.org/papers/v26/23-1578.html </link> <pdf> http://jmlr.org/papers/volume26/23-1578/23-1578.pdf </pdf> <pubDate>2025</pubDate> <author>Shaojun Guo, Dong Li, Xinghao Qiao, Yizhu Wang</author> <description> Nonparametric estimation of the mean and covariance functions is ubiquitous in functional data analysis and local linear smoothing techniques are most frequently used. Zhang and Wang (2016) explored different types of asymptotic properties of the estimation, which reveal interesting phase transition phenomena based on the relative order of the average sampling frequency per subject $T$ to the number of subjects $n$, partitioning the data into three categories: “sparse”, “semi-dense”, and “ultra-dense”. In an increasingly available high-dimensional scenario, where the number of functional variables $p$ is large in relation to $n$, we revisit this open problem from a non-asymptotic perspective by deriving comprehensive concentration inequalities for the local linear smoothers. Besides being of interest by themselves, our non-asymptotic results lead to elementwise maximum rates of $L_2$ convergence and uniform convergence serving as a fundamentally important tool for further convergence analysis when $p$ grows exponentially with $n$ and possibly $T$. With the presence of extra $\log p$ terms to account for the high-dimensional effect, we then investigate the scaled phase transitions and the corresponding elementwise maximum rates from sparse to semi-dense to ultra-dense functional data in high dimensions. We also discuss a couple of applications of our theoretical results. Finally, numerical studies are carried out to confirm the established theoretical properties. </description> </item> <item> <title> Locally Private Causal Inference for Randomized Experiments </title> <link> http://jmlr.org/papers/v26/23-1401.html </link> <pdf> http://jmlr.org/papers/volume26/23-1401/23-1401.pdf </pdf> <pubDate>2025</pubDate> <author>Yuki Ohnishi, Jordan Awan</author> <description> Local differential privacy is a differential privacy paradigm in which individuals first apply a privacy mechanism to their data (often by adding noise) before transmitting the result to a curator. The noise for privacy results in additional bias and variance in their analyses. Thus it is of great importance for analysts to incorporate the privacy noise into valid inference. In this article, we develop methodologies to infer causal effects from locally privatized data under randomized experiments. First, we present frequentist estimators under various privacy scenarios with their variance estimators and plug-in confidence intervals. We show a na\"ive debiased estimator results in inferior mean-squared error (MSE) compared to minimax lower bounds. In contrast, we show that using a customized privacy mechanism, we can match the lower bound, giving minimax optimal inference. We also develop a Bayesian nonparametric methodology along with a blocked Gibbs sampling algorithm, which can be applied to any of our proposed privacy mechanisms, and which performs especially well in terms of MSE for tight privacy budgets. Finally, we present simulation studies to evaluate the performance of our proposed frequentist and Bayesian methodologies for various privacy budgets, resulting in useful suggestions for performing causal inference for privatized data. </description> </item> <item> <title> Estimating Network-Mediated Causal Effects via Principal Components Network Regression </title> <link> http://jmlr.org/papers/v26/23-1317.html </link> <pdf> http://jmlr.org/papers/volume26/23-1317/23-1317.pdf </pdf> <pubDate>2025</pubDate> <author>Alex Hayes, Mark M. Fredrickson, Keith Levin</author> <description> We develop a method to decompose causal effects on a social network into an indirect effect mediated by the network, and a direct effect independent of the social network. To handle the complexity of network structures, we assume that latent social groups act as causal mediators. We develop principal components network regression models to differentiate the social effect from the non-social effect. Fitting the regression models is as simple as principal components analysis followed by ordinary least squares estimation. We prove asymptotic theory for regression coefficients from this procedure and show that it is widely applicable, allowing for a variety of distributions on the regression errors and network edges. We carefully characterize the counterfactual assumptions necessary to use the regression models for causal inference, and show that current approaches to causal network regression may result in over-control bias. The method is very general, so that it is applicable to many types of structured data beyond social networks, such as text, areal data, psychometrics, images and omics. </description> </item> <item> <title> Selective Inference with Distributed Data </title> <link> http://jmlr.org/papers/v26/23-0309.html </link> <pdf> http://jmlr.org/papers/volume26/23-0309/23-0309.pdf </pdf> <pubDate>2025</pubDate> <author>Sifan Liu, Snigdha Panigrahi</author> <description> When data are distributed across multiple sites or machines rather than centralized in one location, researchers face the challenge of extracting meaningful information without directly sharing individual data points. While there are many distributed methods for point estimation using sparse regression, few options are available for estimating uncertainties or conducting hypothesis tests based on the estimated sparsity. In this paper, we introduce a procedure for performing selective inference with distributed data. We consider a scenario where each local machine solves a lasso problem and communicates the selected predictors to a central machine. The central machine then aggregates these selected predictors to form a generalized linear model (GLM). Our goal is to provide valid inference for the selected GLM while reusing data that have been used in the model selection process. Our proposed procedure only requires low-dimensional summary statistics from local machines, thus keeping communication costs low and preserving the privacy of individual data sets. Furthermore, this procedure can be applied in scenarios where model selection is repeatedly conducted on randomly subsampled data sets, addressing the p-value lottery problem linked with model selection. We demonstrate the effectiveness of our approach through simulations and an analysis of a medical data set on ICU admissions. </description> </item> <item> <title> Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization </title> <link> http://jmlr.org/papers/v26/22-0863.html </link> <pdf> http://jmlr.org/papers/volume26/22-0863/22-0863.pdf </pdf> <pubDate>2025</pubDate> <author>Tianyi Lin, Chi Jin, Michael I. Jordan</author> <description> We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems in the form of $\min_x \max_{y \in Y} f(x, y)$, where the objective function $f(x, y)$ is nonconvex in $x$ and concave in $y$, and the constraint set $Y \subseteq \mathbb{R}^n$ is convex and bounded. In the convex-concave setting, the single-timescale gradient descent ascent (GDA) algorithm is widely used in applications and has been shown to have strong convergence guarantees. In more general settings, however, it can fail to converge. Our contribution is to design TTGDA algorithms that are effective beyond the convex-concave setting, efficiently finding a stationary point of the function $\Phi(\cdot) := \max_{y \in Y} f(\cdot, y)$. We also establish theoretical bounds on the complexity of solving both smooth and nonsmooth nonconvex-concave minimax optimization problems. To the best of our knowledge, this is the first systematic analysis of TTGDA for nonconvex minimax optimization, shedding light on its superior performance in training generative adversarial networks (GANs) and in other real-world application problems. </description> </item> <item> <title> An Axiomatic Definition of Hierarchical Clustering </title> <link> http://jmlr.org/papers/v26/24-1052.html </link> <pdf> http://jmlr.org/papers/volume26/24-1052/24-1052.pdf </pdf> <pubDate>2025</pubDate> <author>Ery Arias-Castro, Elizabeth Coda</author> <description> In this paper, we take an axiomatic approach to defining a population hierarchical clustering for piecewise constant densities, and in a similar manner to Lebesgue integration, extend this definition to more general densities. When the density satisfies some mild conditions, e.g., when it has connected support, is continuous, and vanishes only at infinity, or when the connected components of the density satisfy these conditions, our axiomatic definition results in Hartigan's definition of cluster tree. </description> </item> <item> <title> Test-Time Training on Video Streams </title> <link> http://jmlr.org/papers/v26/24-0439.html </link> <pdf> http://jmlr.org/papers/volume26/24-0439/24-0439.pdf </pdf> <pubDate>2025</pubDate> <author>Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang</author> <description> Prior work has established Test-Time Training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is first trained on the same instance using a self-supervised task such as reconstruction. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2x and 1.5x for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses strictly more information, training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on bias-variance trade-off. </description> </item> <item> <title> Adaptive Client Sampling in Federated Learning via Online Learning with Bandit Feedback </title> <link> http://jmlr.org/papers/v26/24-0385.html </link> <pdf> http://jmlr.org/papers/volume26/24-0385/24-0385.pdf </pdf> <pubDate>2025</pubDate> <author>Boxin Zhao, Lingxiao Wang, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Chaochao Chen, Mladen Kolar</author> <description> Due to the high cost of communication, federated learning (FL) systems need to sample a subset of clients that are involved in each round of training. As a result, client sampling plays an important role in FL systems as it affects the convergence rate of optimization algorithms used to train machine learning models. Despite its importance, there is limited work on how to sample clients effectively. In this paper, we cast client sampling as an online learning task with bandit feedback, which we solve with an online stochastic mirror descent (OSMD) algorithm designed to minimize the sampling variance. We then theoretically show how our sampling method can improve the convergence speed of federated optimization algorithms over the widely used uniform sampling. Through both simulated and real data experiments, we empirically illustrate the advantages of the proposed client sampling algorithm over uniform sampling and existing online learning-based sampling strategies. The proposed adaptive sampling procedure is applicable beyond the FL problem studied here and can be used to improve the performance of stochastic optimization procedures such as stochastic gradient descent and stochastic coordinate descent. </description> </item> <item> <title> A Random Matrix Approach to Low-Multilinear-Rank Tensor Approximation </title> <link> http://jmlr.org/papers/v26/24-0193.html </link> <pdf> http://jmlr.org/papers/volume26/24-0193/24-0193.pdf </pdf> <pubDate>2025</pubDate> <author>Hugo Lebeau, Florent Chatelain, Romain Couillet</author> <description> This work presents a comprehensive understanding of the estimation of a planted low-rank signal from a general spiked tensor model near the computational threshold. Relying on standard tools from the theory of large random matrices, we characterize the large-dimensional spectral behavior of the unfoldings of the data tensor and exhibit relevant signal-to-noise ratios governing the detectability of the principal directions of the signal. These results allow to accurately predict the reconstruction performance of truncated multilinear SVD (MLSVD) in the non-trivial regime. This is particularly important since it serves as an initialization of the higher-order orthogonal iteration (HOOI) scheme, whose convergence to the best low-multilinear-rank approximation depends entirely on its initialization. We give a sufficient condition for the convergence of HOOI and show that the number of iterations before convergence tends to $1$ in the large-dimensional limit. </description> </item> <item> <title> Memory Gym: Towards Endless Tasks to Benchmark Memory Capabilities of Agents </title> <link> http://jmlr.org/papers/v26/24-0043.html </link> <pdf> http://jmlr.org/papers/volume26/24-0043/24-0043.pdf </pdf> <pubDate>2025</pubDate> <author>Marco Pleines, Matthias Pallasch, Frank Zimmer, Mike Preuss</author> <description> Memory Gym presents a suite of 2D partially observable environments, namely Mortar Mayhem, Mystery Path, and Searing Spotlights, designed to benchmark memory capabilities in decision-making agents. These environments, originally with finite tasks, are expanded into innovative, endless formats, mirroring the escalating challenges of cumulative memory games such as “I packed my bag”. This progression in task design shifts the focus from merely assessing sample efficiency to also probing the levels of memory effectiveness in dynamic, prolonged scenarios. To address the gap in available memory-based Deep Reinforcement Learning baselines, we introduce an implementation within the open-source CleanRL library that integrates Transformer-XL (TrXL) with Proximal Policy Optimization. This approach utilizes TrXL as a form of episodic memory, employing a sliding window technique. Our comparative study between the Gated Recurrent Unit (GRU) and TrXL reveals varied performances across our finite and endless tasks. TrXL, on the finite environments, demonstrates superior effectiveness over GRU, but only when utilizing an auxiliary loss to reconstruct observations. Notably, GRU makes a remarkable resurgence in all endless tasks, consistently outperforming TrXL by significant margins. Website and Source Code: https://marcometer.github.io/jmlr_2024.github.io/ </description> </item> <item> <title> Enhancing Graph Representation Learning with Localized Topological Features </title> <link> http://jmlr.org/papers/v26/23-1424.html </link> <pdf> http://jmlr.org/papers/volume26/23-1424/23-1424.pdf </pdf> <pubDate>2025</pubDate> <author>Zuoyu Yan, Qi Zhao, Ze Ye, Tengfei Ma, Liangcai Gao, Zhi Tang, Yusu Wang, Chao Chen</author> <description> Representation learning on graphs is a fundamental problem that can be crucial in various tasks. Graph neural networks, the dominant approach for graph representation learning, are limited in their representation power. Therefore, it can be beneficial to explicitly extract and incorporate high-order topological and geometric information into these models. In this paper, we propose a principled approach to extract the rich connectivity information of graphs based on the theory of persistent homology. Our method utilizes the topological features to enhance the representation learning of graph neural networks and achieve state-of-the-art performance on various node classification and link prediction benchmarks. We also explore the option of end-to-end learning of the topological features, i.e., treating topological computation as a differentiable operator during learning. Our theoretical analysis and empirical study provide insights and potential guidelines for employing topological features in graph learning tasks. </description> </item> <item> <title> Deep Out-of-Distribution Uncertainty Quantification via Weight Entropy Maximization </title> <link> http://jmlr.org/papers/v26/23-1359.html </link> <pdf> http://jmlr.org/papers/volume26/23-1359/23-1359.pdf </pdf> <pubDate>2025</pubDate> <author>Antoine de Mathelin, François Deheeger, Mathilde Mougeot, Nicolas Vayatis</author> <description> This paper deals with uncertainty quantification and out-of-distribution detection in deep learning using Bayesian and ensemble methods. It proposes a practical solution to the lack of prediction diversity observed recently for standard approaches when used out-of-distribution (Ovadia et al., 2019; Liu et al., 2021). Considering that this issue is mainly related to a lack of weight diversity, we claim that standard methods sample in "over-restricted" regions of the weight space due to the use of "over-regularization" processes, such as weight decay and zero-mean centered Gaussian priors. We propose to solve the problem by adopting the maximum entropy principle for the weight distribution, with the underlying idea to maximize the weight diversity. Under this paradigm, the epistemic uncertainty is described by the weight distribution of maximal entropy that produces neural networks "consistent" with the training observations. Considering stochastic neural networks, a practical optimization is derived to build such a distribution, defined as a trade-off between the average empirical risk and the weight distribution entropy. We provide both theoretical and numerical results to assess the efficiency of the approach. In particular, the proposed algorithm appears in the top three best methods in all configurations of an extensive out-of-distribution detection benchmark including more than thirty competitors. </description> </item> <item> <title> DisC2o-HD: Distributed causal inference with covariates shift for analyzing real-world high-dimensional data </title> <link> http://jmlr.org/papers/v26/23-1254.html </link> <pdf> http://jmlr.org/papers/volume26/23-1254/23-1254.pdf </pdf> <pubDate>2025</pubDate> <author>Jiayi Tong, Jie Hu, George Hripcsak, Yang Ning, Yong Chen</author> <description> High-dimensional healthcare data, such as electronic health records (EHR) data and claims data, present two primary challenges due to the large number of variables and the need to consolidate data from multiple clinical sites. The third key challenge is the potential existence of heterogeneity in terms of covariate shift. In this paper, we propose a distributed learning algorithm accounting for covariate shift to estimate the average treatment effect (ATE) for high-dimensional data, named DisC2o-HD. Leveraging the surrogate likelihood method, our method calibrates the estimates of the propensity score and outcome models to approximately attain the desired covariate balancing property, while accounting for the covariate shift across multiple clinical sites. We show that our distributed covariate balancing propensity score estimator can approximate the pooled estimator, which is obtained by pooling the data from multiple sites together. The proposed estimator remains consistent if either the propensity score model or the outcome regression model is correctly specified. The semiparametric efficiency bound is achieved when both the propensity score and the outcome models are correctly specified. We conduct simulation studies to demonstrate the performance of the proposed algorithm; additionally, we conduct an empirical study to present the readiness of implementation and validity. </description> </item> <item> <title> Bayes Meets Bernstein at the Meta Level: an Analysis of Fast Rates in Meta-Learning with PAC-Bayes </title> <link> http://jmlr.org/papers/v26/23-025.html </link> <pdf> http://jmlr.org/papers/volume26/23-025/23-025.pdf </pdf> <pubDate>2025</pubDate> <author>Charles Riou, Pierre Alquier, Badr-Eddine Chérief-Abdellatif</author> <description> Bernstein's condition is a key assumption that guarantees fast rates in machine learning. For example, under this condition, the Gibbs posterior with prior $\pi$ has an excess risk in $O(d_{\pi}/n)$, as opposed to $O(\sqrt{d_{\pi}/n})$ in the general case, where $n$ denotes the number of observations and $d_{\pi}$ is a complexity parameter which depends on the prior $\pi$. In this paper, we examine the Gibbs posterior in the context of meta-learning, i.e., when learning the prior $\pi$ from $T$ previous tasks. Our main result is that Bernstein's condition always holds at the meta level, regardless of its validity at the observation level. This implies that the additional cost to learn the Gibbs prior $\pi$, which will reduce the term $d_\pi$ across tasks, is in $O(1/T)$, instead of the expected $O(1/\sqrt{T})$. We further illustrate how this result improves on the standard rates in three different settings: discrete priors, Gaussian priors and mixture of Gaussian priors. </description> </item> <item> <title> Efficiently Escaping Saddle Points in Bilevel Optimization </title> <link> http://jmlr.org/papers/v26/22-0136.html </link> <pdf> http://jmlr.org/papers/volume26/22-0136/22-0136.pdf </pdf> <pubDate>2025</pubDate> <author>Minhui Huang, Xuxing Chen, Kaiyi Ji, Shiqian Ma, Lifeng Lai</author> <description> Bilevel optimization is one of the fundamental problems in machine learning and optimization. Recent theoretical developments in bilevel optimization focus on finding the first-order stationary points for nonconvex-strongly-convex cases. In this paper, we analyze algorithms that can escape saddle points in nonconvex-strongly-convex bilevel optimization. Specifically, we show that the perturbed approximate implicit differentiation (AID) with a warm start strategy finds an $\epsilon$-approximate local minimum of bilevel optimization in $\tilde{O}(\epsilon^{-2})$ iterations with high probability. Moreover, we propose an inexact NEgative-curvature-Originated-from-Noise Algorithm (iNEON), an algorithm that can escape saddle point and find local minimum of stochastic bilevel optimization. As a by-product, we provide the first nonasymptotic analysis of perturbed multi-step gradient descent ascent (GDmax) algorithm that converges to local minimax point for minimax problems. </description> </item> </channel> </rss>