New articles on Statistics


[1] 2601.16250

Distributional Computational Graphs: Error Bounds

We study a general framework of distributional computational graphs: computational graphs whose inputs are probability distributions rather than point values. We analyze the discretization error that arises when these graphs are evaluated using finite approximations of continuous probability distributions. Such an approximation might result from representing a continuous real-valued distribution with a discrete one, from constructing an empirical distribution from samples, or from the output of another distributional computational graph. We establish non-asymptotic error bounds in terms of the Wasserstein-1 distance, without imposing structural assumptions on the computational graph.
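
As a rough illustration of the discretization error that such bounds control (a toy sketch, not the paper's framework: the transformation, sample sizes, and large-reference-sample approximation below are invented for illustration), one can push an n-sample empirical approximation of a Gaussian input through a single Lipschitz node and track the Wasserstein-1 gap to the true output distribution:

```python
# Toy sketch (not the paper's framework): approximate a continuous input
# distribution by an n-sample empirical distribution, push both through a
# simple node of a computational graph, and measure the Wasserstein-1 gap.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def node(x):
    # an arbitrary Lipschitz transformation playing the role of a graph node
    return np.tanh(x) + 0.1 * x

# "true" output distribution, approximated here by a very large reference sample
reference = node(rng.normal(size=1_000_000))

for n in [100, 1_000, 10_000, 100_000]:
    empirical = node(rng.normal(size=n))
    w1 = wasserstein_distance(empirical, reference)
    print(f"n = {n:>7d}   W1 to reference output = {w1:.4f}")
```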


[2] 2601.16318

Orthogonal factorial designs for trials of therapist-delivered interventions: Randomising intervention-therapist combinations to patients

It is recognised that treatment-related clustering should be allowed for in the sample size and analyses of individually-randomised parallel-group trials that evaluate therapist-delivered interventions such as psychotherapy. Here, interventions are a treatment factor, but therapists are not. If the aim of a trial is to separate effects of therapists from those of interventions, we propose that interventions and therapists should be regarded as two potentially interacting treatment factors (one fixed, one random) with a factorial structure. We consider the specific design where each therapist delivers each intervention (crossed therapist-intervention design), and the resulting therapist-intervention combinations are randomised to patients. We adopt a classical Design of Experiments (DoE) approach to propose a family of orthogonal factorial designs and their associated data analyses, which also allow for therapist learning and centre effects. We set out the associated data analyses using ANOVA and regression and report the results of a small simulation study conducted to explore the performance of the proposed randomisation methods in estimating the intervention effect and its standard error, the between-therapist variance and the between-therapist variance in the intervention effect. We conclude that more purposeful trial design has the potential to lead to better evidence on a range of complex interventions and outline areas for further methodological research.


[3] 2601.16340

Matrix-Response Generalized Linear Mixed Model with Applications to Longitudinal Brain Images

Longitudinal brain imaging data facilitate the monitoring of structural and functional alterations in individual brains across time, offering essential understanding of dynamic neurobiological mechanisms. Such data improve sensitivity for detecting early biomarkers of disease progression and enhance the evaluation of intervention effects. While recent matrix-response regression models can relate static brain networks to external predictors, there remain few statistical methods for longitudinal brain networks, especially those derived from high-dimensional imaging data. We introduce a matrix-response generalized linear mixed model that accommodates longitudinal brain networks and identifies edges whose connectivity is influenced by external predictors. An efficient Monte Carlo Expectation-Maximization algorithm is developed for parameter estimation. Extensive simulations demonstrate effective identification of covariate-related network components and accurate parameter estimation. We further demonstrate the usage of the proposed method through applications to diffusion tensor imaging (DTI) and functional MRI (fMRI) datasets.


[4] 2601.16347

Long-Term Probabilistic Forecast of Vegetation Conditions Using Climate Attributes in the Four Corners Region

Weather conditions can drastically alter the state of crops and rangelands, and in turn, impact the incomes and food security of individuals worldwide. Satellite-based remote sensing offers an effective way to monitor vegetation and climate variables on regional and global scales. The annual peak Normalized Difference Vegetation Index (NDVI), derived from satellite observations, is closely associated with crop development, rangeland biomass, and vegetation growth. Although various machine learning methods have been developed to forecast NDVI over short time ranges, such as one-month-ahead predictions, long-term forecasting approaches, such as one-year-ahead predictions of vegetation conditions, are not yet available. To fill this gap, we develop a two-phase machine learning model to forecast the one-year-ahead peak NDVI over high-resolution grids, using the Four Corners region of the Southwestern United States as a testbed. In phase one, we identify informative climate attributes, including precipitation and maximum vapor pressure deficit, and develop the generalized parallel Gaussian process that captures the relationship between climate attributes and NDVI. In phase two, we forecast these climate attributes using historical data at least one year before the NDVI prediction month, which then serve as inputs to forecast the peak NDVI at each spatial grid. We develop open-source tools that outperform alternative methods for both gross NDVI and grid-based NDVI one-year forecasts, providing information that can help farmers and ranchers make actionable plans a year in advance.


[5] 2601.16364

Hybrid Partial Least Squares Regression with Multiple Functional and Scalar Predictors

Motivated by renal imaging studies that combine renogram curves with pharmacokinetic and demographic covariates, we propose Hybrid partial least squares (Hybrid PLS) for simultaneous supervised dimension reduction and regression in the presence of cross-modality correlations. The proposed approach embeds multiple functional and scalar predictors into a unified hybrid Hilbert space and rigorously extends the nonlinear iterative PLS (NIPALS) algorithm. This theoretical development is complemented by a sample-level algorithm that incorporates roughness penalties to control smoothness. By exploiting the rank-one structure of the resulting optimization problem, the algorithm admits a computationally efficient closed-form solution that requires solving only linear systems at each iteration. We establish fundamental geometric properties of the proposed framework, including orthogonality of the latent scores and PLS directions. Extensive numerical studies on synthetic data, together with an application to a renal imaging study, validate these theoretical results and demonstrate the method's ability to recover predictive structure under intermodal multicollinearity, yielding parsimonious low-dimensional representations.


[6] 2601.16368

Inference for competing risks based on area between curves statistics

In competing risks models, cumulative incidence functions are commonly compared to infer differences between groups. Many existing inference methods, however, struggle when these functions cross during the time frame of interest. To address this problem, we investigate a test statistic based on the area between cumulative incidence functions. As the corresponding limiting distribution depends on quantities that are typically unknown, we propose a wild bootstrap approach to obtain a feasible and asymptotically valid two-sample test. The finite sample performance of the proposed method, in comparison with existing methods, is examined in an extensive simulation study.
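
A minimal sketch of the area-between-curves quantity underlying the test statistic is given below; the cumulative incidence functions here are toy closed-form curves standing in for Aalen-Johansen estimates, and the wild bootstrap calibration described in the abstract is not shown.

```python
# Toy sketch: area between two cumulative incidence functions (CIFs) on a
# common time grid. In practice the CIFs would be Aalen-Johansen estimates
# from each sample; here they are invented smooth curves, and the wild
# bootstrap calibration of the test is not shown.
import numpy as np

def area_between(times, cif1, cif2):
    # trapezoidal rule for the integral of |F1(t) - F2(t)| over the time frame
    gap = np.abs(cif1 - cif2)
    dt = np.diff(times)
    return np.sum(0.5 * (gap[1:] + gap[:-1]) * dt)

times = np.linspace(0.0, 5.0, 501)
cif_group1 = 0.6 * (1.0 - np.exp(-0.8 * times))   # toy CIF, group 1
cif_group2 = 0.5 * (1.0 - np.exp(-1.2 * times))   # toy CIF, group 2 (crosses group 1)

print("area between curves:", area_between(times, cif_group1, cif_group2))
```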


[7] 2601.16385

Spherical Spatial Autoregressive Model for Spherically Embedded Spatial Data

Spherically embedded spatial data are spatially indexed observations whose values naturally reside on or can be equivalently mapped to the unit sphere. Such data are increasingly ubiquitous in fields ranging from geochemistry to demography. However, analysing such data presents unique difficulties due to the intrinsic non-Euclidean nature of the sphere, and rigorous methodologies for statistical modelling, inference, and uncertainty quantification remain limited. This paper introduces a unified framework to address these three limitations for spherically embedded spatial data. We first propose a novel spherical spatial autoregressive model that leverages optimal transport geometry and then extend it to accommodate exogenous covariates. Second, for either scenario with or without covariates, we establish the asymptotic properties of the estimators and derive a distribution-free Wald test for spatial dependence, complemented by a bootstrap procedure to enhance finite-sample performance. Third, we contribute a novel approach to uncertainty quantification by developing a conformal prediction procedure specifically tailored to spherically embedded spatial data. The practical utility of these methodological advances is illustrated through extensive simulations and applications to Spanish geochemical compositions and Japanese age-at-death mortality distributions.


[8] 2601.16427

Perfect Clustering for Sparse Directed Stochastic Block Models

Exact recovery in stochastic block models (SBMs) is well understood in undirected settings, but remains considerably less developed for directed and sparse networks, particularly when the number of communities diverges. Spectral methods for directed SBMs often lack stability in asymmetric, low-degree regimes, and existing non-spectral approaches focus primarily on undirected or dense settings. We propose a fully non-spectral, two-stage procedure for community detection in sparse directed SBMs with potentially growing numbers of communities. The method first estimates the directed probability matrix using a neighborhood-smoothing scheme tailored to the asymmetric setting, and then applies $K$-means clustering to the estimated rows, thereby avoiding the limitations of eigen- or singular value decompositions in sparse, asymmetric networks. Our main theoretical contribution is a uniform row-wise concentration bound for the smoothed estimator, obtained through new arguments that control asymmetric neighborhoods and separate in- and out-degree effects. These results imply the exact recovery of all community labels with probability tending to one, under mild sparsity and separation conditions that allow both $\gamma_n \to 0$ and $K_n \to \infty$. Simulation studies, including highly directed, sparse, and non-symmetric block structures, demonstrate that the proposed procedure performs reliably in regimes where directed spectral and score-based methods deteriorate. To the best of our knowledge, this provides the first exact recovery guarantee for this class of non-spectral, neighborhood-smoothing methods in the sparse, directed setting.


[9] 2601.16431

Sequential Experimental Designs for Kriging Model

Computer experiments have become an indispensable alternative to complex physical and engineering experiments. The Kriging model is the most widely used surrogate model, with the core goal of minimizing the discrepancy between the surrogate and true models across the entire experimental domain. However, existing sequential design methods have critical limitations: observation-based batch sequential designs are rarely studied, while one-point sequential designs make insufficient use of the available information and use experimental resources inefficiently -- they require numerous rounds of observation to accumulate sufficient points, leading to prolonged experimental cycles. To address these gaps, this paper proposes two novel one-point sequential design criteria and a general batch sequential design framework. Moreover, the batch sequential design framework solves the inherent point clustering problem in naive batch selection, enabling efficient extension of any sequential criterion to batch scenarios. Simulations on several test functions demonstrate that the proposed methods outperform existing approaches in terms of fitting accuracy in most cases.
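
For orientation, the sketch below runs a generic one-point sequential design for a Gaussian process surrogate using the classical maximum-predictive-variance criterion; the test function, kernel settings, and candidate grid are invented for illustration, and this is not one of the criteria or the batch framework proposed in the paper.

```python
# Generic one-point sequential design sketch for a Kriging (Gaussian process)
# surrogate: at each round, add the candidate with the largest predictive
# standard deviation. This is the classical maximum-variance criterion, not
# the criteria or the batch framework proposed in the paper.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def truth(x):
    return np.sin(3.0 * x) + 0.5 * x             # invented test function

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0, size=(5, 1))            # initial design
y = truth(X).ravel()
candidates = np.linspace(0.0, 3.0, 301).reshape(-1, 1)

for round_ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5),
                                  alpha=1e-8, optimizer=None)
    gp.fit(X, y)
    _, sd = gp.predict(candidates, return_std=True)
    x_new = candidates[[np.argmax(sd)]]           # next design point
    X = np.vstack([X, x_new])
    y = np.append(y, truth(x_new).ravel())

print("final design size:", len(X))
```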


[10] 2601.16470

Variational Dimension Lifting for Robust Tracking of Nonlinear Stochastic Dynamics

Nonlinear stochastic motion presents significant challenges for Bayesian particle tracking. To address this challenge, this paper proposes a framework to construct an invertible transformation that maps the nonlinear state-space model (SSM) into a higher-dimensional linear Gaussian SSM. This approach allows the application of standard linear-Gaussian inference techniques while maintaining a connection to the dynamics of the original system. The paper derives the necessary conditions for such transformations using Ito's lemma and variational calculus, and illustrates the method on a bistable cubic motion model, radial Brownian process model, and a logistic model with multiplicative noise. Simulations confirm that the transformed linear systems, when projected back, accurately reconstruct the nonlinear dynamics and, in distinct regimes of stiffness and singularity, yield tracking accuracy competitive with conventional filters, while avoiding their structural instabilities.


[11] 2601.16561

Markov Stick-breaking Processes

Stick-breaking has a long history and is one of the most popular procedures for constructing random discrete distributions in Statistics and Machine Learning. In particular, due to its intuitive construction and computational tractability, stick-breaking is ubiquitous in modern Bayesian nonparametric inference. Most widely used models, such as the Dirichlet and the Pitman-Yor processes, rely on iid or independent length variables. Here we pursue a completely unexplored research direction by considering Markov length variables and investigate the corresponding general class of stick-breaking processes, which we term Markov stick-breaking processes. We establish conditions under which the associated species sampling process is proper and the distribution of a Markov stick-breaking process has full topological support, two fundamental desiderata for Bayesian nonparametric models. We also analyze the stochastic ordering of the weights and provide a new characterization of the Pitman-Yor process as the only stick-breaking process invariant under size-biased permutations, under mild conditions. Moreover, we identify two notable subclasses of Markov stick-breaking processes that enjoy appealing properties and include Dirichlet, Pitman-Yor and Geometric priors as special cases. Our findings include distributional results enabling posterior inference algorithms and methodological insights.
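
The sketch below generates stick-breaking weights from iid Beta lengths (the Dirichlet process case) and from a simple Markov chain of lengths; the particular Markov specification is only a toy illustration of dependent lengths, not one of the subclasses characterized in the paper.

```python
# Illustrative stick-breaking weights. The first sampler uses iid Beta(1, alpha)
# lengths (Dirichlet process). The second makes the lengths a simple Markov
# chain; this particular chain is an invented toy example of dependent lengths,
# not one of the Markov stick-breaking subclasses characterized in the paper.
import numpy as np

rng = np.random.default_rng(0)

def stick_break(lengths):
    # w_j = v_j * prod_{i<j} (1 - v_i)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - lengths[:-1])])
    return lengths * remaining

alpha, K = 2.0, 50

# iid lengths: Dirichlet process with concentration alpha
w_iid = stick_break(rng.beta(1.0, alpha, size=K))

# Markov lengths: each V_j is Beta with a shape pulled toward the previous V_{j-1}
v = np.empty(K)
v[0] = rng.beta(1.0, alpha)
for j in range(1, K):
    v[j] = rng.beta(1.0 + 5.0 * v[j - 1], alpha)   # toy dependence on previous length
w_markov = stick_break(v)

print("iid weights sum to (truncated)   :", w_iid.sum())
print("Markov weights sum to (truncated):", w_markov.sum())
```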


[12] 2601.16585

Variational approximate penalized credible regions for Bayesian grouped regression

We develop a fast and accurate grouped penalized credible region approach for variable selection and prediction in Bayesian high-dimensional linear regression. Most existing Bayesian methods either are subject to high computational costs due to long Markov Chain Monte Carlo runs or yield ambiguous variable selection results due to non-sparse solution output. The penalized credible region framework yields sparse post-processed estimates that facilitate unambiguous grouped variable selection. High estimation accuracy is achieved by shrinking noise from unimportant groups using a grouped global-local shrinkage prior. To ensure computational scalability, we approximate posterior summaries using coordinate ascent variational inference and recast the penalized credible region framework as a convex optimization problem that admits efficient computations. We prove that the resultant post-processed estimators are both parameter-consistent and variable selection consistent in high-dimensional settings. Theory is developed to justify running the coordinate ascent algorithm for at least two cycles. Through extensive simulations, we demonstrate that our proposed method outperforms state-of-the-art methods in grouped variable selection, prediction, and computation time for several common models including ANOVA and nonparametric varying coefficient models.


[13] 2601.16595

Bayesian Nonparametric Causal Inference for High-Dimensional Nutritional Data via Factor-Based Exposure Mapping

Diet plays a crucial role in health, and understanding the causal effects of dietary patterns is essential for informing public health policy and personalized nutrition strategies. However, causal inference in nutritional epidemiology faces several challenges: (i) high-dimensional and correlated food/nutrient intake data induce massive treatment levels; (ii) nutritional studies are interested in latent dietary patterns rather than single food items; and (iii) the goal is to estimate heterogeneous causal effects of these dietary patterns on health outcomes. We address these challenges by introducing a sophisticated exposure mapping framework that reduces the high-dimensional treatment space via factor analysis and enables the identification of dietary patterns. We also extend the Bayesian Causal Forest to accommodate three ordered levels of dietary exposure, better capturing the complex structure of nutritional data and enabling estimation of heterogeneous causal effects. We evaluate the proposed method through extensive simulations and apply it to a multi-center epidemiological study of Hispanic/Latino adults residing in the US. Using high-dimensional dietary data, we identify six dietary patterns and estimate their causal link with two key health risk factors: body mass index and fasting insulin levels. Our findings suggest that higher consumption of plant lipid-antioxidant, plant-based, animal protein, and dairy product patterns is associated with reduced risk.


[14] 2601.16597

Efficient Learning of Stationary Diffusions with Stein-type Discrepancies

Learning a stationary diffusion amounts to estimating the parameters of a stochastic differential equation whose stationary distribution matches a target distribution. We build on the recently introduced kernel deviation from stationarity (KDS), which enforces stationarity by evaluating expectations of the diffusion's generator in a reproducing kernel Hilbert space. Leveraging the connection between KDS and Stein discrepancies, we introduce the Stein-type KDS (SKDS) as an alternative formulation. We prove that a vanishing SKDS guarantees alignment of the learned diffusion's stationary distribution with the target. Furthermore, under broad parametrizations, SKDS is convex with an empirical version that is $\epsilon$-quasiconvex with high probability. Empirically, learning with SKDS attains comparable accuracy to KDS while substantially reducing computational cost and yields improvements over the majority of competitive baselines.
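
For background, the standard Langevin Stein operator and kernel Stein discrepancy to which such constructions connect are recalled below; the KDS and SKDS objects in the paper are defined through the diffusion's generator and are more specific than these textbook identities.

```latex
% Background identities (standard, not the paper's exact definitions): the
% Langevin Stein operator for a target density p, its zero-mean property under
% p, and the resulting Stein discrepancy over an RKHS ball.
\[
  (\mathcal{A}_p f)(x) \;=\; \langle \nabla \log p(x),\, f(x) \rangle \;+\; \nabla \cdot f(x),
  \qquad
  \mathbb{E}_{X \sim p}\big[(\mathcal{A}_p f)(X)\big] = 0,
\]
\[
  \mathrm{KSD}(q \,\|\, p) \;=\; \sup_{\|f\|_{\mathcal{H}^d} \le 1}\;
  \mathbb{E}_{X \sim q}\big[(\mathcal{A}_p f)(X)\big].
\]
```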


[15] 2601.16636

Conformal prediction for full and sparse polynomial chaos expansions

Polynomial Chaos Expansions (PCEs) are widely recognized for their efficient computational performance in surrogate modeling. Yet, a robust framework to quantify local model errors is still lacking. While the local uncertainty of PCE prediction can be captured using bootstrap resampling, other methods offering more rigorous statistical guarantees are needed, especially in the context of small training datasets. Recently, conformal prediction has demonstrated strong potential in machine learning, providing statistically robust and model-agnostic prediction intervals. Due to its generality and versatility, conformal prediction can be adapted to a wide variety of problems, making it a compelling choice for PCE-based surrogate models. In this contribution, we explore its application to such models. More precisely, we present the integration of two conformal prediction methods, namely the full conformal and the Jackknife+ approaches, into both full and sparse PCEs. For full PCEs, we introduce computational shortcuts inspired by the inherent structure of regression methods to optimize the implementation of both conformal methods. For sparse PCEs, we incorporate the two approaches with appropriate modifications to the inference strategy, thereby circumventing the non-symmetrical nature of the regression algorithm and ensuring valid prediction intervals. Our developments yield better-calibrated prediction intervals for both full and sparse PCEs, achieving superior coverage over existing approaches, such as the bootstrap, while maintaining a moderate computational cost.
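
A minimal sketch of the Jackknife+ construction around a generic least-squares surrogate is given below; ordinary polynomial least squares stands in for the PCE, the data and degree are invented for illustration, and the computational shortcuts developed in the paper are not shown.

```python
# Jackknife+ sketch around a generic least-squares surrogate. Ordinary
# polynomial least squares stands in for the PCE here, and the computational
# shortcuts developed in the paper are not shown.
import numpy as np

rng = np.random.default_rng(0)
n = 80
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=n)   # invented toy data

def design(x, degree=4):
    return np.vander(x, degree + 1, increasing=True)

def fit_predict(x_train, y_train, x_eval):
    coef, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)
    return design(np.asarray(x_eval)) @ coef

alpha = 0.1
x_test = np.linspace(-1.0, 1.0, 5)
loo_preds = np.empty((n, len(x_test)))
loo_resid = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    loo_preds[i] = fit_predict(x[mask], y[mask], x_test)
    loo_resid[i] = abs(y[i] - fit_predict(x[mask], y[mask], [x[i]])[0])

# Jackknife+ interval (simple quantile version; the exact finite-sample rule
# uses order statistics of the n+1 augmented values)
lower = np.quantile(loo_preds - loo_resid[:, None], alpha, axis=0)
upper = np.quantile(loo_preds + loo_resid[:, None], 1.0 - alpha, axis=0)
for xt, lo, hi in zip(x_test, lower, upper):
    print(f"x = {xt:+.2f}   interval = [{lo:+.3f}, {hi:+.3f}]")
```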


[16] 2601.16684

Asymptotic testing of covariance separability for matrix elliptical data

We propose a new asymptotic test for the separability of a covariance matrix. The null distribution is valid in a wide matrix elliptical model that includes, in particular, both the matrix Gaussian and the matrix $t$-distribution. The test is fast to compute and makes no assumptions about the component covariance matrices. An alternative, Wald-type version of the test is also proposed. Our simulations reveal that both versions of the test have good power even for heavier-tailed distributions and can compete with the Gaussian likelihood ratio test in the case of normal data.


[17] 2601.16696

Faster parallel MCMC: Metropolis adjustment is best served warm

Despite the enormous success of Hamiltonian Monte Carlo and related Markov Chain Monte Carlo (MCMC) methods, sampling often still represents the computational bottleneck in scientific applications. Availability of parallel resources can significantly speed up MCMC inference by running a large number of chains in parallel, each collecting a single sample. However, the parallel approach converges slowly if the chains are not initialized close to the target distribution (cold start). Theoretically, this can be resolved by initially running MCMC without Metropolis-Hastings adjustment to quickly converge to the vicinity of the target distribution and then turning on adjustment to achieve fine convergence. However, no practical scheme uses this strategy, due to the difficulty of automatically selecting the step size during the unadjusted phase. We here develop Late Adjusted Parallel Sampler (LAPS), which is precisely such a scheme and is applicable out of the box: all hyperparameters are selected automatically. LAPS takes advantage of ensemble-based hyperparameter adaptation to estimate the bias at each iteration and converts it to the appropriate step size. We show that LAPS consistently and significantly outperforms ensemble adjusted methods such as MEADS or ChESS and the optimization-based initializer Pathfinder on a variety of standard benchmark problems. LAPS typically achieves two orders of magnitude lower wall-clock time than the corresponding sequential algorithms such as NUTS.
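
A toy single-chain illustration of the underlying idea (an unadjusted warm-up phase followed by Metropolis-adjusted sampling) is sketched below; this is not LAPS itself, which uses ensemble-based step-size adaptation and many parallel chains, and the Langevin dynamics, target, and step size here are invented for illustration.

```python
# Toy illustration of "unadjusted warm-up, then Metropolis adjustment" on a 1D
# standard normal target. This is not LAPS (no ensemble-based adaptation,
# single chain, Langevin rather than Hamiltonian dynamics); it only shows the
# idea of switching the accept/reject step on after a cheap unadjusted phase.
import numpy as np

rng = np.random.default_rng(0)
logp = lambda x: -0.5 * x**2      # log density of N(0, 1), up to a constant
grad = lambda x: -x

def langevin_step(x, eps):
    return x + 0.5 * eps**2 * grad(x) + eps * rng.normal()

x, eps = 50.0, 0.8                 # deliberately cold start, far from the target
for _ in range(200):               # unadjusted phase: fast drift toward the target
    x = langevin_step(x, eps)

samples = []
for _ in range(2000):              # adjusted phase (MALA): exact stationary target
    prop = langevin_step(x, eps)
    log_q = lambda a, b: -((a - b - 0.5 * eps**2 * grad(b))**2) / (2 * eps**2)
    log_accept = logp(prop) - logp(x) + log_q(x, prop) - log_q(prop, x)
    if np.log(rng.uniform()) < log_accept:
        x = prop
    samples.append(x)

print("posterior mean estimate:", np.mean(samples), " (target 0)")
```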


[18] 2601.16739

Comments on "Challenges of cellwise outliers" by Jakob Raymaekers and Peter J. Rousseeuw

The main aim of robust statistics is the development of methods able to cope with the presence of outliers. A new type of outliers, namely "cellwise", has garnered considerable attention. The state of the art for dealing with cellwise contamination in different models is presented in Raymaekers and Rousseeuw (2024). Outliers in time series can be treated as cellwise outliers; a further discussion of this subject is presented.


[19] 2601.16749

Finite Population Inference for Factorial Designs and Panel Experiments with Imperfect Compliance

This paper develops a finite population framework for analyzing causal effects in settings with imperfect compliance where multiple treatments affect the outcome of interest. Two prominent examples are factorial designs and panel experiments with imperfect compliance. I define finite population causal effects that capture the relative effectiveness of alternative treatment sequences. I provide nonparametric estimators for a rich class of factorial and dynamic causal effects and derive their finite population distributions as the sample size increases. Monte Carlo simulations illustrate the desirable properties of the estimators. Finally, I use the estimator for causal effects in factorial designs to revisit a famous voter mobilization experiment that analyzes the effects of voting encouragement through phone calls on turnout.


[20] 2601.16769

From Noisy News Sentiment Scores to Interpretable Temporal Dynamics: A Bayesian State-Space Model

Text-based sentiment indicators are widely used to monitor public and market mood, but weekly sentiment series are noisy by construction. A main reason is that the amount of relevant news changes over time and across categories. As a result, some weekly averages are based on many articles, while others rely on only a few. Existing approaches do not explicitly account for changes in data availability when measuring uncertainty. We present a Bayesian state-space framework that turns aggregated news sentiment into a smoothed time series with uncertainty. The model treats each weekly sentiment value as a noisy measurement of an underlying sentiment process, with observation uncertainty scaled by the effective information weight $n_{tj}$: when coverage is high, latent sentiment is anchored more strongly to the observed aggregate; when coverage is low, inference relies more on the latent dynamics and uncertainty increases. Using news data grouped into multiple categories, we find broadly similar latent dynamics across categories, while larger differences appear in observation noise. The framework is designed for descriptive monitoring and can be extended to other text sources where information availability varies over time.
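
The coverage-dependent measurement idea can be illustrated with a single-series local-level Kalman filter in which the observation variance is scaled by the weekly article count; the model below is a frequentist toy with invented parameters, not the Bayesian multi-category model of the paper.

```python
# Minimal sketch of the coverage-dependent measurement idea: a local-level
# state-space model filtered by a Kalman recursion in which the observation
# variance is sigma_obs^2 / n_t, so weeks with few articles (small n_t) pull
# the latent sentiment less and carry more uncertainty. The paper's model is
# Bayesian and multi-category; this single-series filter is only illustrative.
import numpy as np

rng = np.random.default_rng(0)
T = 120
n_t = rng.integers(2, 60, size=T)                  # invented weekly article counts
true_state = np.cumsum(0.05 * rng.normal(size=T))  # latent sentiment (random walk)
obs = true_state + rng.normal(size=T) * (0.4 / np.sqrt(n_t))

sigma_state2, sigma_obs2 = 0.05**2, 0.4**2
m, P = 0.0, 1.0                                    # filtering mean and variance
filtered_sd = np.empty(T)
for t in range(T):
    P = P + sigma_state2                           # predict
    R = sigma_obs2 / n_t[t]                        # observation variance scaled by n_t
    K = P / (P + R)
    m = m + K * (obs[t] - m)                       # update
    P = (1.0 - K) * P
    filtered_sd[t] = np.sqrt(P)

print("avg filtered sd, high-coverage weeks:", round(float(filtered_sd[n_t > 30].mean()), 4))
print("avg filtered sd, low-coverage weeks :", round(float(filtered_sd[n_t <= 10].mean()), 4))
```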


[21] 2601.16777

Kernel smoothing on manifolds

Under the assumption that data lie on a compact (unknown) manifold without boundary, we derive finite sample bounds for kernel smoothing and its (first and second) derivatives, and we establish asymptotic normality through Berry-Esseen type bounds. Special cases include kernel density estimation, kernel regression and the heat kernel signature. Connections to the graph Laplacian are also discussed.


[22] 2601.16784

Spectral embedding of inhomogeneous Poisson processes on multiplex networks

In many real-world networks, data on the edges evolve in continuous time, naturally motivating representations based on point processes. Heterogeneity in edge types further gives rise to multiplex network point processes. In this work, we propose a model for multiplex network data observed in continuous time. We establish two-to-infinity norm consistency and asymptotic normality for spectral-embedding-based estimation of the model parameters as both network size and time resolution increase. Drawing inspiration from random dot product graph models, each edge intensity is expressed as the inner product of two low-dimensional latent positions: one dynamic and layer-agnostic, the other static and layer-dependent. These latent positions constitute the primary objects of inference, which is conducted via spectral embedding methods. Our theoretical results are established under a histogram estimator of the network intensities and provide justification for applying a doubly unfolded adjacency spectral embedding method for estimation. Simulations and real-data analyses demonstrate the effectiveness of the proposed model and inference procedure.


[23] 2601.16813

A Fully Automated DM-BIM-BEM Pipeline Enabling Graph-Based Intelligence, Interoperability, and Performance-Driven Early Design

Artificial intelligence in construction increasingly depends on structured representations such as Building Information Models and knowledge graphs, yet early-stage building designs are predominantly created as flexible boundary-representation (B-rep) models that lack explicit spatial, semantic, and performance structure. This paper presents a robust, fully automated framework that transforms unstructured B-rep geometry into knowledge-graph-based Building Information Models and further into executable Building Energy Models. The framework enables artificial intelligence to explicitly interpret building elements, spatial topology, and their associated thermal and performance attributes. It integrates automated geometry cleansing, multiple auto space-generation strategies, graph-based extraction of space and element topology, ontology-aligned knowledge modeling, and reversible transformation between ontology-based BIM and EnergyPlus energy models. Validation on parametric, sketch-based, and real-world building datasets demonstrates high robustness, consistent topological reconstruction, and reliable performance-model generation. By bridging design models, BIM, and BEM, the framework provides an AI-oriented infrastructure that extends BIM- and graph-based intelligence pipelines to flexible early-stage design geometry, enabling performance-driven design exploration and optimization by learning-based methods.


[24] 2601.16821

Directional-Shift Dirichlet ARMA Models for Compositional Time Series with Structural Break Intervention

Compositional time series, vectors of proportions summing to unity observed over time, frequently exhibit structural breaks due to external shocks, policy changes, or market disruptions. Standard methods either ignore such breaks or handle them through ad-hoc dummy variables that cannot extrapolate beyond the estimation sample. We develop a Bayesian Dirichlet ARMA model augmented with a directional-shift intervention mechanism that captures structural breaks through three interpretable parameters: a unit direction vector specifying which components gain or lose share, an amplitude controlling the magnitude of redistribution, and a logistic gate governing the timing and speed of transition. The model preserves compositional constraints by construction, maintains innovation-form DARMA dynamics for short-run dependence, and produces coherent probabilistic forecasts during and after structural breaks. We establish that the directional shift corresponds to geodesic motion on the simplex and is invariant to the choice of ILR basis. A comprehensive simulation study with 400 fits across 8 scenarios demonstrates that when the shift direction is correctly identified (77.5% of cases), amplitude and timing parameters are recovered with near-zero bias, and credible intervals for the mean composition achieve nominal 80% coverage; we address the sign identification challenge through a hemisphere constraint. An empirical application to fee recognition lead-time distributions during COVID-19 compares baseline, fixed-effects, and intervention specifications in rolling forecast evaluation, demonstrating the intervention model's superior point accuracy (Aitchison distance 0.83 vs. 0.90) and calibration (87% vs. 71% coverage) during structural transitions.
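
A sketch of the directional-shift mechanism (unit direction, amplitude, logistic gate) applied in ILR coordinates is given below; the DARMA dynamics, priors, and estimation are not shown, the baseline composition and parameters are invented for illustration, and the ILR basis is just one standard choice (the paper states the shift is invariant to this choice).

```python
# Sketch of the directional-shift mechanism in ILR coordinates: a unit
# direction d, an amplitude A, and a logistic gate g_t move the composition
# smoothly through a structural break. DARMA dynamics, priors, and estimation
# are not shown; the baseline, direction, and timing parameters are invented.
import numpy as np

def ilr_basis(D):
    # orthonormal basis of the subspace of R^D orthogonal to the vector of ones
    cols = np.hstack([np.ones((D, 1)) / np.sqrt(D), np.eye(D)[:, : D - 1]])
    return np.linalg.qr(cols)[0][:, 1:]

def ilr(x, V):
    return np.log(x) @ V

def ilr_inv(z, V):
    w = np.exp(V @ z)
    return w / w.sum()                               # closure back to the simplex

D = 4
V = ilr_basis(D)
baseline = np.array([0.40, 0.30, 0.20, 0.10])

direction = np.array([1.0, -1.0, 0.0])               # unit direction in ILR space
direction /= np.linalg.norm(direction)
amplitude, t0, speed = 0.8, 10.0, 1.5                # shift size, break time, transition speed

for t in [0, 8, 10, 12, 20]:
    gate = 1.0 / (1.0 + np.exp(-speed * (t - t0)))   # logistic gate in [0, 1]
    z_t = ilr(baseline, V) + amplitude * gate * direction
    print(f"t = {t:2d}  composition = {np.round(ilr_inv(z_t, V), 3)}")
```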


[25] 2601.16829

Directional Asymmetry in Edge-Based Spatial Models via a Skew Normal Prior

We introduce a skewed edge-based spatial prior, named RENeGe sk, that extends the Gaussian RENeGe framework by incorporating directional asymmetry through a skew normal distribution. Skewness is defined on the edge graph and propagated to the node space, aligning asymmetric behavior with transitions across neighboring regions rather than with marginal node effects. The model is formulated within the skew normal framework and employs identifiable hierarchical priors together with low rank parameterizations to ensure scalability. The stochastic representation of the skew normal is exploited to facilitate the computational implementation. Simulation studies show that RENeGe sk recovers compact, edge-aligned directional structure more accurately than symmetric Gaussian priors, while remaining competitive under irregular spatial patterns. An application to cancer incidence data in Southern Brazil illustrates how the proposed approach yields stable area-level estimates while preserving localized, directionally driven spatial variation.


[26] 2601.16837

Spillovers and Co-movements in Multivariate Volatility: A Vector Multiplicative Error Model

Recent developments in financial time series focus on modeling volatility across multiple assets or indices in a multivariate framework, accounting for potential interactions such as spillover effects. Furthermore, the increasing integration of global financial markets gives rise to similar dynamics across markets (referred to as co-movement). In this context, we introduce a novel model for volatility vectors within the Multiplicative Error Model (MEM) class. This framework accommodates both spillover and co-movement effects through a distinct latent component. By adopting a specific parameterization, the model remains computationally feasible even for high-dimensional volatility vectors. To reduce the number of unknown coefficients, we propose a simple model-based clustering procedure. We illustrate the effectiveness of the proposed approach through an empirical application to 29 assets of the Dow Jones Industrial Average index, providing insight into volatility spillovers and shared market dynamics. Comparative analysis against alternative vector MEMs, including a fully parameterized version of the proposed model, demonstrates its superior or at least comparable performance across multiple evaluation criteria.


[27] 2601.16842

Parametric Mean-Field empirical Bayes in high-dimensional linear regression

In this paper, we consider the problem of parametric empirical Bayes estimation of an i.i.d. prior in high-dimensional Bayesian linear regression, with random design. We obtain the asymptotic distribution of the variational Empirical Bayes (vEB) estimator, which approximately maximizes a variational lower bound of the intractable marginal likelihood. We characterize a sharp phase transition behavior for the vEB estimator -- namely that it is information theoretically optimal (in terms of limiting variance) up to $p=o(n^{2/3})$ while it suffers from a sub-optimal convergence rate in higher dimensions. In the first regime, i.e., when $p=o(n^{2/3})$, we show how the estimated prior can be calibrated to enable valid coordinate-wise and delocalized inference, both under the \emph{empirical Bayes posterior} and the oracle posterior. In the second regime, we propose a debiasing technique as a way to improve the performance of the vEB estimator beyond $p=o(n^{2/3})$. Extensive numerical experiments corroborate our theoretical findings.


[28] 2601.16932

Identifying heat-related diagnoses in emergency department visits among adults in Chicago: a heat-wide association study

Extreme heat is an escalating public health concern. Although prior studies have examined heat-health associations, their reliance on restricted diagnoses and diagnostic categories misses or misclassifies heat-related illness. We conducted a heat-wide association study to identify acute-care diagnoses associated with extreme heat in Chicago, Illinois. Using 916,904 acute-care visits -- including emergency department and urgent care encounters -- among 372,140 adults across five healthcare systems from 2011-2023, we applied a two-stage analytic approach: quasi-Poisson regression to screen 1,803 diagnosis codes for heat-related risks, followed by distributed lag non-linear models in a time-stratified case-crossover design to refine the list of heat-related diagnoses and estimate same-day and short-term cumulative odds ratios of acute-care visits during extreme heat versus reference temperature. We observed same-day increases in visits for heat illness, volume depletion, hypotension, edema, acute kidney failure, and multiple injuries. By analyzing the full diagnostic spectrum of acute-care services, this study comprehensively characterizes heat-associated morbidity, reinforcing and advancing existing literature.


[29] 2601.16945

A new class of colored Gaussian graphical models with explicit normalizing constants

We study Bayesian model selection in colored Gaussian graphical models (CGGMs), which combine sparsity of conditional independencies with symmetry constraints encoded by vertex- and edge-colored graphs. A computational bottleneck in Bayesian inference for CGGMs is the evaluation of Diaconis-Ylvisaker normalizing constants, given by gamma-type integrals over cones of precision matrices with prescribed zeros and equality constraints. While explicit formulas are known for standard Gaussian graphical models only in special cases (e.g. decomposable graphs) and for a limited class of RCOP models, no general tractable framework has been available for broader families of CGGMs. We introduce a new subclass of RCON models for which these normalizing constants admit closed-form expressions. On the algebraic side, we identify conditions on spaces of colored precision matrices that guarantee tractability of the associated integrals, leading to Block-Cholesky spaces (BC-spaces) and Diagonally Commutative Block-Cholesky spaces (DCBC-spaces). On the combinatorial side, we characterize the colored graphs inducing such spaces via a color perfect elimination ordering and a 2-path regularity condition, and define the resulting Color Elimination-Regular (CER) graphs and their symmetric variants. This class strictly extends decomposable graphs in the uncolored setting and contains all RCOP models associated with decomposable graphs. In the one-color case, our framework reveals a close connection between DCBC-spaces and Bose-Mesner algebras. For models defined on BC- and DCBC-spaces, we derive explicit closed-form formulas for the normalizing constants in terms of a finite collection of structure constants and propose an efficient method for computing them in the commutative case. Our results broaden the range of CGGMs amenable to principled Bayesian structure learning in high-dimensional applications.


[30] 2601.16220

Towards Latent Diffusion Suitable For Text

Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of continuous diffusion models to discrete state spaces. NFDM learns a multivariate forward process from the data, ensuring that the forward process and generative trajectory are a good fit for language modeling. Our model substantially reduces the likelihood gap with autoregressive models of the same size, while achieving sample quality comparable to that of previous latent diffusion models.


[31] 2601.16631

PanopMamba: Vision State Space Modeling for Nuclei Panoptic Segmentation

Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder-decoder architecture that integrates Mamba and Transformer with additional feature-enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)-based fusion network to enable efficient long-range perception in pyramid features, thereby extending the pure encoder-decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM-based feature-enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba-based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image-level Panoptic Quality ($i$PQ), boundary-weighted PQ ($w$PQ), and frequency-weighted PQ ($fw$PQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state-of-the-art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at this https URL.


[32] 2601.16775

LLM-powered Real-time Patent Citation Recommendation for Financial Technologies

Rapid financial innovation has been accompanied by a sharp increase in patenting activity, making timely and comprehensive prior-art discovery more difficult. This problem is especially evident in financial technologies, where innovations develop quickly, patent collections grow continuously, and citation recommendation systems must be updated as new applications arrive. Existing patent retrieval and citation recommendation methods typically rely on static indexes or periodic retraining, which limits their ability to operate effectively in such dynamic settings. In this study, we propose a real-time patent citation recommendation framework designed for large and fast-changing financial patent corpora. Using a dataset of 428,843 financial patents granted by the China National Intellectual Property Administration (CNIPA) between 2000 and 2024, we build a three-stage recommendation pipeline. The pipeline uses large language model (LLM) embeddings to represent the semantic content of patent abstracts, applies efficient approximate nearest-neighbor search to construct a manageable candidate set, and ranks candidates by semantic similarity to produce top-k citation recommendations. In addition to improving recommendation accuracy, the proposed framework directly addresses the dynamic nature of patent systems. By using an incremental indexing strategy based on hierarchical navigable small-world (HNSW) graphs, newly issued patents can be added without rebuilding the entire index. A rolling day-by-day update experiment shows that incremental updating improves recall while substantially reducing computational cost compared with rebuild-based indexing. The proposed method also consistently outperforms traditional text-based baselines and alternative nearest-neighbor retrieval approaches.


[33] 2601.16830

Uncertainty propagation through trained multi-layer perceptrons: Exact analytical results

We give analytical results for propagation of uncertainty through trained multi-layer perceptrons (MLPs) with a single hidden layer and ReLU activation functions. More precisely, we give expressions for the mean and variance of the output when the input is multivariate Gaussian. In contrast to previous results, we obtain exact expressions without resort to a series expansion.
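
The building block behind such results is the closed-form first and second moment of a single ReLU unit under a Gaussian pre-activation, which are standard; the sketch below checks them against Monte Carlo with an invented weight vector and input covariance. The paper assembles results of this kind into exact output means and variances for a trained one-hidden-layer MLP.

```python
# Standard closed forms for a single ReLU unit: if the pre-activation
# u = w.x + b is N(mu, sigma^2), then
#   E[ReLU(u)]   = mu * Phi(mu/sigma) + sigma * phi(mu/sigma)
#   E[ReLU(u)^2] = (mu^2 + sigma^2) * Phi(mu/sigma) + mu * sigma * phi(mu/sigma)
# The weights, bias, and input covariance below are invented for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d = 5
w, b = rng.normal(size=d), 0.3
mean_x, cov_x = rng.normal(size=d), np.diag(rng.uniform(0.5, 2.0, size=d))

mu = w @ mean_x + b
sigma = np.sqrt(w @ cov_x @ w)
a = mu / sigma

relu_mean = mu * norm.cdf(a) + sigma * norm.pdf(a)
relu_second = (mu**2 + sigma**2) * norm.cdf(a) + mu * sigma * norm.pdf(a)
relu_var = relu_second - relu_mean**2

# Monte Carlo check
x = rng.multivariate_normal(mean_x, cov_x, size=1_000_000)
h = np.maximum(x @ w + b, 0.0)
print("mean     exact vs MC:", round(float(relu_mean), 4), round(float(h.mean()), 4))
print("variance exact vs MC:", round(float(relu_var), 4), round(float(h.var()), 4))
```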


[34] 2601.16848

Stochastic Modeling and Resource Dimensioning of Multi-Cellular Edge Intelligent Systems

Edge intelligence enables AI inference at the network edge, co-located with or near the radio access network, rather than in centralized clouds or on mobile devices. It targets low-latency, resource-constrained applications with large data volumes, requiring tight integration of wireless access and on-site computing. Yet system performance and cost-efficiency hinge on joint pre-deployment dimensioning of radio and computational resources, especially under spatial and temporal uncertainty. Prior work largely emphasizes run-time allocation or relies on simplified models that decouple radio and computing, missing end-to-end correlations in large-scale deployments. This paper introduces a unified stochastic framework to dimension multi-cell edge-intelligent systems. We model network topology with Poisson point processes, capturing random user and base-station locations, inter-cell interference, distance-based fractional power control, and peak-power constraints. By combining this with queueing theory and empirical AI inference workload profiling, we derive tractable expressions for end-to-end offloading delay. These enable a non-convex joint optimization that minimizes deployment cost under statistical QoS guarantees, expressed through strict tail-latency and inference-accuracy constraints. We prove the problem decomposes into convex subproblems, yielding global optimality. Numerical results in noise- and interference-limited regimes identify cost-efficient design regions and configurations that cause under-utilization or user unfairness. Smaller cells reduce transmission delay but raise per-request computing cost due to weaker server multiplexing, whereas larger cells show the opposite trend. Densification reduces computational costs only when frequency reuse scales with base-station density; otherwise, sparser deployments improve fairness and efficiency in interference-limited settings.


[35] 2601.16865

Distributional Instruments: Identification and Estimation with Quantile Least Squares

We study instrumental-variable designs where policy reforms strongly shift the distribution of an endogenous variable but only weakly move its mean. We formalize this by introducing distributional relevance: instruments may be purely distributional. Within a triangular model, distributional relevance suffices for nonparametric identification of average structural effects via a control function. We then propose Quantile Least Squares (Q-LS), which aggregates conditional quantiles of X given Z into an optimal mean-square predictor and uses this projection as an instrument in a linear IV estimator. We establish consistency, asymptotic normality, and the validity of standard 2SLS variance formulas, and we discuss regularization across quantiles. Monte Carlo designs show that Q-LS delivers well-centered estimates and near-correct size when mean-based 2SLS suffers from weak instruments. In Health and Retirement Study data, Q-LS exploits Medicare Part D-induced distributional shifts in out-of-pocket risk to sharpen estimates of its effects on depression.
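
A rough sketch of the Q-LS pipeline described in the abstract (conditional quantiles of X given Z estimated on a grid, aggregated by least squares into a projection of X, then used as the instrument in a linear IV estimator) is given below; the three-arm design, coefficients, and quantile grid are invented for illustration, and the regularization across quantiles and the inference results of the paper are not shown.

```python
# Illustrative Q-LS-style pipeline: quantile regressions of X on Z over a grid
# of quantile levels, least-squares aggregation into a projection of X, and a
# just-identified linear IV estimate using that projection as the instrument.
# The data-generating design below is invented; it is not the paper's setup.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 6000
z = rng.integers(0, 3, size=n)                          # three-arm "reform" instrument
u = rng.normal(size=n)                                  # unobserved confounder
x = 0.3 * z + (1.0 + 0.7 * z) * rng.normal(size=n) + 0.8 * u
y = 1.0 * x + 2.0 * u + rng.normal(size=n)              # true slope on x is 1.0

Z = sm.add_constant(np.column_stack([z == 1, z == 2]).astype(float))
taus = np.linspace(0.1, 0.9, 9)
q_hat = np.column_stack([sm.QuantReg(x, Z).fit(q=t).predict(Z) for t in taus])

# least-squares aggregation of the quantile predictions into a projection of x
proj = sm.OLS(x, sm.add_constant(q_hat)).fit().fittedvalues

def iv_slope(y_, x_, w_):
    # just-identified IV slope, instrumenting x_ with w_
    return np.cov(w_, y_)[0, 1] / np.cov(w_, x_)[0, 1]

print("OLS slope (biased by confounding):", round(iv_slope(y, x, x), 3))
print("Q-LS-style IV slope              :", round(iv_slope(y, x, proj), 3))
```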


[36] 2601.16884

Multigrade Neural Network Approximation

We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly non-convex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably one-hidden-layer $\texttt{ReLU}$ models, training admits convex reformulations with global guarantees, motivating learning paradigms that improve stability while scaling to depth. MGDL builds upon this insight by training deep networks grade by grade: previously learned grades are frozen, and each new residual block is trained solely to reduce the remaining approximation error, yielding an interpretable and stable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function, there exists a fixed-width multigrade $\texttt{ReLU}$ scheme whose residuals decrease strictly across grades and converge uniformly to zero. To the best of our knowledge, this work provides the first rigorous theoretical guarantee that grade-wise training yields provable vanishing approximation error in deep networks. Numerical experiments further illustrate the theoretical results.


[37] 2601.16897

FedSGM: A Unified Framework for Constraint Aware, Bidirectionally Compressed, Multi-Step Federated Optimization

We introduce FedSGM, a unified framework for federated constrained optimization that addresses four major challenges in federated learning (FL): functional constraints, communication bottlenecks, local updates, and partial client participation. Building on the switching gradient method, FedSGM provides projection-free, primal-only updates, avoiding expensive dual-variable tuning or inner solvers. To handle communication limits, FedSGM incorporates bi-directional error feedback, correcting the bias introduced by compression while explicitly accounting for the interaction between compression noise and multi-step local updates. We derive convergence guarantees showing that the averaged iterate achieves the canonical $\boldsymbol{\mathcal{O}}(1/\sqrt{T})$ rate, with additional high-probability bounds that decouple optimization progress from sampling noise due to partial participation. Additionally, we introduce a soft switching version of FedSGM to stabilize updates near the feasibility boundary. To our knowledge, FedSGM is the first framework to unify functional constraints, compression, multiple local updates, and partial client participation, establishing a theoretically grounded foundation for constrained federated learning. Finally, we validate the theoretical guarantees of FedSGM via experimentation on Neyman-Pearson classification and constrained Markov decision process (CMDP) tasks.
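
As background on one ingredient, the sketch below shows top-k compression with single-sided error feedback, the mechanism that corrects compression bias across rounds; FedSGM's bi-directional version, switching gradient steps, constraints, and partial participation are not shown, and the quadratic objective and dimensions are invented for illustration.

```python
# Generic sketch of top-k compression with error feedback: the mass dropped by
# the compressor is remembered and added back at the next round, so the bias
# does not accumulate. FedSGM's bi-directional error feedback, switching
# gradient steps, and constraint handling are not shown.
import numpy as np

rng = np.random.default_rng(0)

def top_k(v, k):
    # keep the k largest-magnitude coordinates, zero out the rest
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

d, k, lr, T = 200, 10, 0.1, 300
theta = np.zeros(d)
target = rng.normal(size=d)               # minimizer of the invented quadratic
memory = np.zeros(d)                      # accumulated compression error

for t in range(T):
    grad = theta - target                 # gradient of 0.5 * ||theta - target||^2
    update = lr * grad + memory           # add back previously dropped mass
    compressed = top_k(update, k)         # what would actually be communicated
    memory = update - compressed          # remember what was dropped
    theta -= compressed

print("distance to optimum with error feedback:",
      round(float(np.linalg.norm(theta - target)), 4))
```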


[38] 2601.16922

Group-realizable multi-group learning by minimizing empirical risk

The sample complexity of multi-group learning is shown to improve in the group-realizable setting over the agnostic setting, even when the family of groups is infinite so long as it has finite VC dimension. The improved sample complexity is obtained by empirical risk minimization over the class of group-realizable concepts, which itself could have infinite VC dimension. Implementing this approach is also shown to be computationally intractable, and an alternative approach is suggested based on improper learning.


[39] 2601.16959

Probabilistic Graphical Models in Astronomy

The field of astronomy is experiencing a data explosion driven by significant advances in observational instrumentation, and classical methods often fall short of addressing the complexity of modern astronomical datasets. Probabilistic graphical models offer powerful tools for uncovering the dependence structures and data-generating processes underlying a wide array of cosmic variables. By representing variables as nodes in a network, these models allow for the visualization and analysis of the intricate relationships that underpin theories of hierarchical structure formation within the universe. We highlight the value that graphical models bring to astronomical research by demonstrating their practical application to the study of exoplanets and host stars.


[40] 2601.16979

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($\lambda_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($\lambda_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $\Delta \mathbf{\theta}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($\lambda_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.


[41] 2305.01240

On the convergence of PINNs

Physics-informed neural networks (PINNs) are a promising approach that combines the power of neural networks with the interpretability of physical modeling. PINNs have shown good practical performance in solving partial differential equations (PDEs) and in hybrid modeling scenarios, where physical models enhance data-driven approaches. However, it is essential to establish their theoretical properties in order to fully understand their capabilities and limitations. In this study, we highlight that classical training of PINNs can suffer from systematic overfitting. This problem can be addressed by adding a ridge regularization to the empirical risk, which ensures that the resulting estimator is risk-consistent for both linear and nonlinear PDE systems. However, the strong convergence of PINNs to a solution satisfying the physical constraints requires a more involved analysis using tools from functional analysis and calculus of variations. In particular, for linear PDE systems, an implementable Sobolev-type regularization makes it possible to reconstruct a solution that not only achieves statistical accuracy but also maintains consistency with the underlying physics.


[42] 2405.07102

Nested Instrumental Variables Analysis: Switcher Average Treatment Effect, Identification, Efficient Estimation and Generalizability

Instrumental variables (IVs) are widely used to estimate causal effects from non-randomized data. A canonical example is a randomized trial with noncompliance, in which the randomized treatment assignment serves as an IV for the non-ignorable treatment received. Under a monotonicity assumption, a valid IV nonparametrically identifies the average treatment effect among a latent complier subgroup, whose generalizability is often under debate. In many studies, there exist multiple versions of an IV, for instance, different nudges to take the same treatment in different study sites in a multicenter clinical trial. These different versions of an IV may result in different compliance rates and offer a unique opportunity to study IV estimates' generalizability. In this article, we introduce a novel nested IV assumption and study identification of the average treatment effect among two latent subgroups: always-compliers and switchers, who are defined based on the joint potential treatment received under two versions of a binary IV. We derive the efficient influence function for the SWitcher Average Treatment Effect (SWATE) under a nonparametric model and propose efficient estimators. We then propose formal statistical tests of the generalizability of IV estimates under the nested IV framework. The proposed tests are flexible nonparametric generalizations of classical overidentification tests that allow estimating nuisance parameters using machine learning tools. We apply the proposed method to the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial and study the causal effect of colorectal cancer screening and its generalizability.


[43] 2406.00778

Bayesian Joint Additive Factor Models for Multiview Learning

It is increasingly common to collect data of multiple different types on the same set of samples. Our focus is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. To address these challenges, we introduce two complementary factor regression models. A baseline Joint Factor Regression (\textsc{jfr}) captures combined variation across views via a single factor set, and a more nuanced Joint Additive FActor Regression (\textsc{jafar}) that decomposes variation into shared and view-specific components. For \textsc{jfr}, we use independent cumulative shrinkage process (\textsc{i-cusp}) priors, while for \textsc{jafar} we develop a dependent version (\textsc{d-cusp}) designed to ensure identifiability of the components. We develop Gibbs samplers that exploit the model structure and accommodate flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (\texttt{R} package) is available at this https URL.


[44] 2502.01953

Local minima of the empirical risk in high dimension: General theorems and convex examples

We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional Gaussian vectors, the model is parametrized by $\mathbf{\Theta}\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection $\mathbf{\Theta}^\mathsf{T}\mathbf{x}_i$. This setting covers as special cases classical statistics methods (e.g. multinomial regression and other generalized linear models), but also two-layer fully connected neural networks with $k$ hidden neurons. We use the Kac-Rice formula from Gaussian process theory to derive a bound on the expected number of local minima of this empirical risk, under the proportional asymptotics in which $n,d\to\infty$, with $n\asymp d$. Via Markov's inequality, this bound allows to determine the positions of these minimizers (with exponential deviation bounds) and hence derive sharp asymptotics on the estimation and prediction error. As a special case, we apply our characterization to convex losses. We show that our approach is tight and allows to prove previously conjectured results. In addition, we characterize the spectrum of the Hessian at the minimizer. A companion paper applies our general result to non-convex examples.


[45] 2502.06168

Dynamic Pricing with Adversarially-Censored Demands

We study an online dynamic pricing problem where the potential demand at each time period $t=1,2,\ldots, T$ is stochastic and dependent on the price. However, a perishable inventory level is imposed at the beginning of each period $t$, so the observed demand is censored whenever the potential demand exceeds the inventory level. To address this problem, we introduce a pricing algorithm based on optimistic estimates of derivatives. We show that our algorithm achieves $\tilde{O}(\sqrt{T})$ optimal regret even with adversarial inventory series. Our findings advance the state-of-the-art in online decision-making problems with censored feedback, offering a theoretically optimal solution against adversarial observations.


[46] 2502.14600

Spectral decomposition-assisted multi-study factor analysis

This article focuses on covariance estimation for multi-study data. Popular approaches employ factor-analytic terms with shared and study-specific loadings that decompose the variance into (i) a shared low-rank component, (ii) study-specific low-rank components, and (iii) a diagonal term capturing idiosyncratic variability. Our proposed methodology estimates the latent factors via spectral decompositions, with a novel approach for separating shared and specific factors, and infers the factor loadings and residual variances via surrogate Bayesian regressions. The resulting posterior has a simple product form across outcomes, bypassing the need for Markov chain Monte Carlo sampling and facilitating parallelization. The proposed methodology has major advantages over current Bayesian competitors in terms of computational speed, scalability and stability while also having strong frequentist guarantees. The theory and methods also add to the rich literature on frequentist methods for factor models with shared and group-specific components of variation. The approximation error decreases as the sample size and the data dimension diverge, formalizing a blessing of dimensionality. We show favorable asymptotic properties, including central limit theorems for point estimators and posterior contraction, and excellent empirical performance in simulations. The methods are applied to integrate three studies on gene associations among immune cells.


[47] 2503.12634

Clustered random forests with correlated data for optimal estimation and inference under potential covariate shift

We develop Clustered Random Forests, a random forests algorithm for clustered data, arising from independent groups that exhibit within-cluster dependence. The leaf-wise predictions of each decision tree making up a clustered random forest take the form of a weighted least squares estimator, which leverages correlations between observations for improved prediction accuracy and tighter confidence intervals when performing inference. We show that approximately linear time algorithms exist for fitting classes of clustered random forests, matching the computational complexity of standard random forests. Further, we observe that the optimality of a clustered random forest, with regard to how the optimal weights (those that minimise mean squared prediction error) are chosen within this framework, varies under covariate distribution shift. In light of this, we advocate that the weights be estimated with respect to a user-chosen covariate distribution, or a test dataset of covariates, for which optimal prediction or inference is desired. This highlights a key distinction between correlated and independent data with regard to the optimality of nonparametric conditional mean estimation under covariate shift. We demonstrate our theoretical findings numerically in a number of simulated and real-world settings.
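
As a rough illustration of the leaf-wise weighted least squares idea (a minimal sketch, not the authors' implementation), the snippet below computes a GLS-style leaf mean under an assumed exchangeable within-cluster correlation; the correlation value, noise variance, and function name are placeholders.

```python
# Minimal sketch: a leaf-level prediction that replaces the plain average with a
# GLS-style weighted mean, assuming an exchangeable within-cluster correlation
# rho for the observations falling in the leaf (illustrative values only).
import numpy as np

def leaf_weighted_mean(y, cluster_ids, rho=0.5, sigma2=1.0):
    """GLS estimate of the leaf mean: (1' S^{-1} 1)^{-1} 1' S^{-1} y,
    with S block-diagonal and exchangeable correlation rho within clusters."""
    y = np.asarray(y, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    n = len(y)
    S = sigma2 * np.eye(n)
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        for i in idx:
            for j in idx:
                if i != j:
                    S[i, j] = sigma2 * rho
    Sinv_one = np.linalg.solve(S, np.ones(n))
    return float(Sinv_one @ y / Sinv_one.sum())

# Toy usage: three correlated observations in one cluster, one in another.
y = [1.0, 1.2, 0.9, 3.0]
clusters = [0, 0, 0, 1]
print(leaf_weighted_mean(y, clusters, rho=0.6), np.mean(y))
```

With rho = 0 the estimator reduces to the ordinary leaf average; positive within-cluster correlation down-weights observations coming from large clusters.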


[48] 2504.21787

Estimation of discrete distributions in relative entropy, and the deviations of the missing mass

We study the problem of estimating a distribution over a finite alphabet from an i.i.d. sample, with accuracy measured in relative entropy (Kullback-Leibler divergence). While optimal bounds on the expected risk are known, high-probability guarantees remain less well-understood. First, we analyze the classical Laplace (add-one) estimator, obtaining matching upper and lower bounds on its performance and establishing its optimality among confidence-independent estimators. We then characterize the minimax-optimal high-probability risk and show that it is achieved by a simple confidence-dependent smoothing technique. Notably, the optimal non-asymptotic risk incurs an additional logarithmic factor compared to the ideal asymptotic rate. Next, motivated by regimes in which the alphabet size exceeds the sample size, we investigate methods that adapt to the sparsity of the underlying distribution. We introduce an estimator using data-dependent smoothing, for which we establish a high-probability risk bound depending on two effective sparsity parameters. As part of our analysis, we also derive a sharp high-probability upper bound on the missing mass.
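
For concreteness, here is a minimal sketch of the classical add-one (Laplace) estimator next to a generic confidence-dependent smoothing rule whose pseudo-count grows like log(1/delta); the specific rule is an illustrative assumption, not the estimator proposed in the paper.

```python
# Minimal sketch: add-one (Laplace) smoothing and a generic confidence-dependent
# variant whose pseudo-count scales with log(1/delta); the latter rule is an
# illustrative assumption, not the paper's estimator.
import numpy as np

def laplace_estimator(counts):
    counts = np.asarray(counts, dtype=float)
    return (counts + 1.0) / (counts.sum() + counts.size)

def confidence_smoothed_estimator(counts, delta=0.05):
    counts = np.asarray(counts, dtype=float)
    alpha = np.log(1.0 / delta)       # smoothing increases as the confidence level tightens
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy usage: compare the KL risk of the two estimators on a small alphabet.
rng = np.random.default_rng(0)
p_true = np.array([0.5, 0.3, 0.15, 0.05])
counts = np.bincount(rng.choice(4, size=200, p=p_true), minlength=4)
print(kl(p_true, laplace_estimator(counts)), kl(p_true, confidence_smoothed_estimator(counts)))
```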


[49] 2505.04613

Kernel Embeddings and the Separation of Measure Phenomenon

We prove that kernel covariance embeddings lead to information-theoretically perfect separation of distinct continuous probability distributions. In statistical terms, we establish that testing for the \emph{equality} of two non-atomic (Borel) probability measures on a locally compact Polish space is \emph{equivalent} to testing for the \emph{singularity} between two centered Gaussian measures on a reproducing kernel Hilbert space. The corresponding Gaussians are defined via the notion of kernel covariance embedding of a probability measure, and the Hilbert space is that generated by the embedding kernel. Distinguishing singular Gaussians is structurally simpler from an information-theoretic perspective than non-parametric two-sample testing, particularly in complex or high-dimensional domains. This is because singular Gaussians are supported on essentially separate and affine subspaces. Our proof leverages the classical Feldman-Hájek dichotomy, and shows that even a small perturbation of a continuous distribution will be maximally magnified through its Gaussian embedding. This ``separation of measure phenomenon'' appears to be a blessing of infinite dimensionality, by means of embedding, with the potential to inform the design of efficient inference tools in considerable generality. The elicitation of this phenomenon also appears to crystallize, in a precise and simple mathematical statement, a core mechanism underpinning the empirical effectiveness of kernel methods.


[50] 2506.16295

Understanding uncertainty in Bayesian cluster analysis

The Bayesian approach to clustering is often appreciated for its ability to provide uncertainty in the partition structure. However, summarizing the posterior distribution over the clustering structure can be challenging, due to the discrete, unordered nature and massive dimension of the space. While recent advancements provide a single clustering estimate to represent the posterior, this ignores uncertainty and may even be unrepresentative in instances where the posterior is multimodal. To enhance our understanding of uncertainty, we propose a WASserstein Approximation for Bayesian clusterIng (WASABI), which summarizes the posterior samples with not one, but multiple clustering estimates, each corresponding to a different part of the partition space that receives substantial posterior mass. Specifically, we find such clustering estimates by approximating the posterior distribution in a Wasserstein distance sense, equipped with a suitable metric on the partition space. An interesting byproduct is that a locally optimal solution can be found using a k-medoids-like algorithm on the partition space to divide the posterior samples into groups, each represented by one of the clustering estimates. Using synthetic and real datasets, we show that WASABI helps to improve the understanding of uncertainty, particularly when clusters are not well separated or when the employed model is misspecified.
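
The k-medoids-like step can be pictured as follows, using the variation of information as one possible metric on partitions; the metric choice, initialization, and plain alternating updates below are illustrative assumptions rather than the WASABI algorithm itself.

```python
# Minimal sketch (illustrative, not WASABI): summarize posterior partition samples
# by k medoid partitions under the variation-of-information (VI) distance.
import numpy as np

def variation_of_information(z1, z2):
    z1, z2 = np.asarray(z1), np.asarray(z2)
    n, vi = len(z1), 0.0
    for a in np.unique(z1):
        for b in np.unique(z2):
            nab = np.sum((z1 == a) & (z2 == b))
            if nab > 0:
                na, nb = np.sum(z1 == a), np.sum(z2 == b)
                vi -= (nab / n) * (np.log(nab / na) + np.log(nab / nb))
    return vi

def k_medoid_partitions(samples, k=2, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    m = len(samples)
    # pairwise VI distances between posterior partition samples
    D = np.array([[variation_of_information(samples[i], samples[j]) for j in range(m)]
                  for i in range(m)])
    medoids = rng.choice(m, size=k, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(D[:, medoids], axis=1)        # nearest medoid for each sample
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(assign == c)[0]
            if members.size:                             # medoid = member with smallest total distance
                new_medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return [samples[i] for i in medoids], assign

# Toy usage: four posterior "samples" of partitions of six items.
samples = [np.array(z) for z in ([0,0,0,1,1,1], [0,0,1,1,1,1], [0,1,0,1,0,1], [0,1,1,0,1,0])]
representatives, groups = k_medoid_partitions(samples, k=2)
```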


[51] 2506.20325

Robust estimation of a Markov chain transition matrix from multiple sample paths

Markov chains are fundamental models for stochastic dynamics, with applications in a wide range of areas such as population dynamics, queueing systems, reinforcement learning, and Monte Carlo methods. Estimating the transition matrix and stationary distribution from observed sample paths is a core statistical challenge, particularly when multiple independent trajectories are available. While classical theory typically assumes identical chains with known stationary distributions, real-world data often arise from heterogeneous chains whose transition kernels and stationary measures might differ from a common target. We analyse empirical estimators for such parallel Markov processes and establish sharp concentration inequalities that generalise Bernstein-type bounds from standard time averages to ensemble-time averages. Our results provide nonasymptotic error bounds and consistency guarantees in high-dimensional regimes, accommodating sparse or weakly mixing chains, model mismatch, nonstationary initialisations, and partially corrupted data. These findings offer rigorous foundations for statistical inference in heterogeneous Markov chain settings common in modern computational applications.
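
A natural baseline in this setting pools transition counts across all trajectories and normalizes rows (an ensemble-time average); the minimal sketch below uses a uniform fallback for unvisited states, which is an arbitrary convention assumed here.

```python
# Minimal sketch: pooled empirical estimator of a transition matrix from multiple
# sample paths. Rows of states that are never visited fall back to the uniform
# distribution (an assumed convention, not part of the paper's analysis).
import numpy as np

def estimate_transition_matrix(paths, n_states):
    counts = np.zeros((n_states, n_states))
    for path in paths:                                  # each path is a sequence of state indices
        for s, s_next in zip(path[:-1], path[1:]):
            counts[s, s_next] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), 1.0 / n_states)

# Toy usage: two short trajectories on a 3-state chain.
print(estimate_transition_matrix([[0, 1, 1, 2, 0], [2, 2, 1, 0]], n_states=3))
```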


[52] 2507.09905

Statistical Analysis of Conditional Group Distributionally Robust Optimization with Cross-Entropy Loss

In multi-source learning with discrete labels, distributional heterogeneity across domains poses a central challenge to developing predictive models that transfer reliably to unseen domains. We study multi-source unsupervised domain adaptation, where labeled data are available from multiple source domains and only unlabeled data are observed from the target domain. To address potential distribution shifts, we propose a novel Conditional Group Distributionally Robust Optimization (CG-DRO) framework that learns a classifier by minimizing the worst-case cross-entropy loss over the convex combinations of the conditional outcome distributions from source domains. We develop an efficient Mirror Prox algorithm for solving the minimax problem and employ a double machine learning procedure to estimate the risk function, ensuring that errors in nuisance estimation contribute only at higher-order rates. We establish fast statistical convergence rates for the empirical CG-DRO estimator by constructing two surrogate minimax optimization problems that serve as theoretical bridges. A distinguishing challenge for CG-DRO is the emergence of nonstandard asymptotics: the empirical CG-DRO estimator may fail to converge to a standard limiting distribution due to boundary effects and system instability. To address this, we introduce a perturbation-based inference procedure that enables uniformly valid inference, including confidence interval construction and hypothesis testing.


[53] 2509.25648

Chinese vs. World Bank Development Projects: Insights from Earth Observation and Computer Vision on Wealth Gains in Africa, 2002-2013

Debates about whether development projects improve living conditions persist, partly because observational estimates can be biased by incomplete adjustment and because reliable outcome data are scarce at the neighborhood level. We address both issues in a continent-scale, sector-specific evaluation of Chinese and World Bank projects across 9,899 neighborhoods in 36 African countries (2002-2013), representative of ~88% of the population. First, we use a recent dataset that measures living conditions with a machine-learned wealth index derived from contemporaneous satellite imagery, yielding a consistent panel of 6.7 km square mosaics. Second, to strengthen identification, we proxy officials' map-based placement criteria using pre-treatment daytime satellite images and fuse these with tabular covariates to estimate funder- and sector-specific ATEs via inverse-probability weighting. Incorporating imagery often shrinks effects relative to tabular-only models. On average, both donors raise wealth, with larger and more consistent gains for China; sector extremes in our sample include Trade and Tourism (330) for the World Bank (+12.29 IWI points), and Emergency Response (700) for China (+15.15). Assignment-mechanism analyses also show World Bank placement is often more predictable from imagery alone (as well as from tabular covariates). This suggests that Chinese project placements are more driven by non-visible, political, or event-driven factors than World Bank placements. To probe residual concerns about selection on observables, we also estimate within-neighborhood (unit) fixed-effects models at a spatial resolution about 67 times finer than prior fixed-effects analyses, leveraging the computer-vision-imputed IWI panels; these deliver smaller but, for Chinese projects, directionally consistent effects.


[54] 2510.11863

On the permutation equivariance principle for causal estimands

In many causal inference problems, multiple action variables, such as factors, mediators, or network units, often share a common causal role yet lack a natural ordering. To avoid ambiguity, the scientific interpretation of a vector of estimands should remain invariant under relabeling, an implicit principle we refer to as permutation equivariance. Permutation equivariance can be understood as the property that permuting the variables permutes the estimands in a trackable manner, such that scientific meaning is preserved. We formally characterize this principle and study its combinatorial algebra. We present a class of weighted estimands that project unstructured potential outcome means into a vector of permutation equivariant and interpretable estimands capturing all orders of interaction. To guide practice, we discuss the implications and choices of weights and define residual-free estimands, whose inclusion-exclusion sums capture the maximal effect, which is useful in contexts such as causal mediation and network interference. We present the application of our general theory to three canonical examples and extend our results to ratio effect measures.


[55] 2510.16745

Kernel-Based Nonparametric Tests For Shape Constraints

We propose a kernel-based nonparametric framework for mean-variance optimization that enables inference on economically motivated shape constraints in finance, including positivity, monotonicity, and convexity. Many central hypotheses in financial econometrics are naturally expressed as shape relations on latent functions (e.g., term premia, CAPM relations, and the pricing kernel), yet enforcing such constraints during estimation can mask economically meaningful violations; our approach therefore separates learning from validation by first estimating an unconstrained solution and then testing shape properties. We establish statistical properties of the regularized sample estimator and derive rigorous guarantees, including asymptotic consistency, a functional central limit theorem, and a finite-sample deviation bound achieving the Monte Carlo rate up to a regularization term. Building on these results, we construct a joint Wald-type statistic to test shape constraints on finite grids. An efficient algorithm based on a pivoted Cholesky factorization yields scalability to large datasets. Numerical studies, including an options-based asset-pricing application, illustrate the usefulness of the proposed method for evaluating monotonicity and convexity restrictions.


[56] 2510.18598

Measuring deviations from spherical symmetry

Most of the work on checking spherical symmetry assumptions on the distribution of the $p$-dimensional random vector $Y$ has its focus on statistical tests for the null hypothesis of exact spherical symmetry. In this paper, we take a different point of view and propose a measure for the deviation from spherical symmetry, which is based on the minimum distance between the distribution of the vector $\big(\|Y\|, Y/\|Y\|\big)^\top$ and its best approximation by a distribution of a vector $\big(\|Y_s\|, Y_s/\|Y_s\|\big)^\top$ corresponding to a random vector $Y_s$ with a spherical distribution. We develop estimators for the minimum distance with corresponding statistical guarantees (provided by asymptotic theory) and demonstrate the applicability of our approach by means of a simulation study and a real data example.


[57] 2511.18737

Joint learning of a network of linear dynamical systems via total variation penalization

We consider the problem of joint estimation of the parameters of $m$ linear dynamical systems, given access to single realizations of their respective trajectories, each of length $T$. The linear systems are assumed to reside on the nodes of an undirected and connected graph $G = ([m], \mathcal{E})$, and the system matrices are assumed to either vary smoothly or exhibit a small number of ``jumps'' across the edges. We consider a total variation penalized least-squares estimator and derive non-asymptotic bounds on the mean squared error (MSE) which hold with high probability. In particular, the bounds imply for certain choices of well connected $G$ that the MSE goes to zero as $m$ increases, even when $T$ is constant. The theoretical results are supported by extensive experiments on synthetic and real data.
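
A minimal sketch of a total-variation-penalized least-squares estimator of this kind is shown below, assuming the convex-optimization package cvxpy and an elementwise l1 penalty across edges; the penalty form, regularization level, and solver are illustrative choices, not necessarily those of the paper.

```python
# Minimal sketch: jointly estimate m system matrices A_i (x_{t+1} = A_i x_t + noise)
# on a graph, penalizing differences across edges with an elementwise l1 "total
# variation" term. Uses cvxpy; penalty and lambda are illustrative assumptions.
import numpy as np
import cvxpy as cp

def joint_tv_estimate(trajectories, edges, lam=1.0):
    """trajectories: list of (T_i, d) arrays, one per node; edges: list of (i, j) pairs."""
    d = trajectories[0].shape[1]
    A = [cp.Variable((d, d)) for _ in trajectories]
    fit = sum(cp.sum_squares(X[1:].T - A_i @ X[:-1].T) for A_i, X in zip(A, trajectories))
    tv = sum(cp.sum(cp.abs(A[i] - A[j])) for (i, j) in edges)
    cp.Problem(cp.Minimize(fit + lam * tv)).solve()
    return [A_i.value for A_i in A]

# Toy usage: a 3-node path graph whose nodes share (nearly) the same dynamics.
rng = np.random.default_rng(0)
A_true = 0.5 * np.eye(2)
trajectories = []
for _ in range(3):
    X = np.zeros((30, 2))
    X[0] = rng.normal(size=2)
    for t in range(29):
        X[t + 1] = A_true @ X[t] + 0.1 * rng.normal(size=2)
    trajectories.append(X)
A_hat = joint_tv_estimate(trajectories, edges=[(0, 1), (1, 2)], lam=0.5)
```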


[58] 2601.01594

Variance-Reduced Diffusion Sampling via Target Score Identity

We study variance reduction for score estimation and diffusion-based sampling in settings where the clean (target) score is available or can be approximated. Starting from the Target Score Identity (TSI), which expresses the noisy marginal score as a conditional expectation of the target score under the forward diffusion, we develop: (i) a plug-and-play nonparametric self-normalized importance sampling estimator compatible with standard reverse-time solvers, (ii) a variance-minimizing \emph{state- and time-dependent} blending rule between Tweedie-type and TSI estimators together with an anti-correlation analysis, (iii) a data-only extension based on locally fitted proxy scores, and (iv) a likelihood-tilting extension to Bayesian inverse problems. We also propose a \emph{Critic--Gate} distillation scheme that amortizes the state-dependent blending coefficient into a neural gate. Experiments on synthetic targets and PDE-governed inverse problems demonstrate improved sample quality for a fixed simulation budget.
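
To make item (i) concrete in the simplest setting: for a Gaussian forward kernel $x_t = \alpha_t x_0 + \sigma_t \varepsilon$, one form of the target score identity is $\nabla \log p_t(x_t) = \alpha_t^{-1}\,\mathbb{E}\left[\nabla \log p_0(x_0) \mid x_t\right]$, and the posterior expectation can be approximated by self-normalized importance sampling with the target as proposal and forward-kernel likelihood weights. The sketch below does this for a one-dimensional Gaussian mixture with known score and checks it against the closed-form noisy score; it illustrates only the identity, not the paper's full estimator (no Tweedie blending, no Critic-Gate).

```python
# Minimal sketch: SNIS estimate of the noisy marginal score via the target score
# identity for a 1-D Gaussian-mixture target with known score. Proposal = target;
# weights = forward-kernel likelihood N(x_t; alpha*x0, sigma^2). Illustration only.
import numpy as np

rng = np.random.default_rng(0)
means, stds, probs = np.array([-2.0, 2.0]), np.array([0.5, 0.5]), np.array([0.5, 0.5])

def target_score(x):
    # score of the mixture: sum_k r_k(x) * (mu_k - x) / s_k^2, r_k = responsibilities
    comps = probs * np.exp(-0.5 * ((x - means) / stds) ** 2) / stds
    r = comps / comps.sum()
    return float(np.sum(r * (means - x) / stds ** 2))

def snis_noisy_score(x_t, alpha, sigma, n_samples=4000):
    comp = rng.choice(2, size=n_samples, p=probs)
    x0 = rng.normal(means[comp], stds[comp])             # samples from the target (proposal)
    logw = -0.5 * ((x_t - alpha * x0) / sigma) ** 2       # forward-kernel log-likelihood
    w = np.exp(logw - logw.max())
    w /= w.sum()                                          # self-normalization
    return float(np.sum(w * np.array([target_score(v) for v in x0])) / alpha)

def exact_noisy_score(x_t, alpha, sigma):
    # the noisy marginal is again a Gaussian mixture, so its score is closed form
    m, s2 = alpha * means, (alpha * stds) ** 2 + sigma ** 2
    comps = probs * np.exp(-0.5 * (x_t - m) ** 2 / s2) / np.sqrt(s2)
    r = comps / comps.sum()
    return float(np.sum(r * (m - x_t) / s2))

alpha, sigma, x_t = 0.8, 0.6, 1.0
print(snis_noisy_score(x_t, alpha, sigma), exact_noisy_score(x_t, alpha, sigma))
```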


[59] 2601.09161

A Multilayer Probit Network Model for Community Detection with Dependent Edges and Layers

Community detection in multilayer networks, which aims to identify groups of nodes exhibiting similar connectivity patterns across multiple network layers, has attracted considerable attention in recent years. Most existing methods are based on the assumption that different layers are either independent or follow specific dependence structures, and edges within the same layer are independent. In this article, we propose a novel method for community detection in multilayer networks that accounts for a broad range of inter-layer and intra-layer dependence structures. The proposed method integrates the multilayer stochastic block model for community detection with a multivariate probit model to capture the structures of inter-layer dependence, which also allows intra-layer dependence. To facilitate parameter estimation, we develop a constrained pairwise likelihood method coupled with an efficient alternating updating algorithm. The asymptotic properties of the proposed method are also established, with a focus on examining the influence of inter-layer and intra-layer dependences on the accuracy of both parameter estimation and community detection. The theoretical results are supported by extensive numerical experiments on both simulated networks and a real-world multilayer trade network.


[60] 2601.10759

Mass Distribution versus Density Distribution in the Context of Clustering

This paper investigates two fundamental descriptors of data, i.e., density distribution versus mass distribution, in the context of clustering. Density distribution has been the de facto descriptor of data distribution since the introduction of statistics. We show that density distribution has a fundamental limitation, namely high-density bias, irrespective of the algorithms used to perform clustering. Existing density-based clustering algorithms have employed different algorithmic means to counter the effect of the high-density bias with some success, but the fundamental limitation of using density distribution remains an obstacle to discovering clusters of arbitrary shapes, sizes and densities. Using the mass distribution as a better foundation, we propose a new algorithm which maximizes the total mass of all clusters, called mass-maximization clustering (MMC). The algorithm can be easily changed to maximize the total density of all clusters in order to examine the fundamental limitation of using density distribution versus mass distribution. The key advantage of MMC over density-maximization clustering is that the maximization is conducted without a bias towards dense clusters.


[61] 2601.11744

On Nonasymptotic Confidence Intervals for Treatment Effects in Randomized Experiments

We study nonasymptotic (finite-sample) confidence intervals for treatment effects in randomized experiments. In the existing literature, the effective sample sizes of nonasymptotic confidence intervals tend to be looser than the corresponding central-limit-theorem-based confidence intervals by a factor depending on the square root of the propensity score. We show that this performance gap can be closed, designing nonasymptotic confidence intervals that have the same effective sample size as their asymptotic counterparts. Our approach involves systematic exploitation of negative dependence or variance adaptivity (or both). We also show that the nonasymptotic rates that we achieve are unimprovable in an information-theoretic sense.


[62] 2402.08941

Local-Polynomial Estimation for Multivariate Regression Discontinuity Designs

We study a multivariate regression discontinuity design in which treatment is assigned by crossing a boundary in the space of multiple running variables. We document that the existing bandwidth selector is suboptimal for a multivariate regression discontinuity design when the distance to a boundary point is used as its running variable, and introduce a multivariate local-linear estimator for multivariate regression discontinuity designs. Our estimator is asymptotically valid and can capture heterogeneous treatment effects over the boundary. We demonstrate in numerical simulations that our estimator exhibits smaller root mean squared errors and often shorter confidence intervals. In empirical applications to multivariate designs from a Colombian scholarship study and a U.S. House of Representatives voting study, our estimator reveals richer heterogeneous treatment effects, often with shorter confidence intervals than the existing estimator.
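
As a generic illustration of a local-linear boundary-point estimator (a sketch under assumed kernel and bandwidth choices, not the paper's estimator or bandwidth selector), the snippet below kernel-weights observations by their distance to a boundary point, runs a weighted regression on a treatment indicator, the centered running variables and their interactions, and reads off the treatment coefficient.

```python
# Minimal sketch: local-linear estimate of the treatment effect at a boundary point b
# in a two-dimensional RDD, with a triangular kernel in the distance to b and a fixed
# bandwidth (both illustrative assumptions).
import numpy as np

def local_linear_rdd(y, X, treated, b, h):
    """y: (n,) outcomes; X: (n, p) running variables; treated: (n,) 0/1; b: (p,) boundary point."""
    Xc = X - b
    dist = np.linalg.norm(Xc, axis=1)
    w = np.clip(1.0 - dist / h, 0.0, None)               # triangular kernel weights
    keep = w > 0
    D = treated[keep].astype(float)[:, None]
    Z = np.hstack([np.ones((keep.sum(), 1)), D, Xc[keep], D * Xc[keep]])
    ZtW = Z.T * w[keep]
    beta = np.linalg.solve(ZtW @ Z, ZtW @ y[keep])        # weighted least squares
    return beta[1]                                        # coefficient on the treatment indicator

# Toy usage: treatment assigned when x1 + x2 > 0, constant effect of 2 at b = (0, 0).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
treated = (X.sum(axis=1) > 0).astype(int)
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 2.0 * treated + 0.2 * rng.normal(size=2000)
print(local_linear_rdd(y, X, treated, b=np.array([0.0, 0.0]), h=0.5))
```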


[63] 2407.11735

ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

In open-set semi-supervised learning (OSSL), we consider unlabeled datasets that may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD). Additionally, many works for OSSL rely on ad-hoc thresholds for ID/OOD classification, without considering the statistics of the problem. We propose a new score for ID/OOD classification based on angles in feature space between data and an ID subspace. Moreover, we propose an approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD. These components are put together in a framework for OSSL, termed ProSub, that is experimentally shown to reach SOTA performance on several benchmark problems. Our code is available at this https URL.


[64] 2408.17078

Aliasing Effects for Samples of Spin Random Fields on the Sphere

This paper investigates aliasing effects emerging from the reconstruction from discrete samples of spin spherical random fields defined on the two-dimensional sphere. We determine the location in the frequency domain and the intensity of the aliases of the harmonic coefficients in the Fourier decomposition of the spin random field and evaluate the consequences of aliasing errors in the angular power spectrum when the samples of the random field are obtained by using some very popular sampling procedures on the sphere, the equiangular and the Gauss-Jacobi sampling schemes. Finally, we demonstrate that band-limited spin random fields are free from aliases, provided that a sufficiently large number of nodes is used in the selected quadrature rule.


[65] 2410.15652

Sparse Hanson-Wright Inequalities with Applications

We derive new Hanson-Wright-type inequalities tailored to the quadratic forms of random vectors with sparse independent components. Specifically, we consider cases where the components of the random vector are sparse $\alpha$-subexponential random variables with $\alpha>0$. When $\alpha=\infty$, these inequalities can be seen as quadratic generalizations of the classical Bernstein and Bennett inequalities for sparse bounded random vectors. To establish this quadratic generalization, we also develop new Bernstein-type and Bennett-type inequalities for linear forms of sparse $\alpha$-subexponential random variables that go beyond the bounded case $(\alpha=\infty)$. Our proof relies on a novel combinatorial method for estimating the moments of both random linear forms and quadratic forms. We present two key applications of these new sparse Hanson-Wright inequalities: (1) A local law and complete eigenvector delocalization for sparse $\alpha$-subexponential Hermitian random matrices, generalizing the result of He et al. (2019) beyond sparse Bernoulli random matrices. To the best of our knowledge, this is the first local law and complete delocalization result for sparse $\alpha$-subexponential random matrices down to the near-optimal sparsity $p\geq \frac{\mathrm{polylog}(n)}{n}$ when $\alpha\in (0,2)$ as well as for unbounded sparse sub-gaussian random matrices down to the optimal sparsity $p\gtrsim \frac{\log n}{n}.$ (2) Concentration of the Euclidean norm for the linear transformation of a sparse $\alpha$-subexponential random vector, improving on the results of Götze et al. (2021) for sparse sub-exponential random vectors.


[66] 2505.15638

Bayesian Ensembling: Insights from Online Optimization and Empirical Bayes

We revisit the classical problem of Bayesian ensembles and address the challenge of learning optimal combinations of Bayesian models in an online, continual learning setting. To this end, we reinterpret existing approaches such as Bayesian model averaging (BMA) and Bayesian stacking through a novel empirical Bayes lens, shedding new light on the limitations and pathologies of BMA. Further motivated by insights from online optimization, we propose Online Bayesian Stacking (OBS), a method that optimizes the log-score over predictive distributions to adaptively combine Bayesian models. A key contribution of our work is establishing a novel connection between OBS and portfolio selection, bridging Bayesian ensemble learning with a rich, well-studied theoretical framework that offers efficient algorithms and extensive regret analysis. We further clarify the relationship between OBS and online BMA, showing that they optimize related but distinct cost functions. Through theoretical analysis and empirical evaluation, we identify scenarios where OBS outperforms online BMA and provide principled methods and guidance on when practitioners should prefer one approach over the other.
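
One way to picture an online stacking update in the portfolio-selection spirit is an exponentiated-gradient step on the log-score of the weighted predictive mixture; the sketch below uses this standard update with a fixed step size, an illustrative choice that is not necessarily the OBS algorithm of the paper.

```python
# Minimal sketch: online stacking of predictive densities via an exponentiated-
# gradient update on the log-score, mirroring online portfolio selection.
# The fixed step size eta is an illustrative assumption.
import numpy as np

def online_stacking(pred_densities, eta=0.1):
    """pred_densities: (T, K) array; entry [t, k] is model k's predictive density of y_t,
    computed before y_t is revealed. Returns the trajectory of weights, shape (T + 1, K)."""
    T, K = pred_densities.shape
    w = np.full(K, 1.0 / K)
    history = [w.copy()]
    for t in range(T):
        p = pred_densities[t]
        grad = p / (w @ p)                 # gradient of log(w . p) with respect to w
        w = w * np.exp(eta * grad)
        w /= w.sum()                       # multiplicative update stays on the simplex
        history.append(w.copy())
    return np.array(history)

# Toy usage: model 1 is correctly centered, so its weight should grow over time.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=300)
dens = np.column_stack([
    np.exp(-0.5 * (y - 0.0) ** 2) / np.sqrt(2 * np.pi),   # model 0: N(0, 1) predictive
    np.exp(-0.5 * (y - 1.0) ** 2) / np.sqrt(2 * np.pi),   # model 1: N(1, 1) predictive
])
print(online_stacking(dens)[-1])
```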


[67] 2508.08450

Differentiable Cyclic Causal Discovery Under Unmeasured Confounders

Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we provide consistency guarantees for our framework, reinforcing its theoretical soundness.


[68] 2509.25519

Flow Matching with Semidiscrete Couplings

Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE. These models are often trained using flow matching, i.e. by sampling random pairs of noise and target points $(\mathbf{x}_0,\mathbf{x}_1)$ and ensuring that the velocity field is aligned, on average, with $\mathbf{x}_1-\mathbf{x}_0$ when evaluated along a segment linking $\mathbf{x}_0$ to $\mathbf{x}_1$. While these pairs are sampled independently by default, they can also be selected more carefully by matching batches of $n$ noise to $n$ target points using an optimal transport (OT) solver. Although promising in theory, the OT flow matching (OT-FM) approach is not widely used in practice. Zhang et al. (2025) recently pointed out that OT-FM truly starts paying off when the batch size $n$ grows significantly, which only a multi-GPU implementation of the Sinkhorn algorithm can handle. Unfortunately, the costs of running Sinkhorn can quickly balloon, requiring $O(n^2/\varepsilon^2)$ operations for every $n$ pairs used to fit the velocity field, where $\varepsilon$ is a regularization parameter that should typically be small to yield better results. To fulfill the theoretical promises of OT-FM, we propose to move away from batch-OT and rely instead on a semidiscrete OT (SD-OT) formulation that leverages the fact that the target dataset distribution is usually of finite size $N$. The SD-OT problem is solved by estimating a dual potential vector using SGD; using that vector, freshly sampled noise vectors at train time can then be matched with data points at the cost of a maximum inner product search (MIPS). Semidiscrete FM (SD-FM) removes the quadratic dependency on $n/\varepsilon$ that bottlenecks OT-FM. SD-FM beats both FM and OT-FM on all training metrics and inference budget constraints, across multiple datasets, on unconditional/conditional generation, or when using mean-flow models.
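
A minimal sketch of the semidiscrete matching step, with the usual quadratic cost and a brute-force search standing in for a real MIPS index, is given below: SGD on the unregularized semidiscrete dual produces a potential vector over the $N$ data points, and fresh noise is then matched to data by maximizing an inner product over augmented vectors before forming the usual flow-matching target $\mathbf{x}_1-\mathbf{x}_0$. The toy data, step sizes, and iteration counts are assumptions; this is not the paper's training pipeline.

```python
# Minimal sketch: semidiscrete OT matching for flow matching. Fit a dual potential
# over the N data points by SGD on the (unregularized) semidiscrete dual with cost
# c(x, y) = ||x - y||^2 / 2, then match fresh noise to data via a maximum inner
# product search over augmented vectors. Exhaustive search stands in for a MIPS index.
import numpy as np

rng = np.random.default_rng(0)
N, d = 512, 2
data = rng.normal(size=(N, d)) + np.array([4.0, 0.0])    # toy target dataset x_1 ~ p_data

# 1) SGD on the semidiscrete dual: g <- g + lr * (1/N - onehot(argmin_j c(x0, y_j) - g_j)).
g = np.zeros(N)
sq = 0.5 * np.sum(data ** 2, axis=1)
for step in range(20000):
    x0 = rng.normal(size=d)                               # fresh noise sample
    j = np.argmax(x0 @ data.T - sq + g)                   # = argmin_j c(x0, y_j) - g_j
    lr = 1.0 / np.sqrt(step + 1)
    g += lr / N
    g[j] -= lr

# 2) Train-time matching: MIPS over augmented vectors, then the flow-matching target.
aug_data = np.hstack([data, (g - sq)[:, None]])           # [y_j, g_j - ||y_j||^2 / 2]
x0_batch = rng.normal(size=(64, d))
aug_noise = np.hstack([x0_batch, np.ones((64, 1))])       # [x0, 1]
idx = np.argmax(aug_noise @ aug_data.T, axis=1)           # exhaustive stand-in for MIPS
x1_batch = data[idx]
t = rng.uniform(size=(64, 1))
x_t = (1 - t) * x0_batch + t * x1_batch                   # point on the noise-to-data segment
target_velocity = x1_batch - x0_batch                     # regression target for the velocity field
```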


[69] 2510.21935

AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside domain expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.


[70] 2511.04350

On the relationship between MESP and 0/1 D-Opt and their upper bounds

We establish strong connections between two fundamental nonlinear 0/1 optimization problems coming from the area of experimental design, namely maximum entropy sampling and 0/1 D-Optimality. The connections are based on maps between instances, and we analyze the behavior of these maps. Using these maps, we transport basic upper-bounding methods between these two problems, and we are able to establish new domination results and other inequalities relating various basic upper bounds. Further, we establish results relating how different branch-and-bound schemes based on these maps compare. Additionally, we observe some surprising numerical results, where bounding methods that did not seem promising in their direct application to real-data MESP instances, are now useful for MESP instances that come from 0/1 D-Optimality.


[71] 2601.00973

Learned Hemodynamic Coupling Inference in Resting-State Functional MRI

Functional magnetic resonance imaging (fMRI) provides an indirect measurement of neuronal activity via hemodynamic responses that vary across brain regions and individuals. Ignoring this hemodynamic variability can bias downstream connectivity estimates. Furthermore, the hemodynamic parameters themselves may serve as important imaging biomarkers. Estimating spatially varying hemodynamics from resting-state fMRI (rsfMRI) is therefore an important but challenging blind inverse problem, since both the latent neural activity and the hemodynamic coupling are unknown. In this work, we propose a methodology for inferring hemodynamic coupling on the cortical surface from rsfMRI. Our approach avoids the highly unstable joint recovery of neural activity and hemodynamics by marginalizing out the latent neural signal and basing inference on the resulting marginal likelihood. To enable scalable, high-resolution estimation, we employ a deep neural network combined with conditional normalizing flows to accurately approximate this intractable marginal likelihood, while enforcing spatial coherence through priors defined on the cortical surface that admit sparse representations. Uncertainty in the hemodynamic estimates is quantified via a double-bootstrap procedure. The proposed approach is extensively validated using synthetic data and real fMRI datasets, demonstrating clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.