2025-01-09 | Total: 36
Algorithmic bias is a pressing concern in educational data mining (EDM), as it risks amplifying inequities in learning outcomes. The Area Between ROC Curves (ABROCA) metric is frequently used to measure discrepancies in model performance across demographic groups to quantify overall model fairness. However, its skewed distribution--especially when class or group imbalances exist--makes significance testing challenging. This study investigates ABROCA's distributional properties and contributes robust methods for its significance testing. Specifically, we address (1) whether ABROCA follows any known distribution, (2) how to reliably test for algorithmic bias using ABROCA, and (3) the statistical power achievable with ABROCA-based bias assessments under typical EDM sample specifications. Simulation results confirm that ABROCA does not match standard distributions, including those suited to accommodate skewness. We propose nonparametric randomization tests for ABROCA and demonstrate that reliably detecting bias with ABROCA requires large sample sizes or substantial effect sizes, particularly in imbalanced settings. Findings suggest that ABROCA-based bias evaluation based on sample sizes common in EDM tends to be underpowered, undermining the reliability of conclusions about model fairness. By offering open-source code to simulate power and statistically test ABROCA, this paper aims to foster more reliable statistical testing in EDM research. It supports broader efforts toward replicability and equity in educational modeling.
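To make the idea of a randomization test for ABROCA concrete, here is a minimal sketch in plain Python. It assumes two groups coded 0 and 1, binary labels, and a null distribution built by shuffling group membership; the function names, the grid-based integration, and the shuffling scheme are illustrative assumptions, not the paper's released code.

```python
import random

def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) points; tied scores advance the curve in one step."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("each group needs at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def tpr_at(points, fpr):
    """Linearly interpolate the TPR at a given FPR, skipping vertical segments."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 > x0 and x0 <= fpr <= x1:
            return y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
    return points[-1][1]

def abroca(scores, labels, groups, grid=101):
    """Approximate area between the two group ROC curves (trapezoid rule on a grid)."""
    curves = []
    for g in (0, 1):
        s = [x for x, k in zip(scores, groups) if k == g]
        y = [x for x, k in zip(labels, groups) if k == g]
        curves.append(roc_points(s, y))
    xs = [i / (grid - 1) for i in range(grid)]
    gaps = [abs(tpr_at(curves[0], x) - tpr_at(curves[1], x)) for x in xs]
    return sum((gaps[i] + gaps[i + 1]) / 2 for i in range(grid - 1)) / (grid - 1)

def permutation_test(scores, labels, groups, n_perm=200, seed=0):
    """Assumed null: group membership is exchangeable, so shuffle the group labels."""
    rng = random.Random(seed)
    observed = abroca(scores, labels, groups)
    shuffled = list(groups)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abroca(scores, labels, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + exceed) / (1 + n_perm)
```

Because the p-value comes from the empirical permutation distribution, this sketch needs no assumption about ABROCA's skewed sampling distribution; the cost is one ABROCA evaluation per shuffle.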
We introduce a new multimodal optimization approach called Natural Variational Annealing (NVA) that combines the strengths of three foundational concepts to simultaneously search for multiple global and local modes of black-box nonconvex objectives. First, it implements a simultaneous search by using variational posteriors, such as mixtures of Gaussians. Second, it applies annealing to gradually trade off exploration for exploitation. Finally, it learns the variational search distribution using natural-gradient learning, where updates resemble well-known and easy-to-implement algorithms. The three concepts come together in NVA, giving rise to new algorithms and also allowing us to incorporate "fitness shaping", a core concept from evolutionary algorithms. We assess the quality of search on simulations and compare them to methods using gradient descent and evolution strategies. We also provide an application to a real-world inverse problem in planetary science.
Our objective is to construct well-calibrated prediction sets for a time-to-event outcome subject to right-censoring with guaranteed coverage. Our approach is inspired by modern conformal inference literature, in that, unlike classical frameworks, we obviate the need for a well-specified parametric or semi-parametric survival model to accomplish our goal. In contrast to existing conformal prediction methods for survival data, which restrict censoring to be of Type I, whereby potential censoring times are assumed to be fully observed on all units in both training and validation samples, we consider the more common right-censoring setting in which either only the censoring time or only the event time of primary interest is directly observed, whichever comes first. Under a standard conditional independence assumption between the potential survival and censoring times given covariates, we propose and analyze two methods to construct valid and efficient lower predictive bounds for the survival time of a future observation. The proposed methods build upon modern semiparametric efficiency theory for censored data, in that the first approach incorporates inverse-probability-of-censoring weighting (IPCW), while the second approach is based on augmented-inverse-probability-of-censoring weighting (AIPCW). For both methods, we formally establish asymptotic coverage guarantees, and demonstrate both via theory and empirical experiments that AIPCW substantially improves efficiency over IPCW in the sense that its coverage error bound is of second-order mixed bias type, that is, *doubly robust*, and therefore guaranteed to be asymptotically negligible relative to the coverage error of IPCW.
Variance based global sensitivity analysis measures the relevance of inputs to a single output using Sobol' indices. This paper extends the definition in a natural way to multiple outputs, directly measuring the relevance of inputs to the linkages between outputs in a correlation-like matrix of indices. The usual Sobol' indices constitute the diagonal of this matrix. Existence, uniqueness and uncertainty quantification are established by developing the indices from a putative multi-output model with quantified uncertainty. Sobol' matrices and their standard errors are related to the moments of the multi-output model, to enable calculation. These are benchmarked numerically against test functions (with added noise) whose Sobol' matrices are calculated analytically.
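For orientation, the usual single-output first-order Sobol' indices (the diagonal of the matrix described above) can be estimated by a pick-freeze Monte Carlo scheme. The sketch below assumes independent inputs uniform on [0, 1] and uses a centered pick-freeze estimator; it does not implement the multi-output matrix or the uncertainty quantification developed in the paper, and the function name is hypothetical.

```python
import random

def first_order_sobol(f, dim, n=20000, seed=0):
    """Pick-freeze estimate of first-order Sobol' indices for f on [0, 1]^dim."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fA = [f(x) for x in A]
    fB = [f(x) for x in B]
    mean = sum(fA + fB) / (2 * n)
    var = sum((y - mean) ** 2 for y in fA + fB) / (2 * n - 1)
    indices = []
    for i in range(dim):
        # Evaluate f on A with coordinate i replaced by the value from B.
        fABi = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Centering f(B) leaves the expectation unchanged and lowers the variance.
        v_i = sum((yb - mean) * (yab - ya) for yb, yab, ya in zip(fB, fABi, fA)) / n
        indices.append(v_i / var)
    return indices
```

For the additive test function f(x) = x1 + 2·x2 the exact indices are 1/5 and 4/5, which gives a quick sanity check of the estimator.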
Spatially constrained clustering is an important field of research, particularly when it involves changes over time. Partitioning a map is not simple since there is a vast number of possible partitions within the search space. In spatio-temporal clustering, this task becomes even more difficult, as we must consider sequences of partitions. Motivated by these challenges, we introduce a Bayesian model for time-dependent sequences of spatial random partitions by proposing a prior distribution based on product partition models that correlates partitions. Additionally, we employ random spanning trees to facilitate the exploration of the partition search space and to guarantee spatially constrained clustering. This work is motivated by a relevant applied problem: identifying spatial and temporal patterns of mosquito-borne diseases. Given the overdispersion present in this type of data, we introduce a spatio-temporal Poisson mixture model in which mean and dispersion parameters vary according to spatio-temporal covariates. The proposed model is applied to analyze the number of dengue cases reported weekly from 2018 to 2023 in the Southeast region of Brazil. We also evaluate model performance using simulated data. Overall, the proposed model has proven to be a competitive approach for analyzing the temporal evolution of spatial clustering.
In interventional health studies, causal mediation analysis can be employed to investigate mechanisms through which the intervention affects the targeted health outcome. Identifying direct and indirect (i.e. mediated) effects from empirical data, however, becomes complicated if the mediator-outcome association is confounded by a variable itself affected by the treatment. Here, we investigate identification of mediational effects under such post-treatment confounding in a setting with a longitudinal mediator, time-to-event outcome and a trichotomous ordinal treatment-dependent confounder. We show that if the intervention always affects the treatment-dependent confounder only in one direction (monotonicity), the mediational effects are identified up to a sensitivity parameter and derive their empirical non-parametric expressions. The monotonicity assumption can be assessed from empirical data, based on restrictions on the conditional distribution of the treatment-dependent confounder. We avoid pitfalls related to post-treatment conditioning by treating the mediator as a functional entity and defining the time-to-event outcome as a restricted disease-free time. In an empirical analysis, we use data from the Finnish Diabetes Prevention Study to assess the extent to which the effect of a lifestyle intervention on avoiding type 2 diabetes is mediated through weight reduction in a high-risk population, with other health-related changes acting as treatment-dependent confounders.
In this study, we introduce the Spherical Double K-Means (SDKM) clustering method for text data, a novel approach for simultaneous clustering of terms and documents. Using the strengths of k-means, double k-means, and spherical k-means, SDKM addresses the challenges of high dimensionality, noise, and sparsity inherent in text analysis. We address the choice of the number of clusters, both for the words and documents, using the cluster validity index pseudo-F, and verify the reliability of the method through simulation studies. We apply SDKM to the corpus of US presidential inaugural addresses, spanning from George Washington in 1789 to Joe Biden in 2021. Our analysis reveals distinct clusters of words and documents that correspond to significant historical themes and periods, showcasing the method's ability to facilitate a deeper understanding of the data. Our findings demonstrate the efficacy of SDKM in uncovering underlying patterns in textual data.
The double sparse linear model, which has both group-wise and element-wise sparsity in regression coefficients, has attracted considerable attention recently. This paper establishes the sufficient and necessary relationship between exact support recovery and the optimal minimum signal conditions in the double sparse model. Specifically, under the proposed signal conditions, which are sharp, a two-stage double sparse iterative hard thresholding procedure achieves exact support recovery with a suitably chosen threshold parameter. This procedure also maintains asymptotic normality, matching that of an OLS estimator given the true support, and hence enjoys the oracle properties. Conversely, we prove that no method can achieve exact support recovery if these signal conditions are violated. This fills a critical gap in the minimax optimality theory on support recovery of the double sparse model. Finally, numerical experiments are provided to support our theoretical findings.
We consider the problem of detecting a change point in a sequence of mean functions from a functional time series. We propose an L1 norm based methodology and establish its theoretical validity both for classical and for relevant hypotheses. We compare the proposed method with currently available methodology that is based on the L2 and supremum norms. Additionally, we investigate the asymptotic behaviour under the alternative for all three methods and showcase both theoretically and empirically that the L1 norm achieves the best performance in a broad range of scenarios. We also propose a power enhancement component that improves the performance of the L1 test against sparse alternatives. Finally, we apply the proposed methodology to both synthetic and real data.
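As a rough illustration of an L1-norm change-point statistic for functional data, the sketch below computes a CUSUM process on curves sampled on a common equispaced grid over [0, 1] and takes the L1 norm over the grid at each candidate split. The normalization, the Riemann-sum integration, and the argmax estimator are assumptions made for this example; the paper's exact statistic, critical values, and power enhancement component are not reproduced here.

```python
def l1_cusum(curves):
    """Return (k_hat, statistic) for curves given as equal-length lists of grid values."""
    n, m = len(curves), len(curves[0])
    total = [sum(c[j] for c in curves) for j in range(m)]
    partial = [0.0] * m
    best_k, best_stat = 0, 0.0
    for k in range(1, n):
        # Running sum of the first k curves at each grid point.
        partial = [p + x for p, x in zip(partial, curves[k - 1])]
        # L1 norm of the CUSUM function, approximated by a Riemann sum on the grid.
        l1 = sum(abs(p - k / n * t) for p, t in zip(partial, total)) / m
        stat = l1 / n ** 0.5
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat
```

With a noise-free mean shift after the 30th of 60 curves, the maximizer is the split at k = 30, which is an easy way to check the indexing.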
We introduce ART, a distribution-free and model-agnostic framework for changepoint detection that provides finite-sample guarantees. ART transforms independent observations into real-valued scores via a symmetric function, ensuring exchangeability in the absence of changepoints. These scores are then ranked and aggregated to detect distributional changes. The resulting test offers exact Type-I error control, agnostic to specific distributional or model assumptions. Moreover, ART seamlessly extends to multi-scale settings, enabling robust multiple changepoint estimation and post-detection inference with finite-sample error rate control. By locally ranking the scores and performing aggregations across multiple prespecified intervals, ART identifies changepoint intervals and refines subsequent inference while maintaining its distribution-free and model-agnostic nature. This adaptability makes ART a reliable and versatile tool for modern changepoint analysis, particularly in high-dimensional data contexts and applications leveraging machine learning methods.
It is established that the linear spectral statistics (LSS) of the smoothed periodogram estimate of the spectral coherence matrix of a complex Gaussian high-dimensional time series $(y_n)_{n \in \mathbb{Z}}$ with independent components satisfy at each frequency a central limit theorem in the asymptotic regime where the sample size $N$, the dimension $M$ of the observation, and the smoothing span $B$ all converge towards $+\infty$ in such a way that $M = O(N^{\alpha})$ for $\alpha < 1$ and $M/B \to c$, $c \in (0, 1)$. It is deduced that two recentered and renormalized versions of the LSS, one based on an average in the frequency domain and the other one based on a sum of squares also in the frequency domain, and both evaluated over a well-chosen frequency grid, also verify a central limit theorem. These two statistics are proposed to test with controlled asymptotic level the hypothesis that the components of $y$ are independent. Numerical simulations assess the performance of the two tests.
Shape constraints offer compelling advantages in nonparametric regression by enabling the estimation of regression functions under realistic assumptions, devoid of tuning parameters. However, most existing shape-constrained nonparametric regression methods, except additive models, impose too few restrictions on the regression functions. This often leads to suboptimal performance, such as overfitting, in multivariate contexts due to the curse of dimensionality. On the other hand, additive shape-constrained models are sometimes too restrictive because they fail to capture interactions among the covariates. In this paper, we introduce a novel approach for multivariate shape-constrained nonparametric regression, which allows interactions without suffering from the curse of dimensionality. Our approach is based on the notion of total concavity originally due to T. Popoviciu and recently described in Gal [24]. We discuss the characterization and computation of the least squares estimator over the class of totally concave functions and derive rates of convergence under standard assumptions. The rates of convergence depend on the number of covariates only logarithmically, and the estimator, therefore, is guaranteed to avoid the curse of dimensionality to some extent. We demonstrate that total concavity can be justified for many real-world examples and validate the efficacy of our approach through empirical studies on various real-world datasets.
We introduce an interpretable deep learning model for multivariate time series forecasting that prioritizes both predictive performance and interpretability - key requirements for understanding complex physical phenomena. Our model not only matches but often surpasses existing interpretability methods, achieving this without compromising accuracy. Through extensive experiments, we demonstrate its ability to identify the most relevant time series and lags that contribute to forecasting future values, providing intuitive and transparent explanations for its predictions. To minimize the need for manual supervision, the model is designed so one can robustly determine the optimal window size that captures all necessary interactions within the smallest possible time frame. Additionally, it effectively identifies the optimal model order, balancing complexity when incorporating higher-order terms. These advancements hold significant implications for modeling and understanding dynamic systems, making the model a valuable tool for applied and computational physicists.
Regression splines are widely used to investigate and predict data behavior, attracting the interest of mathematicians for their beautiful numerical properties, and of statisticians for their versatility with respect to the applications. Several penalized spline regression models are available in the literature, and the most commonly used ones in real-world applications are P-splines, which enjoy the advantages of penalized models while being easy to generalize across different functional spaces and higher degree order, because of their discrete penalty term. To face the different requirements imposed by the nature of the problem or the physical meaning of the expected values, the P-spline definition is often modified by additional hypotheses, often translated into constraints on the solution or its derivatives. In this framework, our work is motivated by the aim of getting approximation models that fall within pre-established thresholds. Specifically, starting from a set of observed data, we consider a P-spline constrained between some prefixed bounds. In this paper, we consider only 0 as the lower bound, although our approach applies to more general cases. We propose to obtain nonnegativity by imposing lower bounds on selected sample points. The spline can be computed through a sequence of linearly constrained problems. We suggest a strategy to dynamically select the sample points, to avoid extremely dense sampling and thereby reduce the computational burden as much as possible. We show through some computational experiments the reliability of our approach and the accuracy of the results compared to some state-of-the-art models.
Advancements in artificial intelligence (AI) and deep learning have led to neural networks being used to generate lightning-speed answers to complex questions, to paint like Monet, or to write like Proust. Leveraging their computational speed and flexibility, neural networks are also being used to facilitate fast, likelihood-free statistical inference. However, it is not straightforward to use neural networks with data that for various reasons are incomplete, which precludes their use in many applications. A recently proposed approach to remedy this issue inputs an appropriately padded data vector and a vector that encodes the missingness pattern to a neural network. While computationally efficient, this "masking" approach can result in statistically inefficient inferences. Here, we propose an alternative approach that is based on the Monte Carlo expectation-maximization (EM) algorithm. Our EM approach is likelihood-free, substantially faster than the conventional EM algorithm as it does not require numerical optimization at each iteration, and more statistically efficient than the masking approach. This research represents a prototype problem that indicates how improvements could be made in AI by introducing Bayesian statistical thinking. We compare the two approaches to missingness using simulated incomplete data from two models: a spatial Gaussian process model, and a spatial Potts model. The utility of the methodology is shown on Arctic sea-ice data and cryptocurrency data.
Understanding the expressive ability of a specific model is essential for grasping its capacity limitations. Recently, several studies have established circuit complexity bounds for the Transformer architecture. Besides, the Visual AutoRegressive (VAR) model has risen to be a prominent method in the field of image generation, outperforming previous techniques, such as Diffusion Transformers, in generating high-quality images. We investigate the circuit complexity of the VAR model and establish a bound in this study. Our primary result demonstrates that the VAR model is equivalent to a simulation by a uniform $\mathsf{TC}^0$ threshold circuit with hidden dimension $d \le O(n)$ and $\mathrm{poly}(n)$ precision. This is the first study to rigorously highlight the limitations in the expressive power of VAR models despite their impressive performance. We believe our findings will offer valuable insights into the inherent constraints of these models and guide the development of more efficient and expressive architectures in the future.
In data analysis, unexpected results often prompt researchers to revisit their procedures to identify potential issues. While some researchers may struggle to identify the root causes, experienced researchers can often quickly diagnose problems by checking a few key assumptions. These checked assumptions, or expectations, are typically informal, difficult to trace, and rarely discussed in publications. In this paper, we introduce the term *analysis validation checks* to formalize and externalize these informal assumptions. We then introduce a procedure to identify a subset of checks that best predict the occurrence of unexpected outcomes, based on simulations of the original data. The checks are evaluated in terms of accuracy, determined by binary classification metrics, and independence, which measures the shared information among checks. We demonstrate this approach with a toy example using step count data and a generalized linear model example examining the effect of particulate matter air pollution on daily mortality.
We consider the problem of weight uncertainty proposed by [Blundell et al. (2015). Weight uncertainty in neural network. In International conference on machine learning, 1613-1622, PMLR.] in neural networks (NNs) specialized for regression tasks. We further investigate the effect of variance uncertainty in their model. We show that including the variance uncertainty can improve the prediction performance of the Bayesian NN. Variance uncertainty enhances the generalization of the model by considering the posterior distribution over the variance parameter. We examine the generalization ability of the proposed model using a function approximation example and further illustrate it with the riboflavin genetic data set. We explore fully connected dense networks and dropout NNs with Gaussian and spike-and-slab priors, respectively, for the network weights.
We consider an interacting system of particles with values in $\mathbb{R}^d \times \mathbb{R}^d$, governed by transport and diffusion on the first component, that may serve as a representative model for kinetic models with a degenerate component. In a first part, we control the fluctuations of the empirical measure of the system around the solution of the corresponding Vlasov-Fokker-Planck equation by proving a Bernstein concentration inequality, extending a previous result of arXiv:2011.03762 in several directions. In a second part, we study the nonparametric statistical estimation of the classical solution of the Vlasov-Fokker-Planck equation from the observation of the empirical measure, prove an oracle inequality using the Goldenshluger-Lepski methodology, and obtain minimax optimality. We then specialise to the FitzHugh-Nagumo model for populations of neurons. We consider a version of the model proposed in Mischler et al. arXiv:1503.00492 and optimally estimate the 6 parameters of the model by moment estimators.
While relations among individuals make an important part of data with scientific and business interests, existing statistical modeling of relational data has mainly been focusing on dyadic relations, i.e., those between two individuals. This article addresses the less studied, though commonly encountered, polyadic relations that can involve more than two individuals. In particular, we propose a new latent space model for hypergraphs using determinantal point processes, which is driven by the diversity within hyperedges and each node's popularity. This model mechanism is in contrast to existing hypergraph models, which are predominantly driven by similarity rather than diversity. Additionally, the proposed model accommodates broad types of hypergraphs, with no restriction on the cardinality and multiplicity of hyperedges, which previous models often have. Consistency and asymptotic normality of the maximum likelihood estimates of the model parameters have been established. The proof is challenging, owing to the special configuration of the parameter space. Further, we apply the projected accelerated gradient descent algorithm to obtain the parameter estimates, and we show its effectiveness in simulation studies. We also demonstrate an application of the proposed model on the What's Cooking data and present the embedding of food ingredients learned from cooking recipes using the model.
Modern artificial intelligence is supported by machine learning models (e.g., foundation models) that are pretrained on a massive data corpus and then adapted to solve a variety of downstream tasks. To summarize performance across multiple tasks, evaluation metrics are often aggregated into a summary metric, e.g., average accuracy across 10 question-answering tasks. When aggregating evaluation metrics, it is useful to incorporate uncertainty in the aggregate metric in order to gain a more realistic understanding of model performance. Our objective in this work is to demonstrate how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. The methods we emphasize are bootstrapping, Bayesian hierarchical (i.e., multilevel) modeling, and the visualization of task weightings that consider standard errors. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance. We use a popular ML benchmark, the Visual Task Adaptation Benchmark (VTAB), to demonstrate the usefulness of our approaches.
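One simple way to attach uncertainty to an aggregate such as average accuracy across tasks is a nonparametric bootstrap over test examples within each task. The sketch below assumes per-example 0/1 correctness indicators are available for every task, weights tasks equally, and reports a percentile interval; the function name and these choices are illustrative, not the authors' implementation.

```python
import random

def bootstrap_mean_accuracy(task_results, n_boot=1000, alpha=0.05, seed=0):
    """task_results: list of per-task lists of 0/1 correctness indicators."""
    rng = random.Random(seed)

    def aggregate(tasks):
        # Equal-weight average of per-task accuracies.
        return sum(sum(t) / len(t) for t in tasks) / len(tasks)

    point = aggregate(task_results)
    stats = []
    for _ in range(n_boot):
        # Resample examples with replacement, independently within each task.
        resampled = [rng.choices(t, k=len(t)) for t in task_results]
        stats.append(aggregate(resampled))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi
```

Resampling within tasks treats the task set as fixed; if the tasks themselves are viewed as a sample from a larger population, a hierarchical model or an additional resampling level over tasks would be the natural extension.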
Clinical trials often collect data on multiple outcomes, such as overall survival (OS), progression-free survival (PFS), and response to treatment (RT). In most cases, however, study designs only use primary outcome data for interim and final decision-making. In several disease settings, clinically relevant outcomes, for example OS, become available years after patient enrollment. Moreover, the effects of experimental treatments on OS might be less pronounced compared to auxiliary outcomes such as RT. We develop a Bayesian decision-theoretic framework that uses both primary and auxiliary outcomes for interim and final decision-making. The framework allows investigators to control standard frequentist operating characteristics, such as the type I error rate, and can be used with auxiliary outcomes from emerging technologies, such as circulating tumor assays. False positive rates and other frequentist operating characteristics are rigorously controlled without any assumption about the concordance between primary and auxiliary outcomes. We discuss algorithms to implement this decision-theoretic approach and show that incorporating auxiliary information into interim and final decision-making can lead to relevant efficiency gains according to established and interpretable metrics.
Big data presents potential but unresolved value as a source for analysis and inference. However, selection bias, present in many of these datasets, needs to be accounted for so that appropriate inferences can be made on the target population. One way of approaching the selection bias issue is to first estimate the propensity of inclusion in the big dataset for each member of the big dataset, and then to apply these propensities in an inverse probability weighting approach to produce population estimates. In this paper, we provide details of a new variant of existing propensity score estimation methods that takes advantage of the ability to integrate the big data with a probability sample. We compare the ability of this method to produce efficient inferences for the target population with several alternative methods through an empirical study.
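The weighting step described above can be sketched in a few lines. The example assumes the inclusion propensities have already been estimated (the paper's estimation method is not shown) and uses a ratio-type (Hájek) weighted mean; the function name is hypothetical.

```python
def ipw_mean(values, propensities):
    """Inverse-probability-weighted mean of values observed in the big dataset."""
    if any(not 0.0 < p <= 1.0 for p in propensities):
        raise ValueError("propensities must lie in (0, 1]")
    # Each unit stands in for 1 / propensity units of the target population.
    weights = [1.0 / p for p in propensities]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)
```

For instance, with values 1 and 3 and propensities 0.5 and 0.25, the weights are 2 and 4 and the estimate is (2·1 + 4·3) / 6 = 7/3, pulling the mean toward the under-represented unit.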
We continue to study the learning-theoretic foundations of generation by extending the results from Kleinberg and Mullainathan [2024] and Li et al. [2024] to account for noisy example streams. In the noiseless setting of Kleinberg and Mullainathan [2024] and Li et al. [2024], an adversary picks a hypothesis from a binary hypothesis class and provides a generator with a sequence of its positive examples. The goal of the generator is to eventually output new, unseen positive examples. In the noisy setting, an adversary still picks a hypothesis and a sequence of its positive examples. But, before presenting the stream to the generator, the adversary inserts a finite number of negative examples. Unaware of which examples are noisy, the goal of the generator is to still eventually output new, unseen positive examples. In this paper, we provide necessary and sufficient conditions for when a binary hypothesis class can be noisily generatable. We provide such conditions with respect to various constraints on the number of distinct examples that need to be seen before perfect generation of positive examples. Interestingly, for finite and countable classes we show that generatability is largely unaffected by the presence of a finite number of noisy examples.
We study the mixing time of the projected Langevin algorithm (LA) and the privacy curve of noisy Stochastic Gradient Descent (SGD), beyond nonexpansive iterations. Specifically, we derive new mixing time bounds for the projected LA which are, in some important cases, dimension-free and poly-logarithmic in the accuracy, closely matching the existing results in the smooth convex case. Additionally, we establish new upper bounds for the privacy curve of the subsampled noisy SGD algorithm. These bounds show a crucial dependency on the regularity of gradients, and are useful for a wide range of convex losses beyond the smooth case. Our analysis relies on a suitable extension of the Privacy Amplification by Iteration (PABI) framework (Feldman et al., 2018; Altschuler and Talwar, 2022, 2023) to noisy iterations whose gradient map is not necessarily nonexpansive. This extension is achieved by designing an optimization problem which accounts for the best possible Rényi divergence bound obtained by an application of PABI, where the tractability of the problem is crucially related to the modulus of continuity of the associated gradient mapping. We show that, in several interesting cases -- including the nonsmooth convex, weakly smooth and (strongly) dissipative cases -- such an optimization problem can be solved exactly and explicitly. This yields the tightest possible PABI-based bounds, where our results are either new or substantially sharper than those in previous works.
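For readers unfamiliar with the projected Langevin algorithm, the iteration can be sketched as a gradient step plus Gaussian noise followed by a projection onto the constraint set. The example below assumes the constraint set is a Euclidean ball centered at the origin and a constant step size; the function names and these choices are illustrative, and nothing here reproduces the paper's mixing-time or privacy analysis.

```python
import math
import random

def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(v * v for v in x))
    return list(x) if norm <= radius else [v * radius / norm for v in x]

def projected_langevin(grad, x0, step, n_steps, radius, seed=0):
    """Iterate x <- Proj(x - step * grad(x) + sqrt(2 * step) * N(0, I))."""
    rng = random.Random(seed)
    x = project_ball(x0, radius)
    path = [x]
    for _ in range(n_steps):
        g = grad(x)
        noisy = [xi - step * gi + math.sqrt(2 * step) * rng.gauss(0, 1)
                 for xi, gi in zip(x, g)]
        x = project_ball(noisy, radius)
        path.append(x)
    return path
```

Calling `projected_langevin(lambda x: x, [0.0, 0.0], 0.01, 1000, 1.0)` targets a standard Gaussian restricted to the unit ball, and every iterate stays inside the ball by construction.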