2025-03-27 | Total: 9
We investigated the use of Empirical Mode Decomposition (EMD) combined with Gaussian Mixture Models (GMM), feature engineering and machine learning algorithms to optimize trading decisions. We used five-, two-, and one-year samples of hourly candle data for GameStop, Tesla, and XRP (Ripple) markets respectively. Applying a 15-hour rolling window for each market, we collected several features based on a linear model and other classical features to predict the next hour's movement. Subsequently, a GMM filtering approach was used to identify clusters among these markets. For each cluster, we applied the EMD algorithm to extract high, medium, low and trend components from each feature collected. A simple thresholding algorithm was applied to classify market movements based on the percentage change in each market's close price. We then evaluated the performance of various machine learning models, including Random Forests (RF) and XGBoost, in classifying market movements. A naive random selection of trading decisions was used as a benchmark, which assumed equal probabilities for each outcome, and a temporal cross-validation approach was used to test models on 40%, 30%, and 20% of the dataset. Our results indicate that transforming selected features using EMD improves performance, particularly for ensemble learning algorithms like Random Forest and XGBoost, as measured by accumulated profit. Finally, GMM filtering expanded the range of learning algorithm and data source combinations that outperformed the top percentile of the random baseline.
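The thresholding step described in the abstract above can be illustrated with a minimal sketch. The three-class labelling (up, flat, down), the function name `label_moves`, and the cut-off value are assumptions made for illustration; the abstract does not state the threshold or the number of classes.

```python
def label_moves(closes, threshold=0.002):
    """Label each hour-to-hour move as 1 (up), -1 (down) or 0 (flat).

    `threshold` is a hypothetical cut-off on the fractional change in the
    close price; it is not a value taken from the abstract.
    """
    labels = []
    for prev, curr in zip(closes, closes[1:]):
        change = (curr - prev) / prev  # fractional change in close price
        if change > threshold:
            labels.append(1)
        elif change < -threshold:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


closes = [100.0, 100.5, 100.4, 99.0, 99.1]
print(label_moves(closes))  # prints [1, 0, -1, 0]
```

A series of n close prices yields n - 1 labels, since each label compares one close with the previous one.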
Integrating data from multiple electronic sources allows researchers to expand studies without the cost of new data collection. However, such data are often collected for administrative or operational purposes rather than with specific future research questions in mind and, due to privacy constraints, unique identifiers are unavailable. This lack of direct identifiers requires the use of Record Linkage (RL) algorithms, which rely on partially identifying variables to probabilistically determine whether records belong to the same entity. Since these variables lack the strength to perfectly combine information, RL procedures typically yield an imperfect set of linked records. Therefore, assessing the false discovery rate (FDR) of RL is crucial for ensuring the reliability of subsequent analyses. In this paper, we introduce a novel method for estimating the FDR in RL by linking records from real and synthesised data. As synthetic records should never form links with real observations, they provide a means to estimate the FDR across different procedural settings. Notably, this method is applicable to all RL techniques. By identifying the FDR in RL results and selecting suitable model parameters, our approach makes it possible to assess and improve the reliability of linked data. We evaluate the performance of our procedure using established RL algorithms and benchmark data sets before applying it to link siblings from the Netherlands Perinatal Registry, where the reliability of previous RL applications has never been confirmed. Through this application, we highlight the importance of accounting for linkage errors when studying mother-child dynamics in healthcare records.
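The idea of using synthetic records as known non-matches can be sketched as follows. This is not the estimator from the paper, whose details the abstract does not give. It is a simple illustration under one stated assumption: false links fall on real and synthetic records at the same per-record rate, so the count of links to synthetic records can be rescaled to estimate the number of false links among real records. The function name and argument layout are hypothetical.

```python
def estimate_fdr(links, synthetic_ids, n_real, n_synthetic):
    """Rough FDR estimate for links made to real records.

    links         : list of (id_a, id_b) pairs returned by any RL procedure
    synthetic_ids : set of id_b values that are synthetic (always false links)
    n_real        : number of real records in the augmented file
    n_synthetic   : number of synthetic records added to that file

    Assumption (not from the abstract): false links hit real and synthetic
    records at the same per-record rate.
    """
    synthetic_links = sum(1 for _, id_b in links if id_b in synthetic_ids)
    real_links = len(links) - synthetic_links
    if real_links == 0:
        return 0.0
    expected_false_real = synthetic_links * n_real / n_synthetic
    return min(1.0, expected_false_real / real_links)


# Hypothetical toy result: 8 links to real records, 2 links to synthetic ones.
links = [(f"a{i}", f"r{i}") for i in range(8)] + [("a8", "s0"), ("a9", "s1")]
print(estimate_fdr(links, {"s0", "s1"}, n_real=100, n_synthetic=50))  # prints 0.5
```

Re-running the same calculation across different linkage thresholds would give one estimated FDR per setting, which is the kind of comparison across procedural settings the abstract refers to.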
Developing surrogates for computer models has become increasingly important for addressing complex problems in science and engineering. This article introduces an artificial intelligence (AI) surrogate, referred to as the DeepSurrogate, for analyzing functional outputs with vector-valued inputs. The relationship between the functional output and vector-valued input is modeled as an infinite sequence of unknown functions, each representing the relationship at a specific location within the functional domain. These spatially indexed functions are expressed through a combination of basis functions and their corresponding coefficient functions, both of which are modeled using deep neural networks (DNN). The proposed framework accounts for spatial dependencies across locations, while capturing the relationship between the functional output and scalar predictors. It also integrates a Monte Carlo (MC) dropout strategy to quantify prediction uncertainty, enhancing explainability in the deep neural network architecture. The proposed method enables efficient inference on datasets with approximately 50,000 spatial locations and 20 simulations, achieving results in under 10 minutes using standard hardware. The approach is validated on extensive synthetic datasets and a large-scale simulation from the Sea, Lake, and Overland Surges from Hurricanes (SLOSH) simulator. An open-source Python package implementing the method is made available.
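The Monte Carlo dropout strategy mentioned in the abstract above can be illustrated generically: dropout is left switched on at prediction time, the forward pass is repeated, and the spread of the outputs serves as an uncertainty estimate. The sketch below uses a tiny hand-written network with made-up weights. It is not the DeepSurrogate architecture or its package API, and every name in it is hypothetical.

```python
import random
import statistics


def forward(x, w_hidden, w_out, drop_prob, rng):
    """One stochastic forward pass of a one-hidden-layer ReLU network."""
    keep = 1.0 - drop_prob
    out = 0.0
    for w, v in zip(w_hidden, w_out):
        h = max(0.0, w * x)       # ReLU hidden unit
        if rng.random() < keep:   # dropout stays ON at prediction time
            out += v * h / keep   # inverted-dropout rescaling
    return out


def mc_dropout_predict(x, w_hidden, w_out, drop_prob=0.2, n_passes=500, seed=0):
    """Return the mean and standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    draws = [forward(x, w_hidden, w_out, drop_prob, rng) for _ in range(n_passes)]
    return statistics.mean(draws), statistics.stdev(draws)


# Hypothetical weights, chosen only for illustration.
w_hidden = [0.5, -1.0, 1.5, 2.0]
w_out = [1.0, 0.5, -0.5, 0.25]
mean, spread = mc_dropout_predict(2.0, w_hidden, w_out)
print(f"prediction {mean:.3f} +/- {spread:.3f}")
```

With `drop_prob=0.0` every pass returns the same deterministic output (0.5 for these weights), so the reported spread collapses to zero; the spread under dropout is what gets read as prediction uncertainty.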
A coefficient is introduced that quantifies the extent of separation of a random variable $Y$ relative to a number of variables $\mathbf{X} = (X_1, \dots, X_p)$ by skillfully assessing the sensitivity of the relative effects of the conditional distributions. The coefficient is as simple as classical dependence coefficients such as Kendall's tau, also requires no distributional assumptions, and consistently estimates an intuitive and easily interpretable measure, which is $0$ if and only if $Y$ is stochastically comparable relative to $\mathbf{X}$, that is, the values of $Y$ show no location effect relative to $\mathbf{X}$, and $1$ if and only if $Y$ is completely separated relative to $\mathbf{X}$. As a true generalization of the classical relative effect, in applications such as medicine and the social sciences the coefficient facilitates comparing the distributions of any number of treatment groups or categories. It hence avoids the sometimes artificial grouping of variable values such as a patient's age into just a few categories, which is known to cause inaccuracy and bias in the data analysis. The mentioned benefits are exemplified using synthetic and real data sets.
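For orientation, the classical two-sample relative effect that the coefficient above generalizes is usually defined as $P(X < Y) + \tfrac{1}{2} P(X = Y)$. The sketch below computes its plug-in estimate from two samples. It illustrates only this classical quantity, not the new coefficient, whose formula the abstract does not give; the function name is hypothetical.

```python
def relative_effect(x, y):
    """Plug-in estimate of P(X < Y) + 0.5 * P(X = Y) from two samples."""
    total = 0.0
    for a in x:
        for b in y:
            if a < b:
                total += 1.0
            elif a == b:
                total += 0.5   # ties count half
    return total / (len(x) * len(y))


print(relative_effect([1, 2], [3, 4]))   # prints 1.0 (completely separated samples)
print(relative_effect([1, 2], [1, 2]))   # prints 0.5 (identical samples, no location effect)
```

On this classical scale, 0.5 corresponds to no location effect and 0 or 1 to complete separation; the coefficient in the abstract is instead scaled so that 0 means stochastically comparable and 1 means completely separated.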
Forecasting revenues by aggregating analyst forecasts is a fundamental problem in financial research and practice. A key objective in this context is to improve the accuracy of the forecast by optimizing two performance metrics: the hit rate, which measures the proportion of correctly classified revenue surprise signs, and the win rate, which quantifies the proportion of individual forecasts that outperform an equally weighted consensus benchmark. While researchers have extensively studied forecast combination techniques, two critical gaps remain: (i) the estimation of optimal combination weights tailored to these specific performance metrics and (ii) the development of Bayesian methods for handling missing or incomplete analyst forecasts. This paper proposes novel approaches to address these challenges. First, we introduce a method for estimating optimal forecast combination weights using exponentially weighted hit and win rate loss functions via nonlinear programming. Second, we develop a Bayesian imputation framework that leverages exponentially weighted likelihood methods to account for missing forecasts while preserving key distributional properties. Through extensive empirical evaluations using real-world analyst forecast data, we demonstrate that our proposed methodologies yield superior predictive performance compared to traditional equally weighted and linear combination benchmarks. These findings highlight the advantages of incorporating tailored loss functions and Bayesian inference in forecast combination models, offering valuable insights for financial analysts and practitioners seeking to improve revenue prediction accuracy.
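An exponentially weighted hit rate of the kind named in the abstract above might be computed along these lines. The precise definitions are assumptions made for this sketch: the revenue surprise is taken as actual minus consensus, a hit is a period where the combined forecast lies on the same side of the consensus as the actual, and weights decay geometrically with the age of the period. The function name and the `decay` parameter are hypothetical.

```python
def exp_weighted_hit_rate(forecasts, consensus, actuals, decay=0.9):
    """Exponentially weighted share of periods with a correctly signed surprise.

    Assumed definitions (not from the abstract):
      surprise = actual - consensus
      hit      = forecast and actual lie on the same side of the consensus
      weight   = decay ** age, with the most recent period having age 0
    """
    n = len(actuals)
    weighted_hits = 0.0
    weight_sum = 0.0
    for t, (f, c, a) in enumerate(zip(forecasts, consensus, actuals)):
        weight = decay ** (n - 1 - t)
        hit = (f - c) * (a - c) > 0   # same sign of predicted and realized surprise
        weighted_hits += weight * hit
        weight_sum += weight
    return weighted_hits / weight_sum


# Hypothetical toy data: hits in the first and last period, a miss in the middle.
rate = exp_weighted_hit_rate(
    forecasts=[105.0, 98.0, 110.0],
    consensus=[100.0, 100.0, 100.0],
    actuals=[103.0, 101.0, 108.0],
    decay=0.5,
)
print(round(rate, 4))  # prints 0.7143
```

Because this quantity is a step function of the combination weights, optimizing it directly is a non-smooth problem, which is consistent with the abstract's use of nonlinear programming rather than least squares.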
Local sensitivity diagnostics for Bayesian models are described that are analogues of frequentist measures of leverage and influence. The diagnostics are simple to calculate using MCMC. A comparison between leverage and influence allows a general purpose definition of an outlier based on local perturbations. These outliers may indicate areas where the model does not fit well even if they do not influence model fit. The sensitivity diagnostics are closely related to predictive information criteria that are commonly used for Bayesian model choice. A diagnostic for prior-data conflict is proposed that may also be used to measure cross-conflict between different parts of the data.
We consider the problem of estimating the expected causal effect $E[Y|do(X)]$ for a target variable $Y$ when treatment $X$ is set by intervention, focusing on continuous random variables. In settings without selection bias or confounding, $E[Y|do(X)] = E[Y|X]$, which can be estimated using standard regression methods. However, regression fails when systematic missingness induced by selection bias, or confounding, distorts the data. Boeken et al. [2023] show that when training data is subject to selection, proxy variables unaffected by this process can, under certain constraints, be used to correct for selection bias to estimate $E[Y|X]$, and hence $E[Y|do(X)]$, reliably. When data is additionally affected by confounding, however, this equality is no longer valid. Building on these results, we consider a more general setting and propose a framework that incorporates both selection bias and confounding. Specifically, we derive theoretical conditions ensuring identifiability and recoverability of causal effects under access to external data and proxy variables. We further introduce a two-step regression estimator (TSR), capable of exploiting proxy variables to adjust for selection bias while accounting for confounding. We show that TSR coincides with prior work if confounding is absent, but achieves a lower variance. Extensive simulation studies validate TSR's correctness for scenarios which may include both selection bias and confounding with proxy variables.
Statistical learning methods typically assume that the training and test data originate from the same distribution, enabling effective risk minimization. However, real-world applications frequently involve distributional shifts, leading to poor model generalization. To address this, recent advances in causal inference and robust learning have introduced strategies such as invariant causal prediction and anchor regression. While these approaches have been explored for traditional structural equation models (SEMs), their extension to functional systems remains limited. This paper develops a risk minimization framework for functional SEMs using linear, potentially unbounded operators. We introduce a functional worst-risk minimization approach, ensuring robust predictive performance across shifted environments. Our key contribution is a novel worst-risk decomposition theorem, which expresses the maximum out-of-sample risk in terms of observed environments. We establish conditions for the existence and uniqueness of the worst-risk minimizer and provide consistent estimation procedures. Empirical results on functional systems illustrate the advantages of our method in mitigating distributional shifts. These findings contribute to the growing literature on robust functional regression and causal learning, offering practical guarantees for out-of-sample generalization in dynamic environments.
It is common to use shrinkage methods such as empirical Bayes to improve estimates of teacher value-added. However, when the goal is to perform inference on coefficients in the regression of long-term outcomes on value-added, it is unclear whether shrinking the value-added estimators helps or hurts. In this paper, we consider a general class of value-added estimators and the properties of their corresponding regression coefficients. Our main finding is that regressing long-term outcomes on shrinkage estimates of value-added performs an automatic bias correction: the associated regression estimator is asymptotically unbiased, asymptotically normal, and efficient in the sense that it is asymptotically equivalent to regressing on the true (latent) value-added. Further, OLS standard errors from regressing on shrinkage estimates are consistent. As such, efficient inference is easy for practitioners to implement: simply regress outcomes on shrinkage estimates of value-added.
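The bias-correction claim in the abstract above can be checked in a small simulation of the textbook normal-normal case. The sketch assumes the signal and noise variances are known, whereas empirical Bayes would estimate them, and it uses made-up parameter values; it illustrates the mechanism and is not the paper's general estimator class. All names are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope of the simple OLS regression of y on x (with intercept)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n_teachers=20000, beta=2.0, var_true=1.0, seed=0):
    """Return (slope on raw estimates, slope on shrunk estimates)."""
    rng = random.Random(seed)
    raw, shrunk, outcome = [], [], []
    for _ in range(n_teachers):
        v = rng.gauss(0.0, var_true ** 0.5)           # latent value-added
        noise_var = rng.choice([0.5, 1.0, 2.0])       # teacher-specific noise (assumed known)
        v_hat = v + rng.gauss(0.0, noise_var ** 0.5)  # noisy raw estimate
        lam = var_true / (var_true + noise_var)       # shrinkage factor
        raw.append(v_hat)
        shrunk.append(lam * v_hat)
        outcome.append(beta * v + rng.gauss(0.0, 1.0))  # long-term outcome
    return ols_slope(raw, outcome), ols_slope(shrunk, outcome)


raw_slope, shrunk_slope = simulate()
print(f"true beta 2.0 | raw: {raw_slope:.2f} | shrunk: {shrunk_slope:.2f}")
```

In this setup the slope on raw estimates is attenuated toward zero by measurement error, while the slope on shrunk estimates lands close to the true coefficient, because the shrunk estimate equals the conditional mean of the latent value-added given the raw estimate.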