New in Stata 19
We are excited to introduce you to the new features in Stata 19. See below for the highlights.
Machine learning via H2O: Ensemble decision trees — Machine learning methods are often used to solve research and business problems focused on prediction that require more advanced modeling than linear or generalized linear models. Ensemble decision tree methods, which combine multiple trees for better predictions, are popular for such tasks. H2O is a scalable machine learning platform that supports data analysis and machine learning, including ensemble decision tree methods such as random forest and gradient boosting machine (GBM).
The new h2oml suite of Stata commands is a wrapper for H2O that provides end-to-end support for H2O machine learning analysis using ensemble decision tree methods. After using the h2o commands to initiate or connect to an existing H2O cluster, you can use the h2oml commands to perform GBM and random forest for regression and classification problems. The h2oml suite offers tools for hyperparameter tuning, validation, cross-validation, evaluating model performance, obtaining predictions, and explaining these predictions. For example,
. h2o init
— Initiate H2O from within Stata
. _h2oframe put, into(dataframe) current
— Import data from Stata into H2O
. h2oml gbbinclass response predictors, ntrees(20(10)200) lrate(0.1(0.1)1)
— Perform gradient boosting binary classification, tuning the number of trees and the learning rate
. h2omlgraph varimp
— Assess variable importance
. _h2oframe change newdata
. h2omlpredict outcome_pred
— Make predictions
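We could also explore why the model predicts what it does. As an illustrative sketch (we assume h2omlgraph subcommands for SHAP and partial-dependence graphs; see the h2oml documentation for exact names and syntax), we might type:
. h2omlgraph shapsummary
— Summarize SHAP values across predictors (illustrative subcommand)
. h2omlgraph pdp x1
— Plot partial dependence for a predictor, where x1 is a placeholder name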
And there’s much more.
The h2oml suite offers ensemble decision tree methods in an easily accessible way by using familiar Stata syntax or the point-and-click interface. With prediction explainability tools such as Shapley additive explanations (SHAP) values, partial dependence plots, and variable importance rankings, GBM and random forest provide powerful predictions while maintaining explainability, with no tradeoffs needed.

Conditional average treatment effects (CATE) — Treatment effects estimate the causal effect of a treatment on an outcome. This effect may be constant, or it may vary across different subpopulations. Researchers are often interested in whether and how treatment effects differ.
A labor economist may want to know the effect of a job training program on earnings only for those who participate in the program.
An online shopping company may want to know the effect of a price discount on purchasing behavior for customers with different demographic characteristics, such as age and income.
A medical team may want to measure the effect of smoking on stress levels for individuals in different age groups.
With the new cate command, you can go beyond estimating an overall treatment effect to estimating individualized or group-specific effects that address these types of research questions. The cate command can estimate three types of CATEs: individualized average treatment effects, group average treatment effects, and sorted group average treatment effects. Beyond estimation, the cate suite provides features to predict, visualize, and make inferences about the CATEs.
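As a hypothetical sketch (variable names and syntax details here are ours; see the cate documentation for exact syntax), estimating CATEs for outcome y, binary treatment t, and covariates x1 and x2 might look like:
. cate aipw (y x1 x2) (t)
— Estimate CATEs by using the augmented inverse-probability-weighting estimator (illustrative syntax)
. predict te_hat
— Obtain individualized treatment-effect predictions (illustrative)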
The cate command is powerful, flexible, and robust. It models the outcome and treatment by using lasso, generalized random forest (sometimes called honest forest), or parametric models. It provides two robust estimators (partialing out and augmented inverse-probability weighting) to guard against machine learning mistakes, and it uses cross-fitting to avoid overfitting.

High-dimensional fixed effects (HDFE) — You can now absorb not just one but multiple high-dimensional categorical variables in your linear regression, with or without fixed effects, and in linear models accounting for endogeneity using two-stage least squares. This is useful when you want your model to be adjusted for these variables but estimating their effects is not of interest and is computationally expensive.
The areg, xtreg, fe, and ivregress 2sls commands now allow the absorb() option to be specified with multiple categorical variables. Previously, areg allowed only one variable in absorb(), while xtreg, fe and ivregress 2sls did not allow the option.
For example, we could fit a regression model that adjusts for three high-dimensional categorical predictors c1, c2, and c3 by typing:
. areg y x, absorb(c1 c2 c3)
If we wanted to absorb these variables in a fixed-effects model, we could do that, too:
. xtset panelvar
. xtreg y x, fe absorb(c1 c2 c3)
And in an instrumental-variables regression model, we can type:
. ivregress 2sls y1 x1 (y2 x2), absorb(c1 c2 c3)
Bayesian variable selection for linear regression — The new bayesselect command provides a flexible Bayesian approach to identifying the subset of predictors that are most relevant to your outcome. It accounts for model uncertainty when estimating model parameters and performs Bayesian inference for regression coefficients. It uses a familiar syntax:
. bayesselect y x1-x100
As with other Bayesian regression procedures in Stata, posterior means, posterior standard deviations, Monte Carlo standard errors, and credible intervals are reported for each predictor for easy interpretation. Additionally, either inclusion coefficients or inclusion probabilities, depending on the selected prior, are reported to indicate each predictor's importance in modeling the outcome.

bayesselect is fully integrated into Stata's Bayesian suite and works seamlessly with all Bayesian postestimation routines, including prediction:
. bayespredict pmean, mean
Marginal Cox PH models for interval-censored multiple-events data — Interval-censored multiple-event data commonly arise in longitudinal studies because each study subject may experience several types of events and those events are not observed directly but are known to occur within some time interval. For example, an epidemiologist studying chronic diseases might collect data on patients with multiple conditions, such as heart disease and metabolic disease, during different doctor visits. Similarly, a sociologist might conduct surveys to record major life events, such as job changes and marriages, at regular intervals.
You can now fit a marginal proportional hazards model for such data. The new stmgintcox command can accommodate single- and multiple-record-per-event data and supports time-varying covariates for all events or specific ones.
For example, let’s say we have data on multiple events coded in the event variable that take place between the times recorded in ltime and rtime, with covariates x1-x3. We could simultaneously model the influence of the covariates on the time until each event by using the command:
. stmgintcox x1 x2 x3, id(id) event(event) interval(ltime rtime)
From here, we could test the average effect of x1 across events by typing:
. estat common x1
We could also graph survivor and other functions for both events:
. stcurve, survival
Evaluate goodness of fit for each event:
. estat gofplot
And much more.
Meta-analysis for correlations — The meta suite now supports meta-analysis of correlation coefficients, allowing investigation of the strength and direction of relationships between variables across multiple studies. For instance, you may have studies reporting the correlation between education and income levels or between physical activity and improvements in mental health and wish to perform meta-analysis.
Say variables corr and ntotal represent the correlation and the total number of subjects in each study, respectively. We can use these variables to declare our data by using the meta esize command:
. meta esize corr ntotal, correlation studylabel(studylbl)
Because the variance of the untransformed correlation depends on the correlation itself, we may prefer to use Fisher’s z-transformed correlation, a variance-stabilizing transformation that is particularly preferable when correlations are close to -1 or 1:
. meta esize corr ntotal, fisherz studylabel(studylbl)
All standard meta-analysis features, such as forest plots and subgroup analysis, are supported:
. meta forestplot, correlation
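We can also perform the meta-analysis and explore subgroup results with meta summarize; here groupvar stands in for a study-level grouping variable of your choosing:
. meta summarize, subgroup(groupvar)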
These additions make meta esize one of the most flexible tools for meta-analysis available.

Correlated random-effects (CRE) model — Easily fit CRE models to panel data with the new cre option of the xtreg command. Consider the following commands to fit a CRE model with time-varying regressor x and time-invariant regressor z:
. xtset panelvar
. xtreg y x z, cre vce(cluster panelvar)
A random-effects model may yield inconsistent estimates if there is correlation between the covariates and the unobserved panel-level effects. A fixed-effects model wouldn’t allow estimation of the coefficient on the time-invariant regressor z. CRE models offer the best of both worlds: the coefficient on z is estimated, and the estimated coefficients on time-varying regressors such as x match those from xtreg, fe.
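To see the connection, we could refit the model with fixed effects (an illustrative comparison; z is dropped because it is time-invariant) and compare the coefficient on x:
. xtreg y x, fe vce(cluster panelvar)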
Panel-data vector autoregressive (VAR) model — Fit vector autoregressive (VAR) models to panel data! Compute impulse–response functions, perform Granger causality tests and stability tests, include additional covariates, and much more. The new xtvar command has syntax and postestimation procedures similar to those of var, but it is appropriate for panel data rather than time-series data.
For example, we could fit a VAR model to a panel dataset with three outcomes of interest by typing:
. xtset panelvar
. xtvar y1 y2 y3, lags(2)
Then, we can perform a Granger causality test:
. vargranger
Or graph impulse–response functions:
. irf create baseline, set(irfs)
. irf graph irf
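Stability of the estimated panel VAR can presumably be checked with the familiar varstable postestimation command (our assumption, by analogy with the var suite):
. varstable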
Bayesian bootstrap and replicate weights — You can use the new bayesboot prefix to perform Bayesian bootstrap of statistics produced by official and community-contributed commands. To compute a Bayesian bootstrap estimate of the mean of x, which is returned by summarize as r(mean), we type:
. bayesboot r(mean): summarize x
You can also use the new rwgen command and new options for the bootstrap prefix to implement specialized bootstrap schemes. rwgen generates standard replication and Bayesian bootstrap weights. bootstrap has new fweights() and iweights() options for performing bootstrap replications using custom weights: fweights() allows users to specify frequency-weight variables for resampling, and iweights() lets users provide importance-weight variables. These options extend bootstrap’s flexibility by allowing user-supplied weights instead of internal resampling, making it easier to implement specialized bootstrap schemes and enhance reproducibility. bayesboot itself is a wrapper for rwgen and bootstrap that generates importance weights from a Dirichlet distribution and applies these weights when bootstrapping.
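As a purely hypothetical sketch of the two-step route (option and variable names here are ours, not verbatim syntax; see the rwgen and bootstrap documentation):
. rwgen bayes w, reps(200)
— Generate 200 Bayesian (Dirichlet) replicate-weight variables w1-w200 (hypothetical syntax)
. bootstrap r(mean), iweights(w1-w200): summarize x
— Bootstrap the mean by using the generated importance weights (hypothetical syntax)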
Control-function linear and probit models — Fit control-function linear and probit models with the new cfregress and cfprobit commands. Control-function models offer a more flexible alternative to traditional instrumental-variables (IV) methods by including the endogenous variable itself and its first-stage residual in the main regression; the residual term is called a control function.
For example, we could reproduce the estimates of a 2SLS IV regression:
. cfregress y1 x (y2 = z1 z2)
But we could also use a binary endogenous variable and include the interaction of the control function with z1:
. cfregress y1 x (y2bin = z1 z2, probit interact(z1))
Afterward, we could test for endogeneity by jointly testing the control function and the interaction:
. estat endogenous
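For a binary outcome, the companion cfprobit command fits the probit analog. Assuming a syntax parallel to cfregress (our assumption; see the cfprobit documentation), we might type:
. cfprobit y1bin x (y2 = z1 z2)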
Bayesian quantile regression via asymmetric Laplace likelihood — The qreg command for quantile regression is now compatible with the bayes prefix. In the Bayesian framework, we combine the asymmetric Laplace likelihood function with priors to provide full posterior distributions for quantile regression coefficients.
. bayes: qreg y x1 x2
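By default, qreg models the median. Other quantiles can be requested through qreg’s quantile() option; for example, for the first quartile:
. bayes: qreg y x1 x2, quantile(0.25)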
Consequently, the asymmetric Laplace distribution is also available as a new likelihood function in bayesmh.

. bayesmh y x1 x2, likelihood(asymlaplaceq({scale},0.5)) ///
        prior({y:}, normal(0,10000)) block({y:}) ///
        prior({scale}, igamma(0.01,0.01)) block({scale})
You can also use the asymmetric Laplace likelihood in bayesmh for random-effects quantile regression and simultaneous quantile regression or to model nonnormal outcomes with pronounced skewness and kurtosis.
All implementations support standard Bayesian features, such as MCMC diagnostics, hypothesis testing, and prediction.
. bayesgraph diagnostics
Inference robust to weak instruments — To estimate a linear regression of y1 on x1 and endogenous regressor y2 that is instrumented by z1 via 2SLS, we would type:
. ivregress 2sls y1 x1 (y2 = z1)
When the instrument, z1, is only weakly correlated with the endogenous regressor, y2, inference can become unreliable even in relatively large samples. The new estat weakrobust postestimation command after ivregress performs Anderson–Rubin or conditional likelihood-ratio (CLR) tests on the endogenous regressors. These tests are robust to the instrument being weak.
. estat weakrobust
This postestimation command supports all ivregress estimators: 2SLS, LIML, and GMM. The tests performed by estat weakrobust not only are robust to weak instruments but also account for the robust, cluster-robust, or heteroskedasticity- and autocorrelation-consistent variance–covariance estimator used in ivregress.

SVAR models via instrumental variables — The new ivsvar command estimates the parameters of structural vector autoregressive (SVAR) models by using instrumental variables.
. ivsvar gmm y1 y2 (shock = z1 z2)
These estimated parameters can be used to trace out dynamic causal effects, known as structural impulse–response functions (IRFs), using the familiar irf suite of commands.
. irf set ivsvar.irf
. irf create model1
. irf graph sirf, impulse(shock)
For multiple instruments, use the minimum distance estimator with ivsvar mdist, and specify how the instruments are related to the target shocks.
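A hypothetical sketch with two target shocks and three instruments (syntax details here are ours; see the ivsvar documentation for specifics):
. ivsvar mdist y1 y2 (shock1 shock2 = z1 z2 z3)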
Instrumental-variables local-projection IRFs — With the new ivlpirf command, you can account for endogeneity when using local projections to estimate dynamic causal effects. Local projections are used to estimate the effects of shocks on outcome variables. When the shock of interest is on an impulse variable that may be endogenous, ivlpirf can be used to estimate the IRFs, and the impulse variable may be instrumented using one or more exogenous instruments.
For example, let’s say we are interested in estimating structural IRFs for the effects of an increase in x on y, using iv as an instrument for the endogenous impulse x:
. ivlpirf y, endogenous(x = iv)
We can then use the irf suite of commands to graph these IRFs:
. irf set ivlp.irf, replace
. irf create ivlp
. irf graph csirf
Mundlak specification test — Use the new estat mundlak postestimation command after xtreg to choose between random-effects (RE) and fixed-effects (FE) or correlated random-effects (CRE) models. Unlike a Hausman test, we do not need to fit both the RE and FE models to perform a Mundlak test — we just need one!
Again, consider the following model with time-varying regressor x and time-invariant regressor z:
. xtreg y x z, vce(cluster clustvar)
. estat mundlak
The estat mundlak command tests the null hypothesis that x is uncorrelated with the unobserved panel-level effects. Rejecting the hypothesis suggests that fitting an FE or CRE model, which accounts for time-invariant unobserved heterogeneity, is more sensible than fitting an RE model.
Latent class model-comparison statistics — When you perform latent class analysis or finite mixture modeling, determining the number of latent classes that best fits your data is fundamental. With the new lcstats command, you can use statistics such as entropy and a variety of information criteria, as well as the Lo–Mendell–Rubin (LMR) adjusted likelihood-ratio test and the Vuong–Lo–Mendell–Rubin (VLMR) likelihood-ratio test, to help you determine the appropriate number of classes.
For example, you might fit one-class, two-class, and three-class models and store their results by typing:
. gsem (y1 y2 y3 y4 <- ), logit lclass(C 1)
. estimates store oneclass
. gsem (y1 y2 y3 y4 <- ), logit lclass(C 2)
. estimates store twoclass
. gsem (y1 y2 y3 y4 <- ), logit lclass(C 3)
. estimates store threeclass
Then you can obtain model-comparison statistics and tests by typing:
. lcstats
The lcstats command offers options for specifying which statistics and tests to report and for customizing the look of the table. Results from lcstats are automatically placed in a collection, which means they are easy to customize further and export to a variety of file types by using the collect commands.

Do-file Editor: Autocompletion, templates, and more — The Do-file Editor now includes autocompletion, templates, and other enhancements.
Graphics: Bar graph CIs, heat maps, and more — Stata 19 introduces several new graphics features, including confidence intervals on bar graphs, heat maps via the new twoway heatmap command, the new groupyvars option, and enhancements to twoway plots.
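As an illustrative sketch of the new heat-map command (we assume a contour-style zvar yvar xvar syntax, with hypothetical variable names; see the twoway heatmap documentation):
. twoway heatmap frequency month weekday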
Tables: Easier tabulations, exporting, and more — Stata 19 introduces several new table features, including new title(), note(), and export() options; an anova collection style with results stored in r(ANOVA); improvements to collect get for better result labels and layout control; a new fvlevels() option; and a new collect() option.
Stata in French — Stata’s menus and dialogs can now be displayed in French. If your computer language is set to French, Stata will automatically switch to this setting. You can also change the language manually via the preferences or with the set locale_ui command.
Additional new features — Stata 19 includes many other enhancements. For a complete list of available features, visit the Stata website.
Buy online or contact our sales team for a customised quote.
Save yourself valuable time. Find out about available training courses and resources to become proficient in Stata.