New in Stata 19
We are excited to introduce you to the new features in Stata 19. See below for the highlights.
Machine learning via H2O: Ensemble decision trees — Machine learning methods are often used to solve research and business problems focused on prediction that require more advanced modeling than linear or generalized linear models. Ensemble decision tree methods, which combine multiple trees for better predictions, are popular for such tasks. H2O is a scalable machine learning platform that supports data analysis and machine learning, including ensemble decision tree methods such as random forest and gradient boosting machine (GBM).
The new h2oml suite of Stata commands is a wrapper for H2O that provides end-to-end support for H2O machine learning analysis using ensemble decision tree methods. After using the h2o commands to initiate or connect to an existing H2O cluster, you can use the h2oml commands to perform GBM and random forest for regression and classification problems. The h2oml suite offers tools for hyperparameter tuning, validation, cross-validation, evaluating model performance, obtaining predictions, and explaining these predictions. For example,
. h2o init
— Initiate H2O from within Stata
. _h2oframe put, into(dataframe) current
— Import data from Stata into H2O
. h2oml gbbinclass response predictors, ntrees(20(10)200) lrate(0.1(0.1)1)
— Perform gradient boosting binary classification, tuning the number of trees and the learning rate
. h2omlgraph varimp
— Assess variable importance
. _h2oframe change newdata
. h2omlpredict outcome_pred
— Make predictions
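We could also explore why the model predicts what it does. As an illustrative sketch (we assume h2omlgraph subcommands for SHAP and partial-dependence graphs; see the h2oml documentation for exact names and syntax), we might type:
. h2omlgraph shapsummary
— Summarize SHAP values across predictors (illustrative subcommand)
. h2omlgraph pdp x1
— Plot partial dependence for a predictor, where x1 is a placeholder name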
And there’s much more.
The h2oml suite offers ensemble decision tree methods in an easily accessible way by using familiar Stata syntax or the point-and-click interface. With prediction explainability tools such as Shapley additive explanations (SHAP) values, partial dependence plots, and variable importance rankings, GBM and random forest provide powerful predictions while maintaining explainability, with no tradeoffs needed.

Conditional average treatment effects (CATE) — Treatment effects estimate the causal effect of a treatment on an outcome. This effect may be constant, or it may vary across different subpopulations. Researchers are often interested in whether and how treatment effects differ.
A labor economist may want to know the effect of a job training program on earnings only for those who participate in the program.
An online shopping company may want to know the effect of a price discount on purchasing behavior for customers with different demographic characteristics, such as age and income.
A medical team may want to measure the effect of smoking on stress levels for individuals in different age groups.
With the new cate command, you can go beyond estimating an overall treatment effect to estimating individualized or group-specific effects that address these types of research questions. The cate command can estimate three types of CATEs: individualized average treatment effects, group average treatment effects, and sorted group average treatment effects. Beyond estimation, the cate suite provides features to predict, visualize, and make inferences about the CATEs.
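As a hypothetical sketch (variable names and syntax details here are ours; see the cate documentation for exact syntax), estimating CATEs for outcome y, binary treatment t, and covariates x1 and x2 might look like:
. cate aipw (y x1 x2) (t)
— Estimate CATEs by using the augmented inverse-probability-weighting estimator (illustrative syntax)
. predict te_hat
— Obtain individualized treatment-effect predictions (illustrative)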
The cate command is powerful, flexible, and robust. It models the outcome and treatment by using lasso, generalized random forest (sometimes called honest forest), or parametric models. It provides two robust estimators (partialing out and augmented inverse-probability weighting) to guard against machine learning mistakes, and it uses cross-fitting to avoid overfitting.

High-dimensional fixed effects (HDFE) — You can now absorb not just one but multiple high-dimensional categorical variables in your linear regression, with or without fixed effects, and in linear models accounting for endogeneity using two-stage least squares. This is useful when you want your model to be adjusted for these variables but estimating their effects is not of interest and is computationally expensive.
The areg, xtreg, fe, and ivregress 2sls commands now allow the absorb() option to be specified with multiple categorical variables. Previously, areg allowed only one variable in absorb(), while xtreg, fe and ivregress 2sls did not allow the option.
For example, we could fit a regression model that adjusts for three high-dimensional categorical predictors c1, c2, and c3 by typing:
. areg y x, absorb(c1 c2 c3)
If we wanted to absorb these variables in a fixed-effects model, we could do that, too:
. xtset panelvar
. xtreg y x, fe absorb(c1 c2 c3)
And in an instrumental-variables regression model, we can type:
. ivregress 2sls y1 x1 (y2 x2), absorb(c1 c2 c3)
Bayesian variable selection for linear regression — The new bayesselect command provides a flexible Bayesian approach to identifying the subset of predictors that are most relevant to your outcome. It accounts for model uncertainty when estimating model parameters and performs Bayesian inference for regression coefficients. It uses a familiar syntax:
. bayesselect y x1-x100
As with other Bayesian regression procedures in Stata, posterior means, posterior standard deviations, Monte Carlo standard errors, and credible intervals are reported for each predictor for easy interpretation. Additionally, either inclusion coefficients or inclusion probabilities, depending on the selected prior, are reported to indicate each predictor's importance in modeling the outcome.

bayesselect is fully integrated into Stata's Bayesian suite and works seamlessly with all Bayesian postestimation routines, including prediction:
. bayespredict pmean, mean
Marginal Cox PH models for interval-censored multiple-events data — Interval-censored multiple-event data commonly arise in longitudinal studies because each study subject may experience several types of events and those events are not observed directly but are known to occur within some time interval. For example, an epidemiologist studying chronic diseases might collect data on patients with multiple conditions, such as heart disease and metabolic disease, during different doctor visits. Similarly, a sociologist might conduct surveys to record major life events, such as job changes and marriages, at regular intervals.
You can now fit a marginal proportional hazards model for such data. The new stmgintcox command can accommodate single- and multiple-record-per-event data and supports time-varying covariates for all events or specific ones.
For example, let’s say we have data on multiple events coded in the event variable that take place between the times recorded in ltime and rtime, with covariates x1-x3. We could simultaneously model the influence of the covariates on the time until each event by using the command:
. stmgintcox x1 x2 x3, id(id) event(event) interval(ltime rtime)
From here, we could test the average effect of x1 across events by typing:
. estat common x1
We could also graph survivor and other functions for both events:
. stcurve, survival
Evaluate goodness of fit for each event:
. estat gofplot
And much more.
Meta-analysis for correlations — The meta suite now supports meta-analysis of correlation coefficients, allowing investigation of the strength and direction of relationships between variables across multiple studies. For instance, you may have studies reporting the correlation between education and income levels or between physical activity and improvements in mental health and wish to perform meta-analysis.
Say variables corr and ntotal represent the correlation and the total number of subjects in each study, respectively. We can use these variables to declare our data by using the meta esize command:
. meta esize corr ntotal, correlation studylabel(studylbl)
Because the variance of the untransformed correlation depends on the correlation itself, we may prefer to use Fisher’s z-transformed correlation, a variance-stabilizing transformation that is particularly preferable when correlations are close to -1 or 1:
. meta esize corr ntotal, fisherz studylabel(studylbl)
All standard meta-analysis features, such as forest plots and subgroup analysis, are supported:
. meta forestplot, correlation
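We can also perform the meta-analysis and explore subgroup results with meta summarize; here groupvar stands in for a study-level grouping variable of your choosing:
. meta summarize, subgroup(groupvar)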
These additions make meta esize one of the most flexible tools for meta-analysis available.

Correlated random-effects (CRE) model — Easily fit CRE models to panel data with the new cre option of the xtreg command. Consider the following commands to fit a CRE model with time-varying regressor x and time-invariant regressor z:
. xtset panelvar
. xtreg y x z, cre vce(cluster panelvar)
A random-effects model may yield inconsistent estimates if there is correlation between the covariates and the unobserved panel-level effects. A fixed-effects model wouldn’t allow estimation of the coefficient on the time-invariant regressor z. CRE models offer the best of both worlds: the coefficient on z is estimated, and the estimated coefficients on time-varying regressors such as x match those from xtreg, fe.
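To see the connection, we could refit the model with fixed effects (an illustrative comparison; z is dropped because it is time-invariant) and compare the coefficient on x:
. xtreg y x, fe vce(cluster panelvar)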
Panel-data vector autoregressive (VAR) model — Fit vector autoregressive (VAR) models to panel data! Compute impulse–response functions, perform Granger causality tests and stability tests, include additional covariates, and much more. The new xtvar command has syntax and postestimation procedures similar to those of var, but it is appropriate for panel data rather than time-series data.
For example, we could fit a VAR model to a panel dataset with three outcomes of interest by typing:
. xtset panelvar
. xtvar y1 y2 y3, lags(2)
Then, we can perform a Granger causality test:
. vargranger
Or graph impulse–response functions:
. irf create baseline, set(irfs)
. irf graph irf
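Stability of the estimated panel VAR can presumably be checked with the familiar varstable postestimation command (our assumption, by analogy with the var suite):
. varstable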
Bayesian bootstrap and replicate weights — You can use the new bayesboot prefix to perform Bayesian bootstrap of statistics produced by official and community-contributed commands. To compute a Bayesian bootstrap estimate of the mean of x, which is returned by summarize as r(mean), we type:
. bayesboot r(mean): summarize x
You can also use the new rwgen command and new options for the bootstrap prefix to implement specialized bootstrap schemes. rwgen generates standard replication and Bayesian bootstrap weights. bootstrap has new fweights() and iweights() options for performing bootstrap replications using custom weights: fweights() allows users to specify frequency-weight variables for resampling, and iweights() lets users provide importance-weight variables. These options extend bootstrap’s flexibility by allowing user-supplied weights instead of internal resampling, making it easier to implement specialized bootstrap schemes and enhance reproducibility. bayesboot itself is a wrapper for rwgen and bootstrap that generates importance weights from a Dirichlet distribution and applies these weights when bootstrapping.
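As a purely hypothetical sketch of the two-step route (option and variable names here are ours, not verbatim syntax; see the rwgen and bootstrap documentation):
. rwgen bayes w, reps(200)
— Generate 200 Bayesian (Dirichlet) replicate-weight variables w1-w200 (hypothetical syntax)
. bootstrap r(mean), iweights(w1-w200): summarize x
— Bootstrap the mean by using the generated importance weights (hypothetical syntax)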
Control-function linear and probit models — Fit control-function linear and probit models with the new cfregress and cfprobit commands. Control-function models offer a more flexible alternative to traditional instrumental-variables (IV) methods by including the endogenous variable itself and its first-stage residual in the main regression; the residual term is called a control function.
For example, we could reproduce the estimates of a 2SLS IV regression:
. cfregress y1 x (y2 = z1 z2)
But we could also use a binary endogenous variable and include the interaction of the control function with z1:
. cfregress y1 x (y2bin = z1 z2, probit interact(z1))
Afterward, we could test for endogeneity by jointly testing the control function and the interaction:
. estat endogenous
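For a binary outcome, the companion cfprobit command fits the probit analog. Assuming a syntax parallel to cfregress (our assumption; see the cfprobit documentation), we might type:
. cfprobit y1bin x (y2 = z1 z2)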
Bayesian quantile regression via asymmetric Laplace likelihood — The qreg command for quantile regression is now compatible with the bayes prefix. In the Bayesian framework, we combine the asymmetric Laplace likelihood function with priors to provide full posterior distributions for quantile regression coefficients.
. bayes: qreg y x1 x2
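By default, qreg models the median. Other quantiles can be requested through qreg’s quantile() option; for example, for the first quartile:
. bayes: qreg y x1 x2, quantile(0.25)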
Consequently, the asymmetric Laplace distribution is also available as a new likelihood function in bayesmh.

. bayesmh y x1 x2, likelihood(asymlaplaceq({scale},0.5)) ///
        prior({y:}, normal(0,10000)) block({y:}) ///
        prior({scale}, igamma(0.01,0.01)) block({scale})
You can also use the asymmetric Laplace likelihood in bayesmh for random-effects quantile regression and simultaneous quantile regression or to model nonnormal outcomes with pronounced skewness and kurtosis.
All implementations support standard Bayesian features, such as MCMC diagnostics, hypothesis testing, and prediction.
. bayesgraph diagnostics
Inference robust to weak instruments — To estimate a linear regression of y1 on x1 and endogenous regressor y2 that is instrumented by z1 via 2SLS, we would type:
. ivregress 2sls y1 x1 (y2 = z1)
When the instrument, z1, is only weakly correlated with the endogenous regressor, y2, inference can become unreliable even in relatively large samples. The new estat weakrobust postestimation command after ivregress performs Anderson–Rubin or conditional likelihood-ratio (CLR) tests on the endogenous regressors. These tests are robust to the instrument being weak.
. estat weakrobust
This postestimation command supports all ivregress estimators: 2SLS, LIML, and GMM. The tests performed by estat weakrobust not only are robust to weak instruments but also account for the robust, cluster-robust, or heteroskedasticity- and autocorrelation-consistent variance–covariance estimator used in ivregress.

SVAR models via instrumental variables — The new ivsvar command estimates the parameters of structural vector autoregressive (SVAR) models by using instrumental variables.
. ivsvar gmm y1 y2 (shock = z1 z2)
These estimated parameters can be used to trace out dynamic causal effects, known as structural impulse–response functions (IRFs), using the familiar irf suite of commands.
. irf set ivsvar.irf
. irf create model1
. irf graph sirf, impulse(shock)
For multiple instruments, use the minimum distance estimator with ivsvar mdist, and specify how the instruments are related to the target shocks.
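A hypothetical sketch with two target shocks and three instruments (syntax details here are ours; see the ivsvar documentation for specifics):
. ivsvar mdist y1 y2 (shock1 shock2 = z1 z2 z3)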
Instrumental-variables local-projection IRFs — With the new ivlpirf command, you can account for endogeneity when using local projections to estimate dynamic causal effects. Local projections are used to estimate the effects of shocks on outcome variables. When the shock of interest is on an impulse variable that may be endogenous, ivlpirf can be used to estimate the IRFs, and the impulse variable may be instrumented using one or more exogenous instruments.
For example, let’s say we are interested in estimating structural IRFs for the effects of an increase in x on y, using iv as an instrument for the endogenous impulse x:
. ivlpirf y, endogenous(x = iv)
We can then use the irf suite of commands to graph these IRFs:
. irf set ivlp.irf, replace
. irf create ivlp
. irf graph csirf
Mundlak specification test — Use the new estat mundlak postestimation command after xtreg to choose between random-effects (RE) and fixed-effects (FE) or correlated random-effects (CRE) models. Unlike a Hausman test, we do not need to fit both the RE and FE models to perform a Mundlak test — we just need one!
Again, consider the following model with time-varying regressor x and time-invariant regressor z:
. xtreg y x z, vce(cluster clustvar)
. estat mundlak
The estat mundlak command tests the null hypothesis that x is uncorrelated with the unobserved panel-level effects. Rejecting the hypothesis suggests that fitting an FE or CRE model, which accounts for time-invariant unobserved heterogeneity, is more sensible than fitting an RE model.
Latent class model-comparison statistics — When you perform latent class analysis or finite mixture modeling, determining the number of latent classes that best fits your data is fundamental. With the new lcstats command, you can use statistics such as entropy and a variety of information criteria, as well as the Lo–Mendell–Rubin (LMR) adjusted likelihood-ratio test and the Vuong–Lo–Mendell–Rubin (VLMR) likelihood-ratio test, to help you determine the appropriate number of classes.
For example, you might fit one-class, two-class, and three-class models and store their results by typing:
. gsem (y1 y2 y3 y4 <- ), logit lclass(C 1)
. estimates store oneclass
. gsem (y1 y2 y3 y4 <- ), logit lclass(C 2)
. estimates store twoclass
. gsem (y1 y2 y3 y4 <- ), logit lclass(C 3)
. estimates store threeclass
Then you can obtain model-comparison statistics and tests by typing:
. lcstats
The lcstats command offers options for specifying which statistics and tests to report and for customizing the look of the table. Results from lcstats are automatically placed in a collection, which means they are easy to customize further and export to a variety of file types by using the collect commands.

Do-file Editor: Autocompletion, templates, and more — The Do-file Editor now includes autocompletion, templates, and other enhancements.
Graphics: Bar graph CIs, heat maps, and more — Stata 19 introduces several new graphics features, including confidence intervals on bar graphs, heat maps via the new twoway heatmap command, the new groupyvars option, and enhancements to twoway plots.
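As an illustrative sketch of the new heat-map command (we assume a contour-style zvar yvar xvar syntax, with hypothetical variable names; see the twoway heatmap documentation):
. twoway heatmap frequency month weekday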
Tables: Easier tabulations, exporting, and more — Stata 19 introduces several new table features, including new title(), note(), and export() options; an anova collection style with results stored in r(ANOVA); improvements to collect get for better result labels and layout control; a new fvlevels() option; and a new collect() option.
Stata in French — Stata’s menus and dialogs can now be displayed in French. If your computer language is set to French, Stata will automatically switch to this setting. You can also change the language manually via the preferences or with the set locale_ui command.
Additional new features — Stata 19 includes many other enhancements. For a complete list of available features, visit the Stata website.
Buy online or contact our sales team for a customised quote.
Save yourself valuable time. Find out about available training courses and resources to become proficient in Stata.