moreClarify

moreClarify is a new Stata package for transforming the raw output of regression models into any quantity of interest (QOI) through simulation and resampling methods. This approach has been discussed, among others, by Dowd, Greene and Norton (2014); Gelman and Hill (2007); King, Tomz and Wittenberg (2000); and Krinsky and Robb (1986).

The moreClarify package comprises three ado-files:

  • postsim simulates the main and ancillary parameters of a statistical model from their asymptotic sampling distribution (Gelman et al. 2003, ch. 4). Specifically, typing

postsim exp_list, reps(#): command

executes command and simulates reps(#) values of the parameters in exp_list from a multivariate normal distribution with mean vector e(b) and covariance matrix e(V). exp_list is optional and defaults to _b.

  • simqoi calculates predicted and expected values of the outcome variable using the simulated parameters previously stored by bootstrap or postsim. simqoi also includes a wrapper for simulating differences and ratios of two expected values.
  • sumqoi summarizes QOIs stored as variables in the data in memory.
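Pieced together from the examples below, the basic syntax of simqoi is roughly the following (a sketch of the options illustrated in this document, not the official syntax diagram):

simqoi using filename [, ev pv diff at(atspec) gen(newvarname) seed(#)]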

Users can further transform the simulations produced by simqoi into any function of interest through basic data manipulation. Point estimates and measures of uncertainty of those functions can be assessed with simple descriptive statistics (mean, median, standard deviation, centile-based confidence intervals, etc.). This technique is very convenient for tackling complex, non-linear functions of predicted or expected values that are difficult to derive with standard analytical methods (Gelman and Hill, 2007: ch. 7).
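For instance, suppose simqoi has stored two sets of simulated expected values in the variables ev_low and ev_high (hypothetical names) via its gen() option. A point estimate and a centile-based 95% confidence interval for the ratio of expected values then follow from basic commands:

generate ratio = ev_high / ev_low
summarize ratio
centile ratio, centile(2.5 97.5)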

simqoi currently supports the following estimation commands:

Command        Description
-------------------------------------------------------------
Linear regression models
  cnsreg       Constrained linear regression
  eivreg       Errors-in-variables regression
  regress      Linear regression
  skewnreg     Skew-normal linear regression
  skewtreg     Skew-t linear regression
  tobit        Tobit regression

Binary-response regression models
  biprobit     Bivariate probit regression
  blogit       Logistic regression for grouped data
  bprobit      Probit regression for grouped data
  cloglog      Complementary log-log regression
  hetprob      Heteroskedastic probit regression
  logistic     Logistic regression, reporting odds ratios
  logit        Logistic regression, reporting coefficients
  probit       Probit regression
  scobit       Skewed logistic regression

Discrete-response regression models
  mlogit       Multinomial (polytomous) logistic regression
  ologit       Ordered logistic regression
  oprobit      Ordered probit regression

Poisson regression commands
  gnbreg       Generalized negative binomial model
  nbreg        Negative binomial regression
  poisson      Poisson regression
  zinb         Zero-inflated negative binomial regression
  zip          Zero-inflated Poisson regression

Regression models with selection
  heckman      Heckman selection model
  heckprob     Probit model with sample selection

simqoi can be used in conjunction with svy estimation results; see [SVY] svy estimation.
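A minimal sketch, assuming postsim can prefix an svy estimation command just as it prefixes ordinary estimation commands (using one of Stata's survey example datasets):

webuse nhanes2f
svyset psuid [pweight=finalwgt], strata(stratid)
postsim, saving(svysims): svy: logit highbp height weight
simqoi using svysims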

moreClarify extends the popular Clarify package (Tomz et al. 2003). Compared to the latter, moreClarify has the following additional features:

  • calculates QOIs by simulating the parameters from either their asymptotic or their empirical (finite-sample) sampling distribution (via post-estimation simulation or bootstrap replication, respectively)
  • supports a greater variety of regression models
  • can be used in conjunction with estimation commands for complex survey data (svy estimation)
  • allows factor variables
  • makes it easier to build tables and graphs of results

Installation

moreClarify was written for Stata version 11.2 and above. It is publicly available for download from the Boston College Statistical Software Components (SSC) archive and can be installed within Stata by typing at the command line:

ssc install more_clarify

How to cite

Did you find this program helpful? If so, please cite:

Márquez Peña, Javier (2014), “moreClarify: Stata module to estimate quantities of interest through simulation and resampling methods,” Statistical Software Components S457851, Boston College Department of Economics.


References

The following are some useful references on the algorithm implemented by moreClarify:

  • Carsey, Thomas M., and Jeffrey J. Harden. Monte Carlo Simulation and Resampling Methods for Social Science. Sage, 2013. (chapter 9)
  • Dowd, Bryan E., William H. Greene, and Edward C. Norton. “Computation of Standard Errors.” Health Services Research (2014).
  • Gelman, Andrew, and Jennifer Hill. Data Analysis using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007. (chapter 7)
  • Herron, Michael C. “Postestimation Uncertainty in Limited Dependent Variable Models.” Political Analysis 8, no. 1 (1999): 83-98.
  • King, Gary, Michael Tomz, and Jason Wittenberg. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44, no. 2 (2000): 347-361.
  • Krinsky, Itzhak and A. Leslie Robb. “On Approximating the Statistical Properties of Elasticities.” The Review of Economics and Statistics 68, no. 4 (1986): 715-19.
  • ______. “On Approximating the Statistical Properties of Elasticities: A Correction.” The Review of Economics and Statistics 72, no. 1 (1990): 189-90.
  • ______. “Three Methods for Calculating the Statistical Properties of Elasticities: A Comparison.” Empirical Economics 16, no. 2 (1991): 199-209.
  • Mooney, Christopher Z. “Bootstrap Statistical Inference: Examples and Evaluations for Political Science.” American Journal of Political Science (1996): 570-602.
  • Mooney, Christopher Z., and Robert Duval. Bootstrapping: A Nonparametric Approach to Statistical Inference. No. 94-95. Sage, 1993.

Examples

The datasets used in the examples are available for download from my Dataverse.

Example: Hultman, Kathman and Shannon (2013)

In a recent article, Hultman et al. (2013) investigate why some United Nations (UN) peacekeeping operations are more successful than others in mitigating violence against civilians during civil wars. The authors argue that, due to logistical reasons, UN peacekeeping can protect civilians only if missions are composed of military troops and police in large numbers.

Hultman et al. test their hypothesis using monthly data on all intrastate armed conflicts in sub-Saharan Africa from 1991 to 2008. The outcome variable is the number of civilians killed in a conflict month by rebel groups or government forces (osvAll). The key explanatory variables are the lagged number of UN armed troops (troopLag) and UN police forces (policeLag) committed to a country in conflict. The authors fit a negative binomial model to adjust the relationship for other covariates that may be potential confounders.

Here, I reanalyze the model by Hultman et al. to illustrate how to implement the post-estimation simulation approach with the moreClarify package. I begin by replicating their model (Model 1, p. 884) as follows:

. use Peacekeeping, clear
. postsim, saving(sims) seed(1010): nbreg osvAll troopLag policeLag militaryobserversLag brv_AllLag  osvAllLagDum incomp epduration lntpop, cluster(conflict_id) nolog

Negative binomial regression                      Number of obs   =       3746
Dispersion           = mean                       Wald chi2(8)    =     468.16
Log pseudolikelihood = -6260.1541                 Prob > chi2     =     0.0000

                                   (Std. Err. adjusted for 36 clusters in conflict_id)
--------------------------------------------------------------------------------------
                     |               Robust
              osvAll |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
            troopLag |  -.5304471   .0855315    -6.20   0.000    -.6980857   -.3628084
           policeLag |  -9.902657   1.549289    -6.39   0.000    -12.93921   -6.866107
militaryobserversLag |   21.76187   3.956114     5.50   0.000     14.00803    29.51571
          brv_AllLag |   .0007062   .0006583     1.07   0.283    -.0005841    .0019965
        osvAllLagDum |   2.177359   .2810099     7.75   0.000      1.62659    2.728129
              incomp |   2.379391   .4264719     5.58   0.000     1.543521     3.21526
          epduration |  -.0005591   .0037848    -0.15   0.883    -.0079772    .0068589
              lntpop |   .7031065    .173572     4.05   0.000     .3629116    1.043301
               _cons |  -9.236753   2.084415    -4.43   0.000    -13.32213   -5.151374
---------------------+----------------------------------------------------------------
            /lnalpha |    2.82693   .2027178                      2.429611     3.22425
---------------------+----------------------------------------------------------------
               alpha |   16.89352   3.424617                      11.35446    25.13471
--------------------------------------------------------------------------------------
file sims.dta saved

The postsim prefix executes the regression command (nbreg) and simulates the model parameters from their asymptotic sampling distribution. The number of simulations can be specified with the reps() option; when omitted, as here, it defaults to 1,000. The simulations depend on the current random-number seed or on the number specified with the seed() option. To reproduce a previous set of simulations exactly, simply set the same seed.

The saving() option stores the simulations in the Stata data file sims.dta. This file contains 1,000 observations (the default number of simulations) for each parameter in the model:

. preserve
. use sims, clear
(postsim: nbreg)
. summarize 
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
osvAll_b_t~g |      1000   -.5321896    .0845418  -.7996467  -.1965389
osvAll_b_p~g |      1000   -9.915291    1.556912  -16.07831  -4.614525
osvAll_b_m~g |      1000    21.79973    3.967106   7.249298   34.71138
osvAll_b_b~g |      1000    .0007229    .0006738   -.001694   .0025403
osvAll_b_o~m |      1000    2.192715    .2894584   1.190763   3.124654
-------------+--------------------------------------------------------
osvAll_b_i~p |      1000    2.389948    .4435254   .9607835   3.638187
osvAll_b_e~n |      1000   -.0005191     .003814  -.0130663   .0105419
osvAll_b_l~p |      1000    .7097659    .1730003    .122183   1.290076
osvAll_b_c~s |      1000   -9.336153    2.104506  -15.77142  -1.562278
lnalpha_b_~s |      1000    2.838136    .2064805    2.16178   3.407831
. restore

As the output above shows, the means and standard deviations of the simulated parameters closely match the coefficients and standard errors in the regression table. We could improve the precision by increasing the number of simulations.
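For instance, assuming postsim's saving() option accepts the standard replace suboption (as bootstrap's does), we could rerun the simulation with ten times as many replications:

postsim, saving(sims, replace) reps(10000) seed(1010): nbreg osvAll troopLag policeLag militaryobserversLag brv_AllLag osvAllLagDum incomp epduration lntpop, cluster(conflict_id) nolog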

The next step is to translate the simulated parameters into QOIs. The simqoi command calculates expected and predicted values of the outcome variable using the simulations stored by postsim. For example, we can simulate the expected number of civilian deaths (ev option, the default) holding all the covariates at their means (the default) by typing:

. simqoi using sims

Expected values                     Number of reps   =    1000
Expression   : E(osvAll)
--------------------------------------------------------------
             |                              Percentile-based  
      osvAll |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |     5.7447   .9170259      4.142742    7.895735
--------------------------------------------------------------

Evaluated at:
             |  troopLag  policeLag  militar~g  brv_All~g  osvAllL~m     incomp  epdurat~n     lntpop 
             |    (mean)     (mean)     (mean)     (mean)     (mean)     (mean)     (mean)     (mean) 
-------------+----------------------------------------------------------------------------------------
          at |   .781478   .0389405   .0335359   20.11292    .235534   1.682779   64.79607   9.332259 

The table at the top of the output displays the mean, standard error, and centile-based confidence interval of the simulated expected values. The table at the bottom shows the covariate values at which the expected values were evaluated.

Similarly, we can simulate the predicted number of civilian deaths (pv option) holding the number of UN armed troops and police forces at zero with the at() option:

. simqoi using sims, seed(1111) pv at( (mean) _all  (zero) troopLag policeLag)

Predicted values                    Number of reps   =     661
Expression   : Pred(osvAll)
--------------------------------------------------------------
             |                              Percentile-based  
      osvAll |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |   21.51286   81.54334             0         164
--------------------------------------------------------------

Evaluated at:
             |  troopLag  policeLag  militar~g  brv_All~g  osvAllL~m     incomp  epdurat~n     lntpop 
             |    (zero)     (zero)     (mean)     (mean)     (mean)     (mean)     (mean)     (mean) 
-------------+----------------------------------------------------------------------------------------
          at |         0          0   .0335359   20.11292    .235534   1.682779   64.79607   9.332259 

The at() option also allows simulating QOIs at different values of one or more covariates. For example, the authors assess the effect of UN missions by estimating the expected number of civilian deaths as the number of UN military troops increases from 0 to 8 thousand (troopLag=(0/8)). They posit a scenario in which a war is fought over government control (incomp=2) and violence against civilians occurred in the previous month (osvAllLagDum=1). We can replicate their results as follows:

. simqoi using sims, ev at( (mean) _all incomp=2 osvAllLagDum=1 troopLag=(0/8) )

Expected values                     Number of reps   =    1000
Expression   : E(osvAll)
--------------------------------------------------------------
             |                              Percentile-based  
      osvAll |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
_at_1        |
       _cons |   101.8548   28.74764      57.66963    171.1605
-------------+------------------------------------------------
_at_2        |
       _cons |   59.67529   16.18271      34.16832    97.22126
-------------+------------------------------------------------
_at_3        |
       _cons |   35.21611   10.16581      19.26688    59.29638
-------------+------------------------------------------------
_at_4        |
       _cons |   20.93311   6.938258      10.66947    37.58585
-------------+------------------------------------------------
_at_5        |
       _cons |   12.53392   4.934629      5.600584    24.69425
-------------+------------------------------------------------
_at_6        |
       _cons |   7.559982   3.562636      2.875887    16.15812
-------------+------------------------------------------------
_at_7        |
       _cons |   4.593689   2.584239      1.431772    10.73503
-------------+------------------------------------------------
_at_8        |
       _cons |   2.812184   1.879528      .7172948    7.296161
-------------+------------------------------------------------
_at_9        |
       _cons |   1.734646   1.372339      .3577345    4.943484
--------------------------------------------------------------

Evaluated at:
             |  troopLag  policeLag  militar~g  brv_All~g  osvAllL~m     incomp  epdurat~n     lntpop 
             |  (values)     (mean)     (mean)     (mean)    (value)    (value)     (mean)     (mean) 
-------------+----------------------------------------------------------------------------------------
        at_1 |         0   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_2 |         1   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_3 |         2   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_4 |         3   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_5 |         4   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_6 |         5   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_7 |         6   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_8 |         7   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 
        at_9 |         8   .0389405   .0335359   20.11292          1          2   64.79607   9.332259 

simqoi stores its results in r(), so they can be accessed by subsequent commands. This feature is very helpful for creating tables and graphs for publication or presentation. For example, I created a LaTeX table using the r(Results) matrix and the outtable command:

matrix out = r(Results)'
outtable using textable, mat(out) nobox norow caption("Expected number of civilian deaths as UN troops increase")

Hultman et al. prefer to depict the effect of UN military troops graphically. We can convert the matrix of results into variables in the data in memory and plot them with the following snippet of code:

svmat out, names(col)
generate xaxis = (_n*1000)-1000 in 1/9
twoway (rarea lower upper xaxis) (line b xaxis)

(Figure: expected number of civilian deaths as UN troops increase)



Example: Dowd, Greene and Norton (2014)

Dowd et al. (2014) illustrate three techniques for computing standard errors of nonlinear functions of estimated parameters:

  • the delta method,
  • the Krinsky-Robb method, and
  • bootstrapping.

I recommend that you read this article before using moreClarify. Here, I provide some code for replicating the results of the Krinsky-Robb and bootstrap methods via the moreClarify package. (The delta method is implemented in Stata via the margins command.)
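For reference, the delta-method counterpart of the first quantity computed below can be obtained with Stata's official margins command (a self-contained sketch using the same margex data):

webuse margex
logit outcome c.age##i.sex
margins, at(age=50 sex=1)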

Setup:

. webuse margex
(Artificial data for margins)

Predicted probability that y=1 for a 50-year-old woman (Krinsky-Robb)

. set seed 12345
. postsim, saving(sims) reps(1000): logit outcome c.age##i.sex, nolog

Logistic regression                               Number of obs   =       3000
                                                  LR chi2(3)      =     562.70
                                                  Prob > chi2     =     0.0000
Log likelihood = -1084.7241                       Pseudo R2       =     0.2060
------------------------------------------------------------------------------
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .110599    .010689    10.35   0.000      .089649     .131549
         sex |
     female  |     1.3517    .622081     2.17   0.030     .1324438    2.570957
   sex#c.age |
     female  |  -.0104589   .0130144    -0.80   0.422    -.0359667    .0150489
       _cons |  -7.030922   .5024759   -13.99   0.000    -8.015757   -6.046088
------------------------------------------------------------------------------
file sims.dta saved

. simqoi using sims, at(age=50 sex=1)

Expected values                     Number of reps   =    1000
Expression   : Pr(outcome)
--------------------------------------------------------------
             |                              Percentile-based  
     outcome |Probability   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           1 |   .3373971   .0144265      .3091538    .3648118
--------------------------------------------------------------

Evaluated at:
             |       age     0b.sex      1.sex 
             |   (value)    (value)    (value) 
-------------+---------------------------------
          at |        50          0          1 

Predicted probability that y=1 for a 50-year-old woman (Bootstrap)

. set seed 12345
. bootstrap, saving(boots) reps(1000): logit outcome c.age##i.sex, nolog
(running logit on estimation sample)

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100
(output omitted)
..................................................   950
..................................................  1000

Logistic regression                             Number of obs      =      3000
                                                Replications       =      1000
                                                Wald chi2(3)       =    392.94
                                                Prob > chi2        =    0.0000
Log likelihood = -1084.7241                     Pseudo R2          =    0.2060
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .110599   .0104509    10.58   0.000     .0901156    .1310824
         sex |
     female  |     1.3517   .6160967     2.19   0.028     .1441728    2.559228
   sex#c.age |
     female  |  -.0104589   .0129793    -0.81   0.420    -.0358978      .01498
       _cons |  -7.030922   .4912338   -14.31   0.000    -7.993723   -6.068122
------------------------------------------------------------------------------

. simqoi using boots, at(age=50 sex=1)

Expected values                     Number of reps   =    1000
Expression   : Pr(outcome)
--------------------------------------------------------------
             |                              Percentile-based  
     outcome |Probability   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           1 |   .3386969   .0142557      .3108789    .3658603
--------------------------------------------------------------

Evaluated at:
             |       age     0b.sex      1.sex 
             |   (value)    (value)    (value) 
-------------+---------------------------------
          at |        50          0          1 

The partial effect of age on the probability that y=1 for a 50-year-old woman (Krinsky-Robb).

. simqoi using sims, diff at(age=(50 51) sex=1)

First differences                   Number of reps   =    1000
Expression   : Pr(outcome|at_2) - Pr(outcome|at_1)
--------------------------------------------------------------
             |                              Percentile-based  
     outcome |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           1 |   .0226869   .0018562      .0191297    .0263356
--------------------------------------------------------------

Evaluated at:
             |       age     0b.sex      1.sex 
             |  (values)    (value)    (value) 
-------------+---------------------------------
        at_1 |        50          0          1 
        at_2 |        51          0          1 

The partial effect of age on the probability that y=1 for a 50-year-old woman (Bootstrap).

. simqoi using boots, diff at(age=(50 51) sex=1)

First differences                   Number of reps   =    1000
Expression   : Pr(outcome|at_2) - Pr(outcome|at_1)
--------------------------------------------------------------
             |                              Percentile-based  
     outcome |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           1 |   .0228742   .0018176      .0192884    .0266424
--------------------------------------------------------------

Evaluated at:
             |       age     0b.sex      1.sex 
             |  (values)    (value)    (value) 
-------------+---------------------------------
        at_1 |        50          0          1 
        at_2 |        51          0          1 

Predicted probability that y=1 averaged over the sample (Krinsky-Robb).

. scalar N = _N
. gen sim_prob = 0 in 1/1000
(2000 missing values generated)

. forvalues i = 1/`=N' {
  2.         qui simqoi using sims, at(age=`=age[`i']' sex=`=sex[`i']') gen(foo)
  3.         qui replace sim_prob = sim_prob + (foo/N)
  4.         drop foo
  5. }

. sumqoi sim_prob
--------------------------------------------------------------------
            |                                    Percentile-based   
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
   sim_prob |      1000   .1698177   .0060776   .1577188    .181734 
--------------------------------------------------------------------

The partial effect of age on the probability that y=1 averaged over the sample (Krinsky-Robb).

. gen sim_diff = 0 in 1/1000
(2000 missing values generated)

. forvalues i = 1/`=N' {
  2.         qui simqoi using sims, diff at(age=(`=age[`i']' `=age[`i']+1')  sex=`=sex[`i']') gen(foo)
  3.         qui replace sim_diff = sim_diff + (foo/N)
  4.         drop foo
  5. }

. sumqoi sim_diff
--------------------------------------------------------------------
            |                                    Percentile-based   
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
   sim_diff |      1000   .0120133   .0006221   .0107428   .0131852 
--------------------------------------------------------------------



Example: Gelman and Hill (2007: ch. 7)

Gelman and Hill illustrate how simulation helps to estimate any complex function of predicted values. As an example, they posit a model to predict the district-by-district election outcome in the 1990 US House elections. The QOI is the predicted number of districts to be won by the Democrats.

Gelman and Hill first build a model to predict the 1988 election and then use the results to predict the 1990 election. The data comprise the electoral outcomes in 416 districts contested in the 1990 election. The outcome variable is the Democratic share of the two-party vote in 1988 (vote_88). The predictors are the Democratic vote share in the previous election (vote_86) and an indicator variable for incumbency status (incumbency_88). They restrict the estimation sample to districts contested in the 1988 election.

. use Congress, clear
(Gelman and Hill (2007), Section 7.3))
. set seed 1029
. postsim, saving(sims): reg vote_88 vote_86 incumbency_88 if vote_88 > .1 & vote_88 < .9

      Source |       SS       df       MS              Number of obs =     343
-------------+------------------------------           F(  2,   340) = 1211.32
       Model |  10.9727504     2  5.48637522           Prob > F      =  0.0000
    Residual |  1.53994656   340  .004529255           R-squared     =  0.8769
-------------+------------------------------           Adj R-squared =  0.8762
       Total |   12.512697   342  .036586833           Root MSE      =   .0673

-------------------------------------------------------------------------------
      vote_88 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      vote_86 |   .5832759   .0350453    16.64   0.000      .514343    .6522088
incumbency_88 |   .0770516    .007036    10.95   0.000      .063212    .0908912
        _cons |   .2019981   .0181997    11.10   0.000        .1662    .2377963
-------------------------------------------------------------------------------
file sims.dta saved

The next step is to predict the 1990 Democratic vote share using the results of the model above and a new set of values for the predictors. First, I create a new variable (seats) that will contain the simulations:

. set obs 1000
obs was 416, now 1000

. generate seats = 0

Then, I simulate predicted values for each district contested in 1990. The variable containing the previous vote now corresponds to the 1988 election, and the incumbency variable now corresponds to the incumbency status in 1990 (incumbency_90). I loop over districts by subscripting the variables:

. forvalues i = 1/416 {
  2.         qui simqoi using sims, pv gen(foo) at(vote_86=`=vote_88[`i']' incumbency_88=`=incumbency_90[`i']')
  3.         qui replace seats = seats + 1 if foo > .5
  4.         drop foo
  5. }

. sumqoi seats, statistics(median)
-------------------------------------------------------------------------------
            |                                               Percentile-based   
   Variable |       Obs       Mean    Std Dev        p50  [95% Conf  Interval] 
------------+------------------------------------------------------------------
      seats |      1000    247.734   2.977265        248        242        253 
-------------------------------------------------------------------------------

At the first iteration of the loop, the simqoi command generates 1,000 predicted values for the first district in the dataset. Each simulation is a predicted Democratic vote share for that district. For each simulation, I add 1 to the variable seats if the Democratic vote share is greater than 0.5, that is, if the Democratic candidate wins the seat. The loop then repeats this procedure for each district contested in 1990. In this way, at the end of the loop, the seats variable contains 1,000 simulations of the predicted number of districts won by the Democrats.

The predicted median is 248 seats, with a 95% confidence interval from 242 to 253 seats. We can approximate the entire predictive distribution by plotting the simulated values in a histogram:

histogram seats, discrete

(Figure: histogram of the simulated number of seats won by the Democrats)


Example: Brambor, Clark and Golder (2006)

Brambor et al. (2006) discuss how to properly interpret the parameters of multiplicative interaction models. The authors offer an example that addresses the effect of presidential elections on legislative fragmentation (Golder, 2006). The hypothesis is that presidential elections may reduce the number of political parties in the legislature if the number of presidential candidates is sufficiently low. The mechanism is a “coattails effect”; that is, the ability of presidential candidates to attract votes for legislative candidates of their same party.

Brambor et al. present the results of a normal linear model fitted with data from 528 legislative elections held in 51 established democracies from 1946 to 2000. The outcome variable is the effective number of electoral parties (ENEP), and the key explanatory variable is a continuous measure of the temporal proximity of presidential elections (proximity). The effective number of presidential candidates (pres_cand) modifies the effect of proximity on ENEP. The model also includes some control variables related to social heterogeneity and electoral system characteristics. I replicated their results and simulated the parameters with the postsim prefix:

. use Coattails
. generate log_magnitude=ln(magnitude)
(13 missing values generated)
. set seed 12345
. postsim, saving(sims): regress ENEP c.proximity##c.pres_cand c.ethnic##c.log_magnitude, robust cluster(country) noheader
                                           (Std. Err. adjusted for 51 clusters in country)
------------------------------------------------------------------------------------------
                         |               Robust
                    ENEP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
               proximity |  -3.442498   .4949233    -6.96   0.000     -4.43658   -2.448415
               pres_cand |   .2914018   .1664983     1.75   0.086    -.0430199    .6258235
                         |
 c.proximity#c.pres_cand |   .8186213   .2192167     3.73   0.000     .3783116    1.258931
                         |
                  ethnic |    .101851   .1405416     0.72   0.472    -.1804351    .3841371
           log_magnitude |  -.0449729   .2380302    -0.19   0.851    -.5230707    .4331248
                         |
c.ethnic#c.log_magnitude |    .345453   .1791566     1.93   0.060    -.0143936    .7052996
                         |
                   _cons |   3.014188   .3312822     9.10   0.000     2.348788    3.679588
------------------------------------------------------------------------------------------
file sims.dta saved

Brambor et al. stress that the coefficient of proximity must not be interpreted as the effect of a change in temporal proximity of presidential elections on legislative fragmentation. Instead, it indicates the effect of presidential elections when the number of presidential candidates is zero (pres_cand=0). Naturally, a substantive QOI should convey the effect of presidential elections for a more meaningful number of presidential candidates.

The authors present a figure that graphically shows the marginal effect (the difference of two expected values) of presidential elections across the observed range of pres_cand (from 1 to 6):

forvalues i=1/6 {
        simqoi using sims, diff at( proximity=(0 1) pres_cand=`i' )
}

The loop executes the command enclosed in braces over different values of pres_cand. To save the results, we might add a line of code to create a matrix named out:

forvalues i=1/6 {
	simqoi using sims, diff at( proximity=(0 1) pres_cand=`i' )
	mat out = nullmat(out) \ r(Results)'
}

Finally, we can graph the results by converting the matrix out into variables in the dataset in memory:

svmat out, names(col)
generate xaxis = _n in 1/6 
twoway (line b xaxis) (rline lower upper xaxis, lpattern(dash)), yline(0)

[Figure: marginal effect of presidential elections across the effective number of presidential candidates]


Example: Jackman (1994)

Most countries elect their legislatures in single-member districts. This type of electoral rule tends to exaggerate majorities for the winning party. For example, party A may receive 65 percent of the seats in the legislature with 55 percent of the national vote. A recurrent concern among political scientists is whether this “exaggeration effect” is symmetric for all parties; that is, whether party B, with the same vote share as party A, would also receive 65 percent of the seats. If the result is asymmetric, the electoral system is biased.

Jackman (1994) estimates the partisan bias in Australia using data from elections for the House of Representatives during 1949-1993. He posits the following model which characterizes the relationship between seats and votes in two-party systems:

\begin{aligned}  \ln\left(\frac{s}{1-s}\right)=\ln\beta+\rho\ln\left(\frac{v}{1-v}\right)+e  \end{aligned}

where s and v are a party’s proportion of seats and votes, respectively; and \rho and \ln\beta are unknown parameters that measure the exaggeration effect and the bias of the electoral system, respectively. The vote share required to obtain 50 percent of the seats is defined as:

\begin{aligned}  v*=\frac{\exp\left(-\frac{\widehat{\ln\beta}}{\hat{\rho}}\right)}{1+\exp\left(-\frac{\widehat{\ln\beta}}{\hat{\rho}}\right)}  \end{aligned}

In an unbiased two-party system, v*=0.5.
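Given point estimates of \rho and \ln\beta, v^{*} is a one-line computation with Stata's built-in invlogit() function. As a quick check, plugging in the coefficient values estimated below gives a point estimate close to the simulation-based mean reported later:

```
. display invlogit(-(-.0856601)/3.043835)    // roughly .507
```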

Mooney (1996) discusses some advantages of the bootstrap over traditional techniques for this and other similar applications. First, the standard error of v* cannot be estimated using analytic approaches. Post-estimation simulation may be helpful in such cases, but it assumes that \hat{\rho} and \widehat{\ln\beta} are normally distributed. Mooney (1996, p. 587) notes that for this assumption to hold, the stochastic component of the model must be log-normally distributed. However, we have no reason to expect that this would be the case. To overcome this problem, Mooney recommends drawing values from the empirical finite sampling distribution of v* through the bootstrap.

Below, I replicate Jackman’s analysis. The outcome variable is the Australian Labor Party’s share of seats (seats) and the explanatory variable is the ALP’s vote share (vote), both transformed to the logit scale. I also include in the model a user-specified constant term (mycons) that is equal to 1 for all observations.

. use Australia.dta
. gen seats=logit(alp_seats/nseats)
. gen vote=logit(alp_2pp/100)
. gen mycons=1
. set seed 12345
. bootstrap, reps(1000) saving(boots) nodots: regress seats vote mycons, nocons

Linear regression                               Number of obs      =        19
                                                Replications       =      1000
                                                Wald chi2(2)       =    103.00
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.8716
                                                Adj R-squared      =    0.8565
                                                Root MSE           =    0.1628
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
       seats |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        vote |   3.043835   .3280878     9.28   0.000     2.400794    3.686875
      mycons |  -.0856601   .0364848    -2.35   0.019     -.157169   -.0141512
------------------------------------------------------------------------------

The bootstrap command stores the replicated parameters in a new Stata data file named boots.dta. We could compute v* directly from the replications in the boots.dta data file, but my aim here is to show how to use simqoi in conjunction with bootstrap. Simulating \hat{\rho} and \widehat{\ln\beta} is equivalent to simulating two expected values: one holding vote=0 and mycons=1, and the other holding vote=1 and mycons=0:

. simqoi using boots, at(vote=0 mycons=1) generate(ln_beta) noattable

Expected values                     Number of reps   =    1000
Expression   : E(seats)
--------------------------------------------------------------
             |                              Percentile-based  
       seats |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |  -.0862786   .0364848     -.1574563    -.014482
--------------------------------------------------------------

. simqoi using boots, at(vote=1 mycons=0) gen(rho) noattable

Expected values                     Number of reps   =    1000
Expression   : E(seats)
--------------------------------------------------------------
             |                              Percentile-based  
       seats |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |   3.072609   .3280878      2.474461    3.766798
--------------------------------------------------------------

The generate() option generates two variables named rho and ln_beta corresponding to \hat{\rho} and \widehat{\ln\beta}. Then, I simply apply the appropriate function to obtain v^{*}:

. gen v_star=invlogit(-ln_beta/rho)
. sumqoi v_star
--------------------------------------------------------------------
            |                                    Percentile-based   
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
     v_star |      1000   .5071631   .0032958   .5011344   .5143192 
--------------------------------------------------------------------

The results indicate that the Labor party requires nearly 51% of the votes in order to gain 50% of the seats. This means that the electoral system is slightly biased against the ALP. We can test the hypothesis that the system is effectively biased against the ALP, v*>0.5, by counting the fraction of replicates that exceed 0.5:

. count if v_star>.5
  988

That is, \mathrm{Prob}\left(v^{*}>0.5\right)=988/1000=0.988, which indicates that we cannot reject the hypothesis that the system is biased against the ALP. We could also approximate the entire distribution of v^{*} by plotting the simulated values in a histogram.

. histogram v_star

[Figure: histogram of the simulated values of v*]



Example: DeBell (2010)

moreClarify supports statistical models for complex survey data. Stata’s svy suite of survey data commands includes procedures to accommodate one or more stages of clustering, stratification, sampling weights, finite-population correction, post-stratification and sub-population estimation. For introductory treatments of the design-based approach to survey data, I refer the reader to Lee and Forthofer (2006) and Heeringa et al. (2010). Kreuter and Valliant (2007) and Kolenikov (2010) describe some of the capabilities of Stata for design-based analysis of survey data.

I illustrate some basic features of moreClarify by replicating a logit model for vote choice in the 2008 United States presidential election (DeBell, 2010). The data come from the American National Election Studies (ANES), which consists of a series of pre-election and post-election interviews collected with a complex survey design. The aim of the original analysis is to show the differences that emerge when analyzing ANES data with and without design-based corrections. Full code for replicating the results is provided in DeBell (2010).

I begin by declaring the survey design for the dataset:

. use ANES_Example

. svyset [pweight=V080102], strata(V081206) psu(V081205)
      pweight: V080102
          VCE: linearized
  Single unit: missing
     Strata 1: V081206
         SU 1: V081205
        FPC 1: 

The dependent variable (voteobama) is coded 1 if the respondent voted for Barack Obama in the 2008 general election and 0 if he or she voted for another candidate. The model contains a rich set of predictors, but for the sake of brevity I focus on party identification.

. set seed 3030
. postsim, saving(sims): svy, subpop(subp): logit voteobama pid obamaft mccainft bibleWOG gaymil iraqworth obsinfo edu female black
(running logit on estimation sample)

Survey: Logistic regression
Number of strata   =         9                  Number of obs      =      2182
Number of PSUs     =        94                  Population size    = 1974.5504
                                                Subpop. no. of obs =      1424
                                                Subpop. size       = 1462.9686
                                                Design df          =        85
                                                F(  10,     76)    =     23.64
                                                Prob > F           =    0.0000
------------------------------------------------------------------------------
             |             Linearized
   voteobama |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         pid |  -3.679856   .3892411    -9.45   0.000    -4.453772   -2.905941
     obamaft |   8.226276   1.312318     6.27   0.000     5.617037    10.83551
    mccainft |  -5.352353   .7729222    -6.92   0.000    -6.889129   -3.815577
    bibleWOG |  -1.194492    .546729    -2.18   0.032    -2.281536   -.1074487
      gaymil |   .8651314   .3676627     2.35   0.021     .1341195    1.596143
   iraqworth |  -.7052859   .3645434    -1.93   0.056    -1.430096     .019524
     obsinfo |   1.033456   .6568545     1.57   0.119    -.2725467    2.339459
         edu |  -4.239279   1.189411    -3.56   0.001    -6.604146   -1.874412
      female |   .0142554   .2524267     0.06   0.955    -.4876365    .5161473
       black |   2.635305   .6767807     3.89   0.000     1.289684    3.980926
       _cons |   2.932714    1.00858     2.91   0.005     .9273872    4.938041
------------------------------------------------------------------------------
file sims.dta saved

The postsim command estimates the model and simulates the parameters. The simqoi command simulates the probability of voting for Obama (the expected value of the outcome variable). The at() option sets the values of the explanatory variables, first holding all variables at their means, and then varying party identification from “Strong Democrat” (pid=0) to “Strong Republican” (pid=1).

. levelsof pid, local(valpid)
0 .1666666716337204 .3333333432674408 .5 .6666666865348816 .8333333134651184 1

. simqoi using sims, at( (mean) _all pid=(`valpid') )

Expected values                     Number of reps   =    1000
Expression   : Pr(voteobama)
--------------------------------------------------------------
             |                              Percentile-based  
   voteobama |Probability   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
_at_1        |
       _cons |   .9042008   .0202621      .8608949    .9375672
-------------+------------------------------------------------
_at_2        |
       _cons |   .8377886    .025283      .7868602    .8822375
-------------+------------------------------------------------
_at_3        |
       _cons |   .7378628   .0295382      .6798552     .792517
-------------+------------------------------------------------
_at_4        |
       _cons |   .6047336   .0340827       .537995    .6684731
-------------+------------------------------------------------
_at_5        |
       _cons |   .4539979   .0395787      .3777684    .5333161
-------------+------------------------------------------------
_at_6        |
       _cons |   .3118633   .0422884      .2332206    .3985795
-------------+------------------------------------------------
_at_7        |
       _cons |   .1988416   .0392367      .1308382    .2862526
--------------------------------------------------------------

Evaluated at:
             |       pid    obamaft   mccainft   bibleWOG     gaymil  iraqworth    obsinfo        edu     female      black 
             |  (values)     (mean)     (mean)     (mean)     (mean)     (mean)     (mean)     (mean)     (mean)     (mean) 
-------------+--------------------------------------------------------------------------------------------------------------
        at_1 |         0   .5774391   .5187053   .6027796    .737661    .248444   .6334342   .7940228   .5513064   .1223376 
        at_2 |  .1666667   .5774391   .5187053   .6027796    .737661    .248444   .6334342   .7940228   .5513064   .1223376 
        at_3 |  .3333333   .5774391   .5187053   .6027796    .737661    .248444   .6334342   .7940228   .5513064   .1223376 
        at_4 |        .5   .5774391   .5187053   .6027796    .737661    .248444   .6334342   .7940228   .5513064   .1223376 
        at_5 |  .6666667   .5774391   .5187053   .6027796    .737661    .248444   .6334342   .7940228   .5513064   .1223376 
        at_6 |  .8333333   .5774391   .5187053   .6027796    .737661    .248444   .6334342   .7940228   .5513064   .1223376 
        at_7 |         1   .5774391   .5187053   .6027796    .737661    .248444   .6334342   .7940228   .5513064   .1223376 

By default, simqoi calculates the values of the covariates using the weights specified in the estimation command. If you request the noweights option, the weights are ignored.
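For example, to re-compute the expected values above using unweighted covariate means, one might type the following (a sketch; it assumes the sims file and the valpid local created earlier are still in scope):

```
. simqoi using sims, at( (mean) _all pid=(`valpid') ) noweights
```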

The following code replicates the analysis without accounting for the sample design and the graph below contrasts the results. As can be observed, not accounting for the survey design overestimates the probability of voting for Obama.

postsim, saving(sims, replace): logit voteobama pid obamaft mccainft bibleWOG gaymil iraqworth obsinfo edu female black
levelsof pid, local(valpid)
simqoi using sims, at( (mean) _all pid=(`valpid') )

[Figure: probability of voting for Obama across party identification, with and without design-based corrections]


Example: Williams and Whitten (2012)

Williams and Whitten (2012) find that researchers often neglect the dynamic specification of regression models with an autoregressive term. As a consequence, they miss opportunities for exploring long-term effects for substantively interesting scenarios.

As an illustration, Williams and Whitten re-analyze the work by Poe and Tate (1994). Poe and Tate sought to explain variations in the levels of governmental repression in countries around the world. They focus on governmental repression of personal integrity rights, which include murder, torture, forced disappearance and imprisonment of persons for their political views.

Poe and Tate test several hypotheses on the causes of human rights abuse using a sample of 153 countries for the period 1980-1987. The outcome variable (ainew) is an index tapping torture, imprisonment, and political killings and executions from a content analysis of Amnesty International reports. The model contains a rich set of predictors, but for brevity I shall focus on the effect of economic development measured as GNP per capita (pcgnp). The authors fit a normal linear model with robust standard errors and a lagged dependent variable. I replicate their results as follows:

. use Repression, clear
. set seed 54321
. postsim, saving(sims): reg ainew L.ainew polrtnew lpop popinc pcgnp pcginc i.left i.milctr2 i.britinfl i.iwar i.cwar, robust
Linear regression                                      Number of obs =    1071
                                                       F( 11,  1059) =  401.04
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.7740
                                                       Root MSE      =  .53021
------------------------------------------------------------------------------
             |               Robust
       ainew |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ainew |
         L1. |   .7295838   .0235022    31.04   0.000     .6834676    .7757001
             |
    polrtnew |  -.0447288   .0127771    -3.50   0.000    -.0698001   -.0196576
        lpop |   .0531552   .0096192     5.53   0.000     .0342804      .07203
      popinc |   .0081702   .0122863     0.66   0.506    -.0159381    .0322784
       pcgnp |   -.007612   .0031574    -2.41   0.016    -.0138075   -.0014165
      pcginc |  -.0009254   .0010606    -0.87   0.383    -.0030066    .0011558
      1.left |  -.0347249    .053811    -0.65   0.519    -.1403133    .0708634
   1.milctr2 |   .0456474   .0470973     0.97   0.333    -.0467672    .1380621
  1.britinfl |   -.029736    .039875    -0.75   0.456    -.1079791     .048507
      1.iwar |   .2077699   .0791685     2.62   0.009     .0524249    .3631149
      1.cwar |   .3270248   .0770346     4.25   0.000     .1758669    .4781827
       _cons |  -.0212486    .162264    -0.13   0.896    -.3396441    .2971469
------------------------------------------------------------------------------
file sims.dta saved

Williams and Whitten recommend estimating the long-term effects of explanatory variables by simulating the conditional distribution of the outcome variable over several time periods. Their preferred method is statistical simulation. As they acknowledge, the trick to convert short-term effects into long-term effects with simulation is fairly simple: we begin by simulating a value of the outcome variable at an initial value of the lag; the resulting value is then used as the value of the lagged outcome variable in the next period, and so on.

In a companion article (Williams and Whitten, 2011), the authors provide a wrapper for Clarify that automates this task. We can reach the same results using the simqoi command and a loop (Cox, 2002). Suppose we are interested in the expected value of repression of human rights over eight years holding all continuous variables at their mean and all factor (dummy) variables at their base level (zero):

local lag  2.434 //Initial lag
forvalues i = 1/9 { //Loop over periods
        simqoi using sims, at((mean) _cont (base) _factor L.ainew=(`lag'))
        local lag = el(r(b), 1, 1) //lag for next period
}

The local macro (lag) defined at the beginning of the code contains the initial value of the lagged outcome variable. The second line loops the commands enclosed in braces over nine periods of time. The third line simulates an expected value holding the lagged outcome variable at the value contained in the local lag while fixing all the other covariates at their means. Finally, the fourth line replaces the local macro lag with the resulting expected value contained in the matrix r(b). This value will be used as the lagged value at the next iteration, and so on.

Williams and Whitten elaborate a more intricate example varying economic development (pcgnp) at its minimum, mean and maximum in the sample:

foreach stat in min mean max { //Loop over scenarios
        local lag  2.434 //Initial lag
        forvalues i = 1/9 { //Loop over periods
                simqoi using sims, at((mean) _cont (base) _factor (`stat') pcgnp L.ainew=(`lag'))
                mat out = nullmat(out) \ r(Results)' //save results
                local lag = el(r(b), 1, 1) //lag for next period
        } 
}

The first line loops over the values of economic development, which are plugged into the at() option of the simqoi command. The line after the simqoi command appends the results of each iteration to the matrix out.

As in the previous examples, we can plot the results by converting the matrix out into variables in the dataset in memory:

. svmat out, names(col)
. egen xaxis = seq() in 1/27, from(1) to(9)
(1197 missing values generated)
. twoway  (rcapsym lower upper xaxis in 1/9) ///
>         (rcapsym lower upper xaxis in 10/18) ///
>         (rcapsym lower upper xaxis in 19/27)

[Figure: dynamic simulations of repression at the minimum, mean, and maximum levels of economic development]

Williams and Whitten use this graph to stress the virtues of dynamic simulation. Since the model includes an autoregressive term, the coefficient of pcgnp only indicates the immediate effect of that variable on the outcome variable at a given period of time (De Boef and Keele, 2008). The graph leads to substantive conclusions that are overlooked if we interpret the coefficients in the standard way. Note that a country with the maximum level of economic development is statistically indistinguishable from the other two hypothetical countries in year 1. But in the long run, economic development has a substantive effect in diminishing repression only in the country with the maximum level of development.



Example: Hanmer and Ozan Kalkan (2013)

In the previous examples, I showed how to compute QOIs varying the variable of interest while holding other covariates at fixed values. The most common practice is to hold the other covariates at their mean. However, Hanmer et al. (2013) argue that this “average case” approach can produce illogical, misleading or substantively meaningless results.

The authors advocate the use of what they call the “observed case” approach. This technique involves holding covariates at the observed values for each observation in the sample, calculating the relevant QOI, and then averaging the results over all the observations (see Gelman and Pardoe, 2007).

Hanmer et al. (2013) illustrate the differences between both approaches with a probit model for vote choice in the 2004 United States presidential election. The data come from the American National Election Studies (ANES) survey. The outcome variable (presvote) is the vote choice between George W. Bush (coded as 1) and John Kerry (coded as 0). The model contains a rich set of predictors, but for brevity I shall focus on the effect of retrospective economic evaluations (retecon).

I fitted the model and simulated the parameters with the postsim command:

. use Retrospective, clear
. set seed 1029
. postsim, saving(sims): probit presvote retecon partyid bushiraq ideol7b white female age educ1_7 income, nolog

Probit regression                                 Number of obs   =        383
                                                  LR chi2(9)      =     409.91
                                                  Prob > chi2     =     0.0000
Log likelihood = -60.413645                       Pseudo R2       =     0.7723
------------------------------------------------------------------------------
    presvote |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     retecon |   .8713392   .2728416     3.19   0.001     .3365795    1.406099
     partyid |   .5830335   .0869011     6.71   0.000     .4127106    .7533565
    bushiraq |   2.130149   .3800328     5.61   0.000     1.385299       2.875
     ideol7b |   .1805497   .1355239     1.33   0.183    -.0850723    .4461717
       white |    .091142   .3083371     0.30   0.768    -.5131875    .6954716
      female |  -.0805012   .2590734    -0.31   0.756    -.5882758    .4272733
         age |  -.0035819   .0083966    -0.43   0.670    -.0200389    .0128751
     educ1_7 |  -.0761311   .1027982    -0.74   0.459    -.2776118    .1253496
      income |   .0144323   .0272993     0.53   0.597    -.0390733    .0679379
       _cons |  -2.611197   .8943763    -2.92   0.004    -4.364142   -.8582515
------------------------------------------------------------------------------
file sims.dta saved

The code that follows simulates the average probability of voting for George W. Bush across the range of economic evaluations holding other covariates at their actual values. First, I create five variables that will contain the QOI for each category of economic evaluations.

. set obs 1000
obs was 383, now 1000
. gen pr1 = 0 in 1/1000
. gen pr2 = 0 in 1/1000
. gen pr3 = 0 in 1/1000
. gen pr4 = 0 in 1/1000
. gen pr5 = 0 in 1/1000

Then, I simulate the predicted probability of voting for George W. Bush for each observation (n=383) with a loop:

forvalues i=1/383 {
	simqoi using sims, ev gen(foo) at( retecon=(-1(.5)1) partyid=`=partyid[`i']' bushiraq=`=bushiraq[`i']' ideol7b=`=ideol7b[`i']' white=`=white[`i']' female=`=female[`i']' age=`=age[`i']' educ1_7=`=educ1_7[`i']' income=`=income[`i']')
	forvalues k=1/5 {
		replace pr`k' = pr`k' + (foo_at_`k'*(1/383))
	}
	drop foo_*
}

The nested loop adds the simulated probabilities of each observation (one for each category of retecon) to the variables I previously defined. These probabilities are weighted by a normalized weight, which is equivalent to adding all the probabilities up and dividing by the total number of observations. Then, we can summarize the variables that contain the average expected values with the sumqoi command:

. sumqoi pr1 pr2 pr3 pr4 pr5, format(%9.3g)
--------------------------------------------------------------------
            |                                    Percentile-based   
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
        pr1 |      1000       .445      .0279       .384       .498 
        pr2 |      1000       .488      .0161       .455       .518 
        pr3 |      1000       .533      .0159       .505       .566 
        pr4 |      1000       .584      .0305       .531       .647 
        pr5 |      1000       .642      .0508        .55       .743 
--------------------------------------------------------------------

Contrast the results with the expected value at the mean; that is, the probability of voting for George W. Bush across the range of economic evaluations holding the covariates at their means (the so-called “average case” approach):

. simqoi using sims, at(  retecon=(-1(.5)1) ) noattable

Expected values                     Number of reps   =    1000
Expression   : Pr(presvote)
--------------------------------------------------------------
             |                              Percentile-based  
    presvote |Probability   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
_at_1        |
           1 |   .3579818   .0967849      .1883508    .5669234
-------------+------------------------------------------------
_at_2        |
           1 |    .520966    .066425      .3967423    .6530514
-------------+------------------------------------------------
_at_3        |
           1 |   .6835008   .0535773      .5752257    .7820286
-------------+------------------------------------------------
_at_4        |
           1 |   .8129118   .0614659       .681004    .9161316
-------------+------------------------------------------------
_at_5        |
           1 |   .8973886   .0612192      .7555417    .9798493
--------------------------------------------------------------

As Hanmer et al. (2013) note, the results of the observed and average case approaches might differ substantially. For example, using the average case approach, the probability of voting for Bush for someone who believes the economy is much better (retecon=1) is almost 90%. In contrast, using the observed case approach, the probability is 64%.

Finally, we can replicate Figure 3 in Hanmer et al. (2013: 275) with the following snippet of code:

. generate mworse = pr1-pr3
. generate sworse = pr2-pr3
. generate sbetter = pr4-pr3
. generate mbetter = pr5-pr3

. sumqoi mworse sworse sbetter mbetter, normal
--------------------------------------------------------------------
            |                                      Normal-based     
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
     mworse |      1000  -.0885323   .0316273  -.1505206   -.026544 
     sworse |      1000  -.0453928   .0161388  -.0770242  -.0137614 
    sbetter |      1000   .0508781   .0190163   .0136069   .0881493 
    mbetter |      1000   .1082135   .0408644   .0281208   .1883063 
--------------------------------------------------------------------

. matrix out = r(Results)
. svmat out, names(matcol)
. gen xaxis = _n in 1/4
(996 missing values generated)
. twoway (rspike outlb outub xaxis) (scatter outmean xaxis), yline(0)

[Figure: replication of Figure 3 in Hanmer et al. (2013)]


Example: Precision

By default, postsim simulates 1,000 random values from the asymptotic sampling distribution of the parameters. The approximation to this distribution becomes more accurate as we increase the number of simulations.

Tomz et al. (2003) recommend verifying that the means of the simulated parameters equal the estimated parameters within the desired degree of precision. postsim stores the means and covariance matrix of the simulated parameters in r(b) and r(V), respectively. Thus, one can assess the difference between the estimated parameters in e(b) and the means of the simulated parameters in r(b) with a simple matrix operation. For example, consider the following Heckman selection model to explain the wage of women:

. webuse womenwk
. qui postsim, seed(2222): heckman wage educ age, select(married children educ age)
. mat check = e(b)-r(b)
. matlist check

             | wage_b_~n  wage_b_~e  wage_b_~s  select_~d  select_~n  select_~n  select_~e 
-------------+-----------------------------------------------------------------------------
        mean |  .0000226   .0004118  -.0229195   .0017437   -.001196   .0000458  -.0002126 

             | select_~s  athrho_~s  lnsigma~s 
-------------+---------------------------------
        mean |   .008097   .0029233   .0008365 

Now, suppose our tolerated level of precision is two decimal places:

. matlist check, format(%9.2f)

             | wage_b_~n  wage_b_~e  wage_b_~s  select_~d  select_~n  select_~n  select_~e 
-------------+-----------------------------------------------------------------------------
        mean |      0.00       0.00      -0.02       0.00      -0.00       0.00      -0.00 

             | select_~s  athrho_~s  lnsigma~s 
-------------+---------------------------------
        mean |      0.01       0.00       0.00 

We can improve the precision by increasing the number of simulations to, say, 5,000:

. qui postsim, seed(1357) reps(5000): heckman wage educ age, select(married children educ age)
. matlist e(b)-r(b), format(%9.2f)

             | wage_b_~n  wage_b_~e  wage_b_~s  select_~d  select_~n  select_~n  select_~e 
-------------+-----------------------------------------------------------------------------
        mean |      0.00      -0.00       0.00      -0.00      -0.00      -0.00       0.00 

             | select_~s  athrho_~s  lnsigma~s 
-------------+---------------------------------
        mean |      0.00       0.00       0.00 

Back to Top


Example: Bootstrap for survey data

The moreClarify package also supports the bootstrap for survey data. Bootstrapping complex survey data requires special resampling procedures, implemented in Stata via the user-written bsweights command (Kolenikov, 2010). Consider the following example using the NHANES II dataset, where the aim is to assess risk factors for high blood pressure. I created bootstrap weights with 500 replications, declared the survey design, and simulated the expected value of the outcome variable with the following snippet of code:

webuse nhanes2
generate cstrata = floor(sqrt(2*strata-1))
egen upsu = group(strata psu)
svyset upsu [pw=finalwgt], strata(cstrata)
bsweights bw, reps(500) n(-1) seed(10101) dots
svyset [pw=finalwgt], vce(bootstrap) bsrweight(bw*)
svy bootstrap, saving(boots) nodots: logistic highbp height weight age female
simqoi using boots

Back to Top


Example: Multiple imputation

The easiest way to simulate QOIs from multiply imputed data is via the Stata mi suite of commands. For example, consider the fictional heart attack data used in the Multiple-Imputation Reference Manual:

. webuse mheart1s20
(Fictional heart attack data; bmi missing)

. mi describe
  Style:  mlong
          last mi update 07feb2013 13:05:30, 189 days ago
  Obs.:   complete          132
          incomplete         22  (M = 20 imputations)
          ---------------------
          total             154
  Vars.:  imputed:  1; bmi(22)
          passive:  0
          regular:  5; attack smokes age female hsgrad
          system:   3; _mi_m _mi_id _mi_miss
         (there are no unregistered variables)

The data include 20 imputations. We can fit a model of the relationship between heart attacks and smoking, adjusted for other factors, with the mi estimate prefix command:

. postsim, saving(sims): mi estimate, post: logit attack smokes age bmi hsgrad female

Multiple-imputation estimates                     Imputations     =         20
Logistic regression                               Number of obs   =        154
                                                  Average RVI     =     0.0312
                                                  Largest FMI     =     0.1355
DF adjustment:   Large sample                     DF:     min     =    1060.38
                                                          avg     =  223362.56
                                                          max     =  493335.88
Model F test:       Equal FMI                     F(   5,71379.3) =       3.59
Within VCE type:          OIM                     Prob > F        =     0.0030
------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   1.198595   .3578195     3.35   0.001     .4972789    1.899911
         age |   .0360159   .0154399     2.33   0.020     .0057541    .0662776
         bmi |   .1039416   .0476136     2.18   0.029      .010514    .1973692
      hsgrad |   .1578992   .4049257     0.39   0.697    -.6357464    .9515449
      female |  -.1067433   .4164735    -0.26   0.798    -.9230191    .7095326
       _cons |  -5.478143   1.685075    -3.25   0.001    -8.782394   -2.173892
------------------------------------------------------------------------------
file sims.dta saved

mi estimate fits the model to the multiply imputed data and adjusts the coefficients and standard errors for the variability between imputations. The post option requests that the multiply imputed estimates be posted in e(b) and e(V). This allows postsim to simulate the multiply imputed parameters from their asymptotic sampling distribution.
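The pooling that mi estimate performs follows Rubin's rules: the pooled point estimate is the average of the per-imputation estimates, and the total variance combines the within- and between-imputation variances. A small numeric sketch in Python (with made-up per-imputation results for a single coefficient; not part of the package) illustrates the combination:

```python
import numpy as np

# Toy per-imputation results for one coefficient: point estimates q_m and
# within-imputation variances u_m from M = 20 hypothetical imputed datasets.
rng = np.random.default_rng(7)
M = 20
q = 1.2 + rng.normal(0, 0.05, size=M)   # estimates across imputations
u = np.full(M, 0.36**2)                  # squared standard errors

def rubin_pool(q, u):
    """Combine multiply imputed estimates by Rubin's rules."""
    M = len(q)
    qbar = q.mean()                      # pooled point estimate
    W = u.mean()                         # within-imputation variance
    B = q.var(ddof=1)                    # between-imputation variance
    T = W + (1 + 1 / M) * B              # total variance
    return qbar, np.sqrt(T)

est, se = rubin_pool(q, u)
```

The pooled standard error always exceeds the average within-imputation standard error, reflecting the extra uncertainty from the missing data.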

The esample() option of the mi estimate prefix can be useful to simulate QOIs at specific values of the covariates. For example, suppose that we want to simulate the expected value holding all the covariates at their means:

. mi convert flong
. qui postsim, saving(sims, replace): mi estimate, post esample(esample): logit attack smokes age bmi hsgrad female

The esample() option creates a new variable identifying which observations were used in the estimation. We may include this new variable in the simqoi command to restrict the computations to the estimation sample:

. simqoi using sims if esample

Expected values                     Number of reps   =    1000
Expression   : Pr(attack)
--------------------------------------------------------------
             |                              Percentile-based  
      attack |Probability   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           1 |   .4431237    .043053      .3591407    .5348867
--------------------------------------------------------------

Evaluated at:
             |    smokes        age        bmi     hsgrad     female 
             |    (mean)     (mean)     (mean)     (mean)     (mean) 
-------------+-------------------------------------------------------
          at |  .4155844   56.48829   25.24463   .7532468   .2467532 

Back to Top


Example: Transformed outcome variable

Simulating QOIs from a regression model with a transformed outcome variable requires special attention from the user. Suppose we fit a log-linear model in which the outcome variable Y is transformed into \ln(Y). As in the previous examples, we estimate the model and simulate its parameters with the postsim command:

. sysuse auto
(1978 Automobile Data)
. generate lnmpg=ln(mpg)
. postsim, saving(sims) seed(1234): regress lnmpg weight foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   98.42
       Model |  3.63136813     2  1.81568407           Prob > F      =  0.0000
    Residual |  1.30985522    71  .018448665           R-squared     =  0.7349
-------------+------------------------------           Adj R-squared =  0.7274
       Total |  4.94122335    73  .067687991           Root MSE      =  .13583
------------------------------------------------------------------------------
       lnmpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0003189   .0000254   -12.55   0.000    -.0003695   -.0002682
     foreign |  -.1028503   .0428956    -2.40   0.019    -.1883816   -.0173189
       _cons |   4.017887   .0863317    46.54   0.000     3.845746    4.190027
------------------------------------------------------------------------------
file sims.dta saved

If we are interested in computing the predicted value on the original scale, we only need to apply the inverse transformation, \exp(\cdot), to the simulated predicted values:

. simqoi using sims, pv at(foreign=0) gen(ypred) seed(4444)

Predicted values                    Number of reps   =    1000
Expression   : Pred(lnmpg)
--------------------------------------------------------------
             |                              Percentile-based  
       lnmpg |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |   3.053642   .1412604      2.769845    3.327075
--------------------------------------------------------------

Evaluated at:
             |    weight    foreign 
             |    (mean)    (value) 
-------------+----------------------
          at |  3019.459          0 

. replace ypred = exp(ypred)
(1000 real changes made)

. sumqoi ypred
--------------------------------------------------------------------
            |                                    Percentile-based   
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
      ypred |      1000   21.40353   3.014663   15.95637   27.85674 
--------------------------------------------------------------------

However, simulating expected values requires a bit more effort, since exponentiating the simulated expected values yields the wrong answer: E(\ln(Y))\neq\ln(E(Y)) (King et al., 2000).
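This inequality (a consequence of Jensen's inequality) is easy to verify numerically. The following Python sketch (numpy; the log-normal parameters are illustrative, not taken from the model above) shows that exponentiating the mean of \ln(Y) understates the mean of Y:

```python
import numpy as np

rng = np.random.default_rng(123)

# ln(Y) ~ N(3, 0.5^2): a log-normal outcome, as in a log-linear model.
lnY = rng.normal(3.0, 0.5, size=200_000)

wrong = np.exp(lnY.mean())     # exp(E[ln Y]) -- the naive back-transform
right = np.exp(lnY).mean()     # sample estimate of E[Y]

# For this log-normal, E[Y] = exp(3 + 0.5^2 / 2), strictly larger
# than exp(E[ln Y]) = exp(3).
```

The gap between `wrong` and `right` is exactly why the expected values must be averaged on the natural scale of Y, as done below.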

Assume we wish to simulate M=1{,}000 expected values, each the average of T=500 predicted values. We can simulate M\times T=500{,}000 predicted values at once and apply the inverse transformation, \exp(\cdot), to obtain the predicted values on the natural scale of Y:

. qui postsim, saving(sims, replace) reps(500000) seed(1111): regress lnmpg weight foreign

. simqoi using sims, pv at(foreign=0) gen(tmp) seed(2134)

Predicted values                    Number of reps   =  500000
Expression   : Pred(lnmpg)
--------------------------------------------------------------
             |                              Percentile-based  
       lnmpg |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |   3.055287   .1394414      2.781054    3.329541
--------------------------------------------------------------

Evaluated at:
             |    weight    foreign 
             |    (mean)    (value) 
-------------+----------------------
          at |  3019.459          0 

. replace tmp = exp(tmp)
(500000 real changes made)

Then, we can average the 500,000 transformed predicted values in blocks of 500 to get 1,000 expected values:

. egen M = seq(), from(1) to(1000) block(500)
. preserve
. collapse (mean) tmp, by(M)
. sumqoi tmp
--------------------------------------------------------------------
            |                                    Percentile-based   
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
        tmp |      1000   21.43464   .1339256   21.14932   21.69798 
--------------------------------------------------------------------
. restore
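The egen/collapse step above is just a block average. The same computation can be sketched in Python with a single reshape (numpy; the normal parameters loosely mimic the simulated \ln(Y) values above and are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2134)
M, T = 1_000, 500

# 500,000 simulated predicted values of ln(Y), back-transformed with exp().
pred = np.exp(rng.normal(3.05, 0.14, size=M * T))

# Average in blocks of T = 500 to obtain M = 1,000 expected values,
# mirroring the egen seq()/collapse step in the Stata snippet above.
ev = pred.reshape(M, T).mean(axis=1)
```

As in the Stata output, the expected values have a much smaller spread than the predicted values, because averaging removes the fundamental variability of individual predictions.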

This example is for illustrative purposes only; in particular, if you are tempted to fit a log-linear model, you should probably consider a Poisson model instead (see Silva and Tenreyro, 2006). Still, it conveys the fundamentals of simulating expected values with a transformed outcome variable.

Back to Top


Example: Exposures and offsets

The simqoi command simulates QOIs holding exposure and offset variables (if specified in the estimation command) at their means. Consider the following Poisson model of coronary disease deaths among male British doctors. I simulate the expected number of deaths using the simqoi command:

. webuse dollhill3
(Doll and Hill (1966))

. postsim, saving(sims): poisson deaths smokes i.agecat, exposure(pyears) nolog

Poisson regression                                Number of obs   =         10
                                                  LR chi2(5)      =     922.93
                                                  Prob > chi2     =     0.0000
Log likelihood = -33.600153                       Pseudo R2       =     0.9321
------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   .3545356   .1073741     3.30   0.001     .1440862     .564985
             |
      agecat |
      45-54  |   1.484007   .1951034     7.61   0.000     1.101611    1.866403
      55-64  |   2.627505   .1837273    14.30   0.000     2.267406    2.987604
      65-74  |   3.350493   .1847992    18.13   0.000     2.988293    3.712693
      75-84  |   3.700096   .1922195    19.25   0.000     3.323353     4.07684
             |
       _cons |  -7.919326   .1917618   -41.30   0.000    -8.295172   -7.543479
  ln(pyears) |          1  (exposure)
------------------------------------------------------------------------------
file sims.dta saved

. simqoi using sims

Expected values                     Number of reps   =    1000
Expression   : E(deaths)
--------------------------------------------------------------
             |                              Percentile-based  
      deaths |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |   73.43471   4.458152      65.66918    82.41131
--------------------------------------------------------------

Evaluated at:
             |    smokes  1b.agecat   2.agecat   3.agecat   4.agecat   5.agecat  ln(pyea~) 
             |    (mean)     (mean)     (mean)     (mean)     (mean)     (mean)   (offset) 
-------------+-----------------------------------------------------------------------------
          at |        .5         .2         .2         .2         .2         .2   9.806244 

Note that simqoi simulated the expected value holding the exposure at ln(mean(pyears)) = 9.8:

. sum pyears
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      pyears |        10     18146.7     17762.4       1462      52407

. di ln(r(mean))
9.806244

simqoi does not allow you to set values for offset or exposure variables, because those variables are not included in e(b) after estimation:

. simqoi using sims, at(pyears=18000)
'pyears' not found in list of covariates
r(322);

If you want to hold offset or exposure variables at values other than the mean, you can re-estimate the model using constraints. For example:

. generate ln_pyears=ln(pyears)
. constraint 1 _b[ln_pyears]==1
. postsim, saving(sims, replace): poisson deaths smokes i.agecat ln_pyears, constraint(1) nolog

Poisson regression                                Number of obs   =         10
                                                  Wald chi2(5)    =     658.34
Log likelihood = -33.600155                       Prob > chi2     =     0.0000

 ( 1)  [deaths]ln_pyears = 1
------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   .3545354   .1073741     3.30   0.001      .144086    .5649848
             |
      agecat |
      45-54  |   1.484007   .1951034     7.61   0.000     1.101611    1.866402
      55-64  |   2.627505   .1837273    14.30   0.000     2.267406    2.987604
      65-74  |   3.350493   .1847992    18.13   0.000     2.988293    3.712692
      75-84  |   3.700096   .1922195    19.25   0.000     3.323353     4.07684
             |
   ln_pyears |          1  (constrained)
       _cons |  -7.919326   .1917618   -41.30   0.000    -8.295172   -7.543479
------------------------------------------------------------------------------
file sims.dta saved

Constrained estimation produces exactly the same results as above. But now ln_pyears is a covariate, so we can assign a value to it. Suppose, for example, that we want to calculate the expected value holding pyears at its maximum (52,407):

. di ln(52407)
10.866795

. simqoi using sims, at(ln_pyears=10.866795)

Expected values                     Number of reps   =    1000
Expression   : E(deaths)
--------------------------------------------------------------
             |                              Percentile-based  
      deaths |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |   212.4839   12.78912       188.662    238.2317
--------------------------------------------------------------

Evaluated at:
             |    smokes  1b.agecat   2.agecat   3.agecat   4.agecat   5.agecat  ln_pyears 
             |    (mean)     (mean)     (mean)     (mean)     (mean)     (mean)    (value) 
-------------+-----------------------------------------------------------------------------
          at |        .5         .2         .2         .2         .2         .2   10.86679 

Back to Top


Example: Reparameterizations

postsim allows you to simulate the model parameters on a scale different from the one used at the estimation stage.

Consider the skewnreg command (Marchenko and Genton, 2010), which estimates the parameters in a centered metric but can recover them in their natural metric (the postdp option). The scale parameter, \omega, must be greater than zero, so it may be convenient to simulate \omega on an unbounded scale such as \ln\left(\omega\right). The postsim command facilitates this task:

. use ais
(Biological measures from athletes at the Australian Institute of Sport)
. set seed 9876
. postsim _b lnomega = ln([omega]_cons), saving(sims): skewnreg bmi bfat ssf, postdp nolog

Skew-normal regression                            Number of obs   =        202
                                                  Wald chi2(2)    =     119.41
Log likelihood =  -452.4138                       Prob > chi2     =     0.0000
------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bfat |  -.7473673   .0969447    -7.71   0.000    -.9373754   -.5573591
         ssf |   .1612525   .0173687     9.28   0.000     .1272105    .1952946
       _cons |   19.18597   .3935734    48.75   0.000     18.41458    19.95736
-------------+----------------------------------------------------------------
       alpha |     3.0108   .7449337                      1.550757    4.470843
-------------+----------------------------------------------------------------
       omega |   3.626689   .2673555                      3.102682    4.150696
------------------------------------------------------------------------------
LR test vs normal regression:        chi2(1) =    18.41   Prob > chi2 = 0.0000
file sims.dta saved

The postsim command includes an expression, lnomega=ln([omega]_cons), which instructs Stata to simulate \ln\left(\omega\right) and store the simulations in a variable named lnomega.
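The benefit of simulating on the log scale is that exponentiating back guarantees every draw respects the \omega>0 constraint. A small Python sketch (numpy; the point estimate and standard error are taken informally from the skewnreg output above, and the delta-method standard error of \ln(\omega) is an assumption of this illustration):

```python
import numpy as np

rng = np.random.default_rng(9876)

# Hypothetical scale estimate omega = 3.63 with std. err. 0.27; by the
# delta method, ln(omega) has approximate std. err. 0.27 / 3.63.
omega_hat, se = 3.63, 0.27
ln_sims = rng.normal(np.log(omega_hat), se / omega_hat, size=1_000)

# Reparameterize back: the exponentiated draws are always positive,
# whereas simulating omega directly on the natural scale could stray
# below zero for poorly estimated parameters.
omega_sims = np.exp(ln_sims)
```

This mirrors the replace omega_b_cons = exp(_eq4_lnomega) step shown next.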

Remember to reparameterize back before using simqoi:

. preserve 
. use sims, clear
(postsim: skewnreg)
. sumqoi _all
--------------------------------------------------------------------
            |                                    Percentile-based   
   Variable |       Obs       Mean    Std Dev  [95% Conf  Interval] 
------------+-------------------------------------------------------
 bmi_b_bfat |      1000  -.7445524   .0996439  -.9471111  -.5543838 
  bmi_b_ssf |      1000   .1606853   .0178495   .1278581   .1970901 
 bmi_b_cons |      1000   19.19352   .3927834   18.42276   19.96162 
alpha_b_c~s |      1000   3.003553    .767325   1.457948   4.500031 
omega_b_c~s |      1000   3.627206   .2743605   3.058911   4.154224 
_eq4_lnom~a |      1000   1.288463   .0756504   1.131765   1.433779 
--------------------------------------------------------------------

. replace omega_b_cons = exp(_eq4_lnomega)
(997 real changes made)
. drop _eq4_lnomega
. save sims, replace
file sims.dta saved
. restore
. simqoi using sims

Expected values                     Number of reps   =    1000
Expression   : E(bmi)
--------------------------------------------------------------
             |                              Percentile-based  
         bmi |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       _cons |   22.95108   .1823269      22.62017    23.27793
--------------------------------------------------------------

Evaluated at:
             |      bfat        ssf 
             |    (mean)     (mean) 
-------------+----------------------
          at |  13.50743   69.02178 

Back to Top

4 thoughts on “moreClarify”

  1. I am estimating a zero-inflated negative binomial. I used the program to extract the predicted counts, but cannot figure out how to extract the inflate-stage predicted probabilities. Can you offer some advice? Best, gb

    • Hi!
      Unfortunately, the probability of a zero outcome is only an intermediate step in moreClarify and is not reported. However, you can calculate Pr(y=0) using margins:

      . margins, atmeans predict(pr)

  2. Thank you for this. I am running a seemingly unrelated biprobit (Y1=Y2+Z) (Y2=Z).
    When I use moreClarify, I obtain the specified predicted probabilities for Y1 by default. Is there a way of obtaining separate probabilities for different joint scenarios for Y1 and Y2 (e.g. Y1=1 and Y2=1, Y1=0 and Y2=1, etc.)?
    Thank you
