**SAS**

We use a suite of macros written by Jon Kosanke and Erik Bergstralh at the Mayo Clinic. The

`dist`macro calculates the pairwise distances between observations, while the

`vmatch`macro makes matches based on the distances, finding the closest set of matches while minimizing the overall distance. The latter macro uses the

`nobs`macro. The original macros are available for download from the Mayo Clinic Division of Biomedical statistics and Informatics. Slightly enhanced versions of the

`nobs`and

`vmatch`macros are available from the code section of the book website. The enhanced versions are used to generate the results below.

We previously created the

`propen`data set containing the

`propensity`score variable. In that entry we also noted a lack of overlap in the propensity distributions, and decided to drop observations with propensity > 0.8.

First we drop observations with propensity greater than 0.8 using the subsetting

`if`statement (section 1.5.1). This leaves 201 homeless subjects. We then read in the macros using the

`%include`statement (section 2.1.8). Then we run the

`dist`macro, using its option to call the

`vmatch`macro.

data prop2; set propen;

if propensity lt .8;

run;

%include "C:\ken\sasmacros\vmatch.sas";

%include "C:\ken\sasmacros\dist.sas";

%include "C:\ken\sasmacros\nobs.sas";

%dist(data=prop2, group=homeless, id=id, mvars=propensity,

wts=1, vmatch=Y, a=1, b=1, lilm=201, dmax=0.1,

outm=mp1_b, summatch=n, printm=N, mergeout=mpropen);

The macros are well documented by text included at the top of the macro, as is common with SAS macros that authors share. In the preceding code, the parameter values in the first line are relatively self-explanatory. The

`wts`parameter, which allows multiple matching variables to be weighted differently. The

`dmax`parameter specifies the maximum distance acceptable for a match. We arbitrarily decide that the propensities must be within 0.1 to make a match (our results will differ if other criteria were specified for the matching, as we will see below in the R comparison). The remaining parameters request the matching, ask for one and only one match per case, for all the cases to be matched (if possible), suppress printed output, and name the data set to contain output.

The output data set

`mpropen`is identical to the input data set, with the addition of a new indicator variable,

`matched`. We can compare the distribution of the covariates before and after the matching with the

`means`procedure (section 2.1.1).

title "observed data";

proc means data=propen mean;

class homeless;

var age female i1 mcs;

run;

title "matched observations";

proc means data=mpropen mean;

where matched;

class homeless;

var age female i1 mcs;

run;

observed data

N

HOMELESS Obs Variable Mean

-----------------------------------------------

0 244 AGE 35.0409836

FEMALE 0.2745902

I1 13.5122951

MCS 32.4868303

1 209 AGE 36.3684211

FEMALE 0.1913876

I1 23.0382775

MCS 30.7308549

-----------------------------------------------

matched observations

N

HOMELESS Obs Variable Mean

-----------------------------------------------

0 201 AGE 35.6218905

FEMALE 0.1791045

I1 15.9154229

MCS 31.4815123

1 201 AGE 36.1492537

FEMALE 0.1990050

I1 19.9452736

MCS 30.9176772

-----------------------------------------------

We see that the covariates are much better balanced after matching. Also note the dramatic impact of removing the 8 (4%) cases with very large propensities. This has reduced the mean number of drinks by 15%!

We can now performs the analysis on the matched cases. The two classes of homeless status now have nearly equal distributions of the probability of homelessness.

proc glm data=mpropen;

where matched;

model pcs = homeless / solution;

run;

Standard

Parameter Estimate Error t Value Pr > |t|

Intercept 48.95273471 0.76199823 64.24 <.0001

HOMELESS -1.79386398 1.07762823 -1.66 0.0968

After matching, the effect of homelessness on physical health is attenuated and has a larger p-value.

**R**

In R, the

`Matching`library provides tools for matching and analysis. The

`Match()`function implements a variety of algorithms for multivariate matching including propensity score, Mahalanobis and inverse variance matching. The function is intended to be used in conjunction with the

`MatchBalance()`function which determines the extent to which covariate balance has been achieved. The function takes the propensity score as an argument, as well as the outcome to be compared and the group indicators.

A wide variety of matching options include matching with or without replacement, bias adjustment, different methods for handling ties, exact and caliper matching. The GenMatch function can be used to automatically find balance via a genetic search algorithm which determines the optimal weight to give each covariate.

An extensive website describes the package and the many variety of options that it supports, and a related paper is forthcoming. Three extended examples are included to help illustrate the mechanics of matching.

library(Matching)

rr = Match(Y=pcs, Tr=homeless, X=X, M=1)

The function returns an object describing the matching.

names(rr)

[1] "est" "se" "est.noadj" "se.standard"

[5] "se.cond" "mdata" "index.treated" "index.control"

[9] "index.dropped" "weights" "orig.nobs" "orig.wnobs"

[13] "orig.treated.nobs" "nobs" "wnobs" "caliper"

[17] "ecaliper" "exact" "ndrops" "ndrops.matches"

[21] "MatchLoopC" "version" "estimand"

The results can be displayed by running

`summary()`.

summary(rr)

Estimate... -0.80207

AI SE...... 1.4448

T-stat..... -0.55516

p.val...... 0.57878

Original number of observations.............. 453

Original number of treated obs............... 209

Matched number of observations............... 209

Matched number of observations (unweighted). 252

By default, the observations are given equal weight. If all of the observations had a weight of 1 on input, then each matched-pair will have a weight of 1 on output if there are no ties.

We see that the causal estimate of -0.80 in the matched comparison is not statistically significant (p=0.58), which is consistent with the other results that accounted for the confounders (though we note that the specific results depend on the particular options that are selected for the matching).

The

`MatchBalance()`function can be used to describe the distribution

of the predictors (by homeless status) before and after matching (to save space, only the results for

`age`and

`i1`are displayed). This is helpful to determine if the

matching resulted in similar marginal distributions.

> MatchBalance(form, match.out=rr, nboots=10)

***** (V1) age ***** Before Matching After Matching

mean treatment........ 36.368 36.368

mean control.......... 35.041 36.423

std mean diff......... 16.069 -0.65642

mean raw eQQ diff..... 1.5981 0.94841

med raw eQQ diff..... 1 1

max raw eQQ diff..... 7 10

mean eCDF diff........ 0.037112 0.022581

med eCDF diff........ 0.026365 0.019841

max eCDF diff........ 0.10477 0.083333

var ratio (Tr/Co)..... 1.3290 1.2671

T-test p-value........ 0.070785 0.93902

KS Bootstrap p-value.. < 2.22e-16 0.3

KS Naive p-value...... 0.16881 0.34573

KS Statistic.......... 0.10477 0.083333

***** (V3) i1 ***** Before Matching After Matching

mean treatment........ 23.038 23.038

mean control.......... 13.512 20.939

std mean diff......... 40.582 8.945

mean raw eQQ diff..... 9.6316 2.1071

med raw eQQ diff..... 8 1

max raw eQQ diff..... 73 66

mean eCDF diff........ 0.11853 0.018753

med eCDF diff........ 0.12377 0.011905

max eCDF diff........ 0.20662 0.087302

var ratio (Tr/Co)..... 2.3763 1.3729

T-test p-value........ 7.8894e-07 0.011786

KS Bootstrap p-value.. < 2.22e-16 0.3

KS Naive p-value...... 0.00013379 0.29213

KS Statistic.......... 0.20662 0.087302

More details regarding each of the tests for differences in means or distributions can be found using

`?MatchBalance`. The results for both of the variables presented above indicate that the distributions are considerably closer

to each other in the matched sample than in the original dataset.

## 18 comments:

I am a new learner on statistics and I followed here from reading your great book.

I am comparing the average number of emergency visits between the people who enrolled in a program with the people who are not enrolled in.

My covariates include age, gender, sex, Risk score, number of emergency visits before the program, …

I followed your examples for the propensity score and the match. I also did the t-test with unequal variance for the continuous covariates between the two samples before the match and did the t-test with equal variance test for the covariates after the match.

Should I use the paired t-test to test the effectiveness of the program(the average number of emergency visits between the people who enrolled in a program and the people who are not enrolled in)?

Once again, thank you very much!

Unfortunately, we aren't able to give specific advice for your problem, as there's not one right answer as to how to undertake your analysis. I'm a bit fan of Paul Rosenbaum's Design of Observational Studies" book, which may be useful to consult.

Hey guys...just to encourage you to keep writing, I want you to know that I find this site very useful. I recently migrated to R from Stata.

Cheers.

Thanks, Anonymous. Please do send us questions or ideas for entries-- finding suitable and interesting material has been a bigger challenge than we realized.

Your book is fantastic and I love the indepth example driven companion website. This is truly service after the sale! Great job!

Thanks very much for the book and for the blog.

You mention your appreciation of Rosenbaum's _Design of Observational Studies_ In it, he describes the method of "direct adjustment" to address the challenges of summarizing data from matched studies with variable matching rates.

How closely does the use of the WEIGHT statement in SAS approach this method? Does it depend on the PROC?

Thanks!

Hello, about propensity weighted regression, there's a good article on SAS SUGI. It uses PROC SQL and use propensity in weight statement.

Just wanted to share that info.

Also, I wanted to ask a question on proensity matching. As I used macros, I keep getting this message:

ERROR: Variable ID not found.

when i correctly specified ID variable in macros. Does anyone have similar experience like this?

Thank you.

Hello, Dong-- Thanks for that information.

I'm not sure why you're encountering that error. Can you replicate our results here using the macros at the book web site and the data as generated in example 7.34?

Hi,

I found this website is very useful and tried to use the following sas codes you provided in example 7.35, yet, I got an error message in the log. Is there any suggestion that I may be wrong somewhere?

%dist(data=prop2, group=homeless, ..., mergeout=mpropen);

ERROR: The keyword parameter MERGEOUT was not defined with the macro.

Thank you.

Are you using the macro from Mayo, or the modified version I posted? I believe that option was added by me.

Ken,

You are right. I should use your modified version. Now everything is fine. Thank you!

Thank you very much for your work.

I got error because macro vmatch includes "proc assign" using SAS 9.2. Please let me know how to correct your code of vmatch.

Regret I don't have a version that does not use proc assign. You might ask the authors, at the Mayo Clinic, if they have such a version.

I believe proc assign is part of SAS/OR, though it is no longer documented. (http://support.sas.com/kb/42/143.html) So another solution would be to license the SAS/OR software, or find an installation that includes it.

Finally, you could use R to do the matching part of your job-- you can use the proc_r macro (http://sas-and-r.blogspot.com/2012/01/sas-macro-simplifies-sas-and-r.html) to do this smoothly and inside a single SAS program.

Thanks this website is so helpful and easy to understand

In Match(), I suppose p<0.05 means the data is well balanced?

What happens when the treatment is perfectly predictable given the covariates? If I am right, in this case, the propensity scores will be far apart and matching cannot be done?

I would say rather that matching _should_ not be done. (In these cases logistic regression fitting algorithms should provide some feedback, usually using language about complete or quasi-complete separation. They may not however, and I think that some will generate predicted probabilities (propensities) that might rather misleadingly suggest similar propensities.)

A too-casual approach would match regardless of this situation and fail to notice the complete separation. Something to be careful about!

In lay terms, I would explain this phenomenon by saying that some kinds of people choose one treatment, and other kinds of people choose the other, so that there are no cases of people with "the same" probability of choosing a given treatment, some of whom choose one and some the other treatment. Since the propensity score approach is motivated by assuming the actual treatment assignment in such cases is random, the propensity score analysis shouldn't be used in this setting.

Thank you for the SAS macro. Question. The N for the cases increases after matching. I have 31 cases, and after a 3 to 1 match, I now have 93 cases. Do I need to delete these duplicate observations for the cases before performing my analysis? My question really concerns the degrees of freedom and standard error for tests comparing the cases and controls after matching. Thanks in advance!

I was asked to match 35000 cases to 350 000 controls for an A/B study. Is this feasible using SAS propensity score matching or would other methods be more appropriate given the size of the datasets.

Thanks for answering.

Post a Comment