# Building the Evidence Base for the Medical Home: What Sample and Sample Size Do Studies Need?

**AHRQ Publication No.** 11-0100-EF

**Prepared For:** Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services, 540 Gaither Road, Rockville, MD 20850, www.ahrq.gov

**Contract Number:** HHSA290200900019I TO2

**Prepared by:** Mathematica Policy Research, Princeton, NJ; Deborah Peikes, Mathematica Policy Research; Stacy Dale, Mathematica Policy Research; Eric Lundquist, Mathematica Policy Research; Janice Genevro, Agency for Healthcare Research and Quality; and David Meyers, Agency for Healthcare Research and Quality

## Appendix A

### AHRQ Definition of the Medical Home

The medical home model holds promise as a way to improve health care in America by transforming how primary care is organized and delivered. Building on the work of a large and growing community, the Agency for Healthcare Research and Quality (AHRQ) defines a medical home not simply as a place but as a model of the organization of primary care that delivers the core functions of primary health care (Agency for Healthcare Research and Quality).

The medical home encompasses five functions and attributes:

**Patient-centered:** The primary care medical home provides relationship-based primary health care that is oriented toward the “whole person.” Partnering with patients and their families requires understanding and respecting each patient’s unique needs, culture, values, and preferences. The medical home practice actively supports patients in learning to manage and organize their own care at the level the patient chooses. Recognizing that patients and families are core members of the care team, medical home practices ensure that they are fully informed partners in establishing care plans.

**Comprehensive care:** The primary care medical home is accountable for meeting the bulk of each patient’s physical and mental health care needs, including prevention and wellness, acute care, and chronic care. Comprehensive care requires a team of care providers, possibly including physicians, advanced practice nurses, physician assistants, nurses, pharmacists, nutritionists, social workers, educators, and care coordinators. Although some medical home practices may bring together large and diverse teams of care providers to meet the needs of their patients, many others, including smaller practices, will build virtual teams linking themselves and their patients to providers and services in their communities.

**Coordinated care:** The primary care medical home coordinates care across all elements of the broader health care system, including specialty care, hospitals, home health care, and community services and supports. Such coordination is particularly critical during transitions between sites of care, such as when patients are being discharged from the hospital. Medical home practices also excel at building clear and open communication among patients and families, the medical home, and members of the broader care team.

**Superb access to care:** The primary care medical home delivers accessible services with shorter waiting times for urgent needs, enhanced in-person hours, around-the-clock telephone or electronic access to a member of the care team, and alternative methods of communication such as email and telephone care. The medical home practice is responsive to patients’ preferences regarding access.

**A systems-based approach to quality and safety:** The primary care medical home demonstrates a commitment to quality and quality improvement by ongoing engagement in activities such as using evidence-based medicine and clinical decision-support tools to guide shared decisionmaking with patients and families, engaging in performance measurement and improvement, measuring and responding to patient experiences and patient satisfaction, and practicing population health management. Publicly sharing robust quality and safety data and improvement activities is also an important marker of a system-level commitment to quality.

AHRQ recognizes the central role of health IT in successfully operationalizing and implementing the key features of the medical home. In addition, AHRQ notes that building a primary care delivery platform that the Nation can rely on for accessible, affordable, high-quality health care will require significant workforce development and fundamental payment reform. Without these critical elements, the potential of primary care will not be achieved.


## Appendix B

### Calculating Minimum Detectable Effects and Effective Sample Sizes

This appendix describes the factors used to calculate minimum detectable effects (MDEs) in practice-level interventions. Section A describes the general approach for calculating MDEs, Section B explains how to tailor the MDE calculation for a practice-level intervention, and Section C describes how to calculate effective sample sizes after accounting for clustering.

#### A. General Approach for Calculating MDEs

In general, the MDE of a research design depends on several factors:

- The policymaker’s comfort level with the chance of erroneously concluding the medical home works when it really does not (the false positive rate, called the significance level, the Type I error rate, or α).
- The comfort level with incorrectly concluding the medical home does not work when it does (the false negative rate, or Type II error rate, called β, where (1-β) is the power level).
- The number of degrees of freedom (denoted as *df*), a measure of the number of independent pieces of information that are analyzed in estimating the effects of an intervention on outcomes, in the statistical model used to estimate the intervention impact.
- The standard error of the impact estimate (*SE*), the extent to which sample impact estimates could vary from the true program impact over repeated samples.
- The coefficient of variation (*CV*), a measure of the variation, or noise, in the outcome measure, defined as the standard deviation (σ) of the outcome measure divided by the mean (μ).

The MDE of an experimental design, expressed in terms of percentage changes in the outcome measure,^{12} can be written generally as:

$$\mathrm{MDE} = \frac{M \times SE}{\mu} \qquad \text{(Equation 1)}$$

In this equation, M is the standard error multiplier, a constant that depends on the study’s chosen false positive rate or significance level, the statistical power level (1-β), and the number of degrees of freedom. Researchers commonly set the significance level to 5 or 10 percent and the power level to 80 percent. A statistical power level of 80 percent implies that a study will fail to detect a true effect of a given size (and commit a Type II, or false negative, error) with a 20 percent probability. While 80 percent power and a 5 percent or 10 percent significance level are conventional, the chosen statistical power and significance level of a research design should be determined by the relative discomfort with making Type I and Type II errors in each specific context. Researchers sometimes relax statistical significance requirements in health care service interventions delivered at the practice or clinic level where adequate power is difficult to achieve and a Type I error would not be as problematic as in, for example, certain types of clinical trials.
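The multiplier M implied by any combination of significance level, power, and degrees of freedom can be computed directly rather than read from a reference table. The Python sketch below (function names are ours; it uses only the standard library, numerically integrating and then inverting the t-distribution) reproduces the commonly cited values:

```python
import math

def t_cdf(x, df, steps=2000):
    """Cumulative density of Student's t-distribution, computed by
    integrating the density with the trapezoid rule."""
    if x < 0:
        return 1 - t_cdf(-x, df, steps)
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    def density(t):
        return (1 + t * t / df) ** (-(df + 1) / 2)
    h = x / steps
    area = 0.5 * (density(0.0) + density(x)) + sum(density(i * h) for i in range(1, steps))
    return 0.5 + c * area * h

def t_critical(p, df):
    """Value at which the t-distribution's cumulative density equals p
    (for p > 0.5), found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def multiplier(alpha, beta, df):
    """Standard error multiplier M = T(alpha/2) + T(beta) for a two-tailed test."""
    return t_critical(1 - alpha / 2, df) + t_critical(1 - beta, df)

# With roughly 30 degrees of freedom:
print(round(multiplier(0.10, 0.20, 30), 2))  # close to the 2.5 rule of thumb
print(round(multiplier(0.05, 0.20, 30), 2))  # close to the 2.8 rule of thumb
```

With 23 degrees of freedom, as in the worked example later in this appendix, `multiplier(0.10, 0.20, 23)` returns approximately 2.57.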

#### B. Tailoring MDE Calculations for Practice-Level Interventions

The medical home model is a practice-level intervention that involves changing the way all patients within a practice or clinic are served. As a result, studies generally designate a number of sites that will receive the intervention, and then select analogous units for the comparison group. When evaluating these interventions, researchers need to assess the extent of clustering in the data to accurately calculate the standard errors and p-values, which determine whether a finding is statistically significant. In common medical home interventions that target entire practices, the study design must account for clustering at the practice level, or the extent to which individual outcomes are correlated within practices. If alternative implementation models target different levels of organization, such as practice sites or individual clinicians or teams, clustering must be assessed at the same logical level.^{13} It is important to design studies in which the intervention and comparison groups are constructed at the same level as the implementation model, to prevent cross-contamination or spillover effects (Bloom, 2005). Contamination might occur if some clinicians within a practice were in an evaluation’s intervention group and some were in the comparison group, and the intervention clinicians shared with comparison clinicians ideas about ways practice patterns might be changed.

In many common study designs, such as the medical home, entire practices are selected into intervention and comparison groups, and mean patient outcomes are compared after the intervention begins. In this type of study, with one level of clustering (that is, patients are clustered within practices), MDEs can be expressed as:^{14}

$$\mathrm{MDE} = M \times CV \times \sqrt{\frac{ICC\,(1 - R_G^2) + \dfrac{(1 - ICC)(1 - R_n^2)}{n}}{P\,(1 - P)\,G}} \qquad \text{(Equation 2)}$$

Equation 2 illustrates how several facets of the research design influence the MDEs in clustered evaluations like the medical home. It is important to understand the benefits and drawbacks of manipulating these different study elements, especially relative to the costs of or savings from implementing associated changes. Below, we describe each factor in the equation and how it affects the MDE.

**Coefficient of Variation (CV).** As noted above, the CV is a measure of the variation, or noise, in the outcome measure (standard deviation divided by mean). Ideally, the CV will be low, because this leads to a smaller MDE. Intuitively, a lower CV means there is less random variation (or “noise”) in the outcome measure, which makes it easier to attribute differences in outcomes between intervention and comparison groups to the intervention itself; conversely, if the CV is high, it is difficult to distinguish the effect of the intervention from noise in the outcome.

**The Standard Error Multiplier (M).** This is a constant determined by the chosen statistical significance and power level of the study (as discussed above), as well as the regression model’s degrees of freedom.^{15} The number of degrees of freedom is defined as the number of independent pieces of information or observations that go into the calculation of a given statistic, minus the number of intermediate parameters used in the model. In clustered studies, individual observations are not independent of one another but rather correlated within groups (practices), so the number of independent observations is conservatively defined as the total number of study practices rather than the total number of study patients. Therefore, the number of degrees of freedom in clustered studies where intervention and control means are compared to assess program impacts is equal to the total number of study practices minus the number of practice-level covariates included in the impact estimation model, minus 2. Commonly used rule-of-thumb multipliers are 2.5 (two-tailed test; 10 percent significance; 80 percent power; at least roughly 30 degrees of freedom) or 2.8 (two-tailed test; 5 percent significance; 80 percent power; at least roughly 30 degrees of freedom). Increasing the degrees of freedom, as well as relaxing power and significance requirements, will lead to decreases in the study’s MDE.

**The Intracluster Correlation Coefficient (ICC).** The ICC is a measure of clustering within groups, such as practices. It is defined as the ratio of group-level outcome variance to total outcome variance. The MDE of a study will decrease as the ICC falls. If patients within practices tend to have similar outcomes and average patient outcomes tend to differ across practices, then the ICC will be high, and it will be difficult to tell whether outcome differences are due to the intervention or are simply the result of which practices were included in the intervention and comparison groups. Conversely, if patient outcomes are similar across practices, then the ICC (and therefore the MDE) will be relatively low. Patient-level data, with patients attributed to practices, are needed to estimate the ICCs for each study outcome measure (see Appendix F for sample code for calculating ICCs).

**The Number of Intervention and Comparison Practices or Groups (G).** The MDE will decrease as the number of intervention and comparison practices rises. Adding more providers to the study sample increases the likelihood of intervention and comparison groups that are similar before the intervention begins. This makes it easier to attribute outcome differences between intervention and comparison groups to the intervention as opposed to noise in the data. Adding practices is the most effective way to lower the MDE.

**Number of Patients per Practice (n).** The MDE will decrease as the number of patients per practice rises, but with diminishing returns. Increasing the number of study practices will improve precision more than adding additional patients per practice, even if the same total number of patients is included. In other words, it is better to have 20 practices of 1,000 patients each than 5 practices of 4,000 patients each.

**Proportion of Outcome Variance Explained by Regression Control Variables (R²n and R²G ).** Including control variables in regression models used to estimate impacts can lower the MDE of a study because the control variables will help explain some of the variation in the outcome measure. While it is important to include control variables that reduce unexplained individual (patient)-level variance (that is, increase R²n), far greater gains in precision can generally be achieved in practice-level interventions by including control variables that reduce unexplained group (practice)-level variance (increase R²G). Pre-intervention measures of outcome variables and practice-level covariates such as practice size and patient demographics often serve as the best control variables if available.

**Proportion of Practices Allocated to the Intervention Group (P).** Standard errors depend on intervention and comparison sample sizes, and MDEs are minimized for a given sample size when an equal number of practices are assigned to the intervention and comparison groups. However, in contexts where the costs of adding intervention practices are high relative to adding more comparison practices, increasing the relative size of the comparison group to a certain extent may still lead to important gains in precision.^{16}

As an example of how to calculate MDEs based on the equation described above, assume that a study randomizes 30 practices assigned in equal proportions to the intervention or control group, and that each practice serves 2,000 patients on average. Further assume that (1) based on background research, the researchers estimate the CV of hospitalizations will be 2.0 and expect that the regression model used to estimate impacts will explain 15 percent of group-level variance with five group-level control variables (that is, R²G is equal to 0.15), and (2) the ICC for the outcome is 0.03. The MDE of this research design in a two-tailed test at the 90 percent confidence level (α=0.10) with 80 percent statistical power (β=0.20) would be 30.3 percent.^{17} In other words, the study could detect a true intervention effect equal to 30.3 percent of the control group mean with a probability of 80 percent. To make the example even more concrete, if costs were the outcome measure, and the average patient cost per month were $1,000, this study would detect an effect (a reduction or an increase) of $303 (30.3 percent of $1,000) most (80 percent) of the time. The study would be less likely to detect effects smaller than 30.3 percent.
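The arithmetic in this example is easy to verify. The short Python sketch below (variable names are ours) plugs the stated assumptions into the clustered MDE formula from Bloom (2005):

```python
import math

# Assumptions from the worked example in the text
G = 30        # total number of practices (intervention + control)
n = 2000      # average patients per practice
P = 0.5       # proportion of practices assigned to the intervention group
CV = 2.0      # coefficient of variation of the outcome (hospitalizations)
ICC = 0.03    # intracluster correlation coefficient
R2_G = 0.15   # share of group-level variance explained by covariates
R2_n = 0.0    # share of patient-level variance explained (none assumed)
M = 2.57      # standard error multiplier (10% significance, 80% power, 23 df)

# Clustered MDE as a proportion of the control group mean (Bloom 2005)
variance_term = (ICC * (1 - R2_G) + (1 - ICC) * (1 - R2_n) / n) / (P * (1 - P) * G)
mde = M * CV * math.sqrt(variance_term)
print(f"MDE = {mde:.1%}")  # 30.3%, matching the text

# With average costs of $1,000 per patient per month:
print(f"Detectable effect: ${1000 * mde:.0f} per patient per month")  # about $303
```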

As shown in the example above, MDEs in clustered studies can be quite large, and it is unlikely that an intervention could generate an effect of this size. This means that methodological decisions, especially those related to sample sizes, outcome measures, and study populations, become extremely important in ensuring that the research design has adequate statistical precision to generate useful findings.

#### C. Effective Sample Sizes in Clustered Designs

Another way of thinking about clustering is in terms of how it changes the effective sample size. The effective sample size is determined by the actual sample size, the ICC, and the average number of patients per cluster (n), according to the following formula:

$$\text{Effective sample size} = \frac{\text{Actual sample size}}{1 + ICC\,(n - 1)}$$

If there is no clustering (that is, the ICC is equal to zero), the effective sample size is equal to the actual number of patients in the evaluation. Suppose there were 20,000 patients in total, spread across 20 practices (or 1,000 patients per practice). With no clustering, the effective sample equals the actual sample, 20,000. If there is maximum clustering (that is, the ICC is equal to 1), the effective sample size is 20,000/[1+1*(1,000-1)] = 20, which is the number of practices—that is, the study effectively has only 20 unique observations. In practice, the amount of clustering tends to be closer to 0 than to 1. Suppose the ICC is 0.01; the effective sample size would then be 20,000/[1+0.01*(1000-1)], or 20,000/10.99, or 1,819.8. That is, this clustered sample of size 20,000 has the equivalent power of a simple random sample of size 1,819.8. If the clustering is 0.1, the effective sample size falls to 198.2.
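These effective sample size calculations can be reproduced with a few lines of Python (the function name is ours):

```python
def effective_sample_size(total_patients, patients_per_practice, icc):
    """Deflate the actual sample size by the design effect 1 + ICC * (n - 1)."""
    design_effect = 1 + icc * (patients_per_practice - 1)
    return total_patients / design_effect

# 20,000 patients spread across 20 practices (1,000 patients per practice)
print(effective_sample_size(20000, 1000, 0.0))             # 20000.0: no clustering
print(effective_sample_size(20000, 1000, 1.0))             # 20.0: one observation per practice
print(round(effective_sample_size(20000, 1000, 0.01), 1))  # 1819.8
print(round(effective_sample_size(20000, 1000, 0.10), 1))  # 198.2
```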


## Appendix C

### Explanation of Figure 1 on False Positive Rates When Clustering Is Ignored

Figure 1 shows why studies need to take clustering into account. Decisionmakers are typically comfortable with some possibility of a false positive. In this situation, a false positive would be concluding that the medical home works when in fact it does not. By convention, we typically allow a 5 percent chance of a false positive (referred to as the alpha (α), Type I error rate, or significance level). In this graph, the flat horizontal line shows a 10 percent rate, indicating that decisionmakers would be willing to accept a higher (10 percent) false positive rate. But if the data are clustered—that is, the distribution of patient outcomes in one practice differs from the distribution in another—the chance of a false positive rises to levels decisionmakers will likely be uncomfortable with. A moderate level of clustering, if not accounted for, can lead to a 60 percent false positive rate, and if there is heavy clustering, the false positive rate could grow to 75 percent or more.
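The qualitative pattern in Figure 1 can be illustrated by simulation. The sketch below (standard library only; the practice counts, cluster sizes, and ICC are our illustrative assumptions, not the values underlying Figure 1) generates clustered data with no true intervention effect and applies a naive patient-level test that ignores clustering; the naive test rejects the true null far more often than the nominal 5 percent rate:

```python
import math
import random

random.seed(12345)

def false_positive_rate(n_sims=500, practices_per_arm=10,
                        patients_per_practice=50, icc=0.10):
    """Share of simulated studies in which a naive patient-level z-test
    rejects a true null hypothesis of no effect at the nominal 5% level."""
    sigma_between = math.sqrt(icc)       # practice-level standard deviation
    sigma_within = math.sqrt(1 - icc)    # patient-level standard deviation
    rejections = 0
    for _ in range(n_sims):
        arms = []
        for _arm in range(2):
            outcomes = []
            for _practice in range(practices_per_arm):
                practice_effect = random.gauss(0, sigma_between)
                outcomes.extend(practice_effect + random.gauss(0, sigma_within)
                                for _ in range(patients_per_practice))
            arms.append(outcomes)
        n = len(arms[0])
        means = [sum(a) / n for a in arms]
        variances = [sum((x - m) ** 2 for x in a) / (n - 1)
                     for a, m in zip(arms, means)]
        # Naive test statistic: treats all patients as independent observations
        z = (means[1] - means[0]) / math.sqrt(variances[0] / n + variances[1] / n)
        if abs(z) > 1.96:
            rejections += 1
    return rejections / n_sims

rate = false_positive_rate()
print(f"Naive test rejects a true null about {rate:.0%} of the time")
```

With these assumptions the rejection rate lands far above 5 percent, echoing the figure's message that unadjusted clustered analyses badly overstate statistical significance.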


## Appendix D

### Sample Effect Sizes Found in Targeted Literature Review

We conducted a targeted literature review, examining selected health care interventions that were intended to improve quality of care and reduce health care costs, such as medical home or disease-management evaluations, to estimate plausible effect sizes. Table D.1 summarizes our findings.

| Study (Author) | Population | Gross Savings (Without Costs of Intervention Unless Stated) | Reduced Hospitalizations |
|---|---|---|---|
| Group Health (Reid et al. 2010; 2009) | All, Privately Insured | Including intervention costs, 2% | 6% |
| Geisinger (Gilfillan et al. 2010) | All, Medicare | NDE | 18% |
| Geriatric Resources for Assessment and Care of Elders (GRACE) (Counsell et al. 2007) | All, Medicare | Including intervention costs, increased costs 28% in year 1, 14% in year 2, cost neutral in year 3 | NDE in years 1 and 2 |
| GRACE (Counsell et al. 2007) | Chronically Ill, Medicare | Cost neutral in years 1 and 2, reduced costs 23% in year 3 | NDE in year 1, 44% in year 2 |
| Guided Care (Leff et al. 2009; Boyd et al. 2010) | Chronically Ill, Medicare | NDE | NDE |
| Medicare Care Management for High Cost Beneficiaries (CMHCB) (McCall et al. 2010) | Chronically Ill, Medicare | 12.1% | NA |
| Medicare Coordinated Care Demonstration (Peikes et al. 2011) | Chronically Ill, Medicare | 5.7% | 10.7% |

Note: Differences are statistically significant at the 10 percent level. NA = not available; NDE = no detectable effect.


## Appendix E

### Inputs from the Literature for Calculating MDEs

To help inform assumptions underlying MDE calculations, we compiled CVs and ICCs from 18 published and unpublished studies. Within the health care literature, our search terms for ICCs included “practice-level interventions,” “clustered designs,” “intracluster correlation,” and “intraclass correlation.” We found that only a limited number of practice-level studies mentioned adjusting for clustering, and only a subset of those studies actually reported their ICCs. This infrequent reporting of ICCs is consistent with reviews of the health care literature that found that only about 20 percent of clustered randomized trials took clustering into account in calculating the study sample size needed to have adequate statistical power, and only about half accounted for clustering in the analysis (see, for example, Bland, 2004). Moreover, because health insurance cost data are often proprietary and many studies use health utilization measures (rather than costs) as outcome measures, only a handful of studies reported ICCs for costs. Because ICCs were rarely reported, we asked the lead authors of many practice-level studies to provide their ICCs; some of the ICCs in our tables are based on personal communications with the study authors and are not reported in the published papers or reports.

We have more CV estimates than ICC estimates, partly because we drew CVs from a broader set of literature. For example, CVs are often reported in studies on risk adjustment. Moreover, we did not limit our search for CVs to practice-level interventions, but included studies conducted at the patient or physician level. Therefore, we were able to report the CV estimates (but not the ICC estimates) for different patient populations, such as the chronically ill. Based on the ICC estimates we did obtain, it appears that ICCs do not systematically differ according to whether the study population included all patients or was limited to the chronically ill, but this should be confirmed using each study’s own data.

#### Ranges for Coefficient of Variation

Outcomes are more varied (and hence MDEs are bigger) when measured among all patients than among chronically ill patients. This occurs because all patients include both high-risk (or chronically ill) and low-risk patients, so outcomes (such as costs or hospitalizations) may range from zero to very high. As shown in Table E.1, costs are highly variable when measured for all patients, with CVs ranging from 2.17 to 5.17 within studies based on private payer (insurer) data for the general population. When the population is limited to those with chronic conditions or to the Medicare population, CVs become smaller. For example, for studies based on private payer data, CVs for patients with chronic conditions range from 1.8 to 2.74, depending on how narrowly *chronic condition* is defined. When the sample is limited to the general Medicare population, the average CV is about 2.46; the CV falls to about 1.85 when the Medicare population is further restricted to those with chronic illness.

CVs vary by outcome. The CVs for number of hospitalizations and number of emergency room visits are similar to those for costs, ranging from 2.43 to 6.0 for privately insured general populations, and from about 2 to 3 when restricted to the chronically ill or Medicare populations. However, the CVs for number of hospital days tend to be so large (greater than 5 for all populations except the Medicare chronically ill) that it is unlikely that most studies will be able to reliably detect effects on the number of hospital days. A binary variable’s variance is largest when its mean is 50 percent, at which point its CV equals 1. Because study designers often adopt the conservative assumption that a binary variable will have a mean of 50 percent (and therefore a CV of 1), we did not systematically review the literature for the means of binary variables.

#### Range for ICCs

Because ICCs vary by study, and because so few ICCs are published in the literature, researchers are encouraged to use pre-intervention data from the practices in their sample to calculate ICCs for their planned analyses. (Sample code is in Appendix F.)

The degree of clustering appears to vary by outcome measure. Reported ICCs for health care costs are relatively low, ranging from 0.020 to 0.031 (Table E.3). Similarly, ICCs for health care service use (including number of emergency room visits and number of hospitalizations) range from 0.013 to 0.040. While ICCs for general satisfaction with health care tend to be low (ranging from 0.016 to 0.022), ICCs for specific satisfaction measures (access to care, coordination of care, etc.) tend to be much higher (ranging from 0.054 to 0.163). ICCs for quality-of-care process measures are also high (averaging 0.120) and have a wide range, from 0.058 to 0.25.

| Study | Definition of Chronically Ill |
|---|---|
| Dale and Lundquist 2011 | Medicare beneficiaries with coronary artery disease, chronic heart failure, diabetes, Alzheimer’s disease, or other mental, psychiatric, or neurological disorders; any chronic cardiac/circulatory disease, such as arteriosclerosis, myocardial infarction, or angina pectoris/stroke; any cancer; arthritis and osteoporosis; kidney disease; and lung disease, according to Medicare claims data |
| Goetzel et al. 2010 | Obese adults (BMI greater than 30) |
| Leff et al. 2009 | Medicare beneficiaries aged 65 or older in the top quartile of risk of using health services heavily during the following year (hierarchical condition category [HCC] score of 1.2 or higher) |
| Littenberg and MacLean 2006 | Adults with diabetes |
| McCall et al. 2010 | Medicare beneficiaries with HCC scores greater than or equal to 2.0 and annual costs of at least $2,000, or HCC risk scores greater than or equal to 3.0 and a minimum of $1,000 in annual medical costs (in 2005) |
| Ozminkowski et al. 2000 | Individuals with malignant neoplasm, stroke, heart failure, psychiatric conditions, diabetes, arthritis, seizures, COPD, asthma, or ulcerative colitis, according to private insurance claims data |
| Peikes et al. 2011 | Medicare beneficiaries who met each program’s eligibility conditions and also had either (1) CAD, CHF, or COPD and a hospitalization in the prior year, or (2) two or more hospitalizations in the prior 2 years |
| Philipson et al. 2010 | Adults with heart disease |

| Outcome Measure | ICC | Study | Population |
|---|---|---|---|
| Costs | 0.021 | Campbell MK et al. 2001 | Patients with urology problems |
| Costs | 0.031 | Dale and Lundquist 2011^{e} | Chronically ill Medicare |
| Hospitalizations | 0.025 | Dale and Lundquist 2011^{e} | Chronically ill Medicare |
| Hospitalizations | 0.014 | Huang et al. 2003 | Asthma patients with managed care |
| Hospitalizations | 0.030 | Leff et al. 2009^{e} | High-risk Medicare |
| ER Visits | 0.020 | Dale and Lundquist 2011^{e} | Chronically ill Medicare |
| ER Visits | 0.040 | Huang et al. 2003 | Asthma patients with managed care |
| ER Visits | 0.013 | Leff et al. 2009^{e} | High-risk Medicare |
| ER Visits | 0.015 | Littenberg and MacLean 2006 | Adults with diabetes |
| Satisfaction With Overall Care | 0.022 | Campbell SM et al. 2001 | All patients |
| Satisfaction With Overall Care | 0.019 | Dale and Lundquist 2011^{e} | Chronically ill Medicare |
| Satisfaction With Overall Care | 0.016 | Potiriadis et al. 2008 | All patients |
| Satisfaction With Access to Care | 0.079 | Campbell SM et al. 2001 | All patients |
| Satisfaction With Access to Care | 0.053 | Dale and Lundquist 2011^{e},^{f} | Chronically ill Medicare |
| Satisfaction With Access to Care | 0.163 | Potiriadis et al. 2008 | All patients |
| Quality-of-Care Process Measures | 0.186 | Campbell SM et al. 2001 | All patients |
| Quality-of-Care Process Measures | 0.069 | Dale and Lundquist 2011^{e},^{g} | Chronically ill Medicare |
| Quality-of-Care Process Measures | 0.058 | Littenberg and MacLean 2006 | Diabetes patients |


## Appendix F

### Sample Code to Calculate the ICC

The ICC of a given outcome measure can be estimated using one of several computerized statistical packages that support Analysis of Variance (ANOVA) and/or General Linear Mixed Model (GLMM) commands.^{18} Prior to analysis, data should be organized at the patient level so that each patient in the study sample has one record. That record should contain a practice identifier variable (called *practice_id* below), so that each patient outcome (called *outcome_var*) can be linked to the practice with which that patient was associated during the study.

### SAS Approach

The SAS approach fits a one-way ANOVA of *outcome_var* on the class variable *practice_id*; the ICC is then calculated from the between-practice and within-practice mean squares reported in the ANOVA output.^{19}

### Stata Approach

The Stata approach fits the same one-way ANOVA of *outcome_var* on *practice_id* and reports the ICC directly in the console output (0.02626 in the original worked example).
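For readers working outside SAS or Stata, the ANOVA estimator of the ICC described by Donner (1986) can be sketched in a few lines of Python (the function name and toy data are ours; the sketch assumes equal-sized practices):

```python
def icc_anova(groups):
    """One-way ANOVA estimate of the ICC for equal-sized groups (practices).

    ICC = (MSB - MSW) / (MSB + (n - 1) * MSW), where MSB and MSW are the
    between-group and within-group mean squares and n is the group size.
    """
    k = len(groups)               # number of practices
    n = len(groups[0])            # patients per practice (assumed equal)
    N = k * n
    grand_mean = sum(sum(g) for g in groups) / N
    group_means = [sum(g) / n for g in groups]
    msb = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g) / (N - k)
    return (msb - msw) / (msb + (n - 1) * msw)

# Toy data: outcomes entirely determined by practice membership -> ICC of 1
print(icc_anova([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))  # 1.0
```

For practices of unequal size, n is replaced by an adjusted average cluster size; see Donner (1986) for details.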


## Footnotes

12. Equation 1 presents the MDE in percentage terms to facilitate comparisons across different study outcome measures.
13. More generally, researchers should consider adjusting for clustering (by incorporating clustering terms into variance expressions for the impact estimate) at any level of the study where either intervention-comparison assignment or sampling occurs. For example, if an evaluation samples cities, and then practices within cities, the study should account for clustering at both the city and the practice level.
14. This formula would have to be modified to account for multiple levels of clustering (for example, if a researcher wanted to evaluate patient-level outcomes, but patients were clustered within physicians and physicians within practices). See Bloom (2005) for a derivation of this formula.
15. For a two-tailed t-test, the multiplier can be expressed mathematically as M = (Tα/2 + Tβ), where Tα/2 and Tβ are critical values at which the t-distribution (with the associated model degrees of freedom) has a cumulative density of (1-α/2) and (1-β), respectively.
16. Bloom (2005) provides a broader discussion of the costs and benefits of unbalanced samples.
17. The standard error multiplier (M = Tα/2 + Tβ) in this example can be calculated by looking up the critical t-values that correspond to upper-tail probabilities of 0.05 (α/2) and 0.20 (β) on a t-distribution reference table with 23 (number of clusters - cluster-level covariates - 2) degrees of freedom, or by plugging 0.05 and 0.20 into an inverse t-tail function such as invttail() in Stata. In this example, the critical t-values are Tα/2 = 1.71 and Tβ = 0.86; the multiplier M is their sum: 1.71 + 0.86 = 2.57.
18. The PROC MIXED command can be used to estimate the ICC using the GLMM in SAS; the xtmixed command can be used to estimate the ICC using the GLMM in Stata. By default, both commands use restricted maximum likelihood algorithms to estimate the between-practice and within-practice outcome variance components.
19. Donner (1986) gives a thorough discussion of how to calculate the ICC using the mean square results from the ANOVA model.
