POPULATIONS, SAMPLES, AND BUILDING BRIDGES BETWEEN THEM IN EPIDEMIOLOGICAL STUDIES

W. Kalsbeek and G. Heiss

Departments of Biostatistics and Epidemiology, respectively, School of Public Health, University of North Carolina, Chapel Hill, NC 27599-2400.

KEY WORDS: sample design, statistical inference, sample weights, analysis of data from complex samples

Shortened Title: POPULATIONS, SAMPLES, AND BRIDGES

Publication:

Kalsbeek, W and Heiss, G. (2000). "Building Bridges Between Populations and Samples in Epidemiological Studies," Annual Review of Public Health, 21:1-23.

CONTENTS

POPULATION SAMPLING IN EPIDEMIOLOGY

ISSUE 1: WHICH APPROACH TO STATISTICAL INFERENCE?

Design-Based Approach

Model-Based Approach

ISSUE 2: HOW SHOULD THE SAMPLE BE CHOSEN?

Information Gathering

Cluster Sampling?

Stratified Sampling?

Which Randomization Device?

How Large a Sample?

ISSUE 3: WHAT ABOUT SAMPLE INTEGRITY?

Frame Problems

Nonresponse

Sample Weights

ISSUE 4: WHICH SAMPLING FEATURES ARE IMPORTANT IN ANALYSIS?

CONCLUDING REMARKS

ABSTRACT

The increased use of rigorous population sampling methods and the analysis of data from those samples in cross-sectional surveys, case-control studies, longitudinal cohort investigations, and other epidemiological research efforts has raised important statistical issues for health analysts to consider. Using intuitive reasoning and a variety of empirical results from well-known data sources, this paper describes the origin, implications, and some plausible resolutions for several of these issues. Some of the main issues we consider include: establishing whom the sample represents, using sample weights, understanding the role of other important features such as the use of sampling stratification and the selection of clustered groups of population members, as well as finding ways to analyze study data with key sampling features in mind. Ultimately, resolution to all of these issues requires that analysts clearly define a reference population and then understand the role of design features in relating sample results to that population.

POPULATION SAMPLING IN EPIDEMIOLOGY

Most empirical knowledge throughout history has been based on incomplete observation and therefore samplings of the human experience (43). In each such case there has been a population, some collection of persons or objects for which knowledge was sought, and a sample, a portion of the population to be observed, thus providing the informational basis for acquiring knowledge about the population.

The need for sampling in epidemiological research stems from the nature of several overlapping types of research designs commonly used in the field (17,37,44). For each of these designs one makes statements about a targeted group of individuals called the study population, based on observations obtained from a representative portion of the population when it is impractical to examine the entire population.

Who or what is sampled, and how randomization is applied in choosing the sample, may differ among these research designs, depending on how the study population is best sampled to produce the estimates required by the design. For instance, the design of a cross-sectional study to profile the extent of disease or exposure to disease within a relatively confined timeframe, or of a field trial intending to investigate the efficacy of health intervention strategies, may call for gathering data from a sample of persons identified by selecting the communities, neighborhoods, or structures in which they live. Some of the data in cross-sectional studies and field trials may be best collected from medical or insurance records. Since the health-related interventions being evaluated in community intervention field trials are applied at the community level, sampling of subjects may proceed as above but including randomization of clusters of subjects at the community level. Sample ascertainment in case-control studies, where the research goal typically is to assess the role of a suspected exposure, often involves a mixture of random and nonrandom sampling. Identification of "cases" with the relevant medical condition is generally done through a purposive sampling of health care providers. On the other hand, the group of non-diseased "controls" is often randomly chosen from the set of non-cases in a source population using similar general population sampling methods to those used in cross-sectional or field studies, with explicit efforts made to achieve comparability to the set of cases, individually or as a group (4, pp. 108-113). 
Samples to prospectively observe the same set of study subjects chosen from a heterogeneous population or a series of targeted population subgroups in cohort studies may utilize similar methods of sampling, though unlike studies done at a single point in time, the statistical integrity of the initial cohort sample can diminish over time with attrition in the sample. For all of these research designs the population about which one makes statements of finding may not be the same as the population sampled. To make the leap of inference from the sampled population to some other population requires a justifiable basis (15).

While the use of samples in epidemiologic research has evident advantages in reducing the cost of data gathering, it is important to understand both the statistical and practical implications before using it. Sampling is particularly advantageous in studies where the population is large and/or scattered, and resources are scarce. Studies based on well-constructed samples can indeed be done for a fraction of the cost of a complete enumeration of the population, although savings do not equate to the percent of the population that is not sampled, because of the cost of sampling. Related to its cost saving, sampling also enables the investigator to concentrate resources on a smaller group, and thus increase study validity since greater effort can be expended on achieving higher participation rates and increased quality.

Some drawbacks associated with sampling derive from the fact that one gathers data from a fraction of the study population about which we hope to learn, leading to error in the estimates due to the absence of knowledge about those not sampled. This error, arising out of the difference between the estimate from the sample and the population characteristic being estimated, creates a statistical uncertainty one strives to minimize. Reduction of this sampling error is accomplished through prudent decision making in developing the sample design, a bit of good fortune in choosing which population members are randomly selected, and the use of a plausible approach to learn about the population from the sample.

This paper considers several important statistical issues that arise in the course of designing and analyzing data from samples in epidemiological studies. In so doing, we briefly trace the origins of the two main philosophies of statistical inference in population-based studies. The reader is referred to several excellent historical reviews of sampling and inference from sample data (29, 36,42,43). We also examine the process of deciding which features to include in creating a statistically appropriate sample of the population. Ways in which the statistical integrity of the chosen sample can be compromised are then noted, and we point out those features of the sample that are important in analyzing the sample data. In general we see that resolution to these issues is rarely clear-cut, thus requiring a balancing of the relative merits of alternatives.

ISSUE 1: WHICH APPROACH TO STATISTICAL INFERENCE?

In his dictionary of epidemiology, Last defines inference in statistics as "the development of generalization from sample data, usually with calculated degrees of uncertainty" (21, p. 65). This definition implies several things about the nature of statistical inference in epidemiological research. First, the concept of "generalization" suggests that there is an object to the inference: a population of some form to which the statistical statements from the sample apply. Another implication of this definition is that the sample data are the basis for any statements and that some philosophical framework is needed to accomplish the task. In a sense, the inference mechanism can be viewed as a "bridge" spanning the statistical chasm between information obtainable from the sample and information sought about the population. Finally, one notes that statistical "uncertainty" accompanies any statement made about the population from a sample, but that its level can often be quantified, thus providing a tangible indication of the sturdiness of the inferential bridge.

Design-Based Approach

The first recorded efforts to learn about populations from samples trace back over 200 years to attempts to estimate the population of France from a complete enumeration of a strategically chosen subset of geo-political districts (27). Stephan’s (43) informative account of the early history of sampling further notes that the use of randomization in choosing samples (i.e., probability sampling) first appeared in the latter part of the 19th century as an eventually controversial alternative to the purposive (nonrandom) methods of selection that had been the accepted norm. An increased need for sample surveys during the economic and political turmoil of the 1930s provided the impetus to the further development of a design-based approach to estimation, or making statistical inference from samples where strategically chosen forms of randomization are applied in choosing the sample (Figure 1a).

The design-based approach produces statements about the population that are largely based on specifically how randomization was applied to the sample (circled I) whose data are used to make the statements of statistical inference (circled II). Much of the early theory of sampling methods following this approach was developed by statisticians at the U.S. Bureau of the Census, motivated by earlier work by Cochran and Neyman, and then incorporated into the first major sampling texts (see, e.g., 5,11,18). It has been the prevailing approach to estimation from samples for over 50 years, as described in more recent texts (24,26,39).

When a sample is used to produce an estimate (c) of some characteristic (C) of the sampled population (e.g., a prevalence rate, the mean of a measurement, or a coefficient in some regression model), uncertainty in inference following a design-based approach is measured by the estimate’s mean squared error (MSE). For a basic understanding of the origin and meaning of an MSE, first note that the presumed reason for choosing a sample is to estimate C using an estimation strategy that produces an estimate (c). Also note that a design employing probability sampling can produce many different samples, and thus values of c, each with a corresponding statistical probability which in principle at least can be determined. Considering only the effect of randomized selection in choosing the sample, and noting that the difference between an estimate and the characteristic being estimated, (c - C), is the estimate’s sampling error, the expected value of the square of the sampling error among all possible samples (and estimates) is called the mean squared error of c; i.e.,

$$\mathrm{MSE}(c) = E\left[(c - C)^2\right] = \mathrm{Var}(c) + \left[\mathrm{Bias}(c)\right]^2, \qquad (1)$$

which as seen above is defined by the variance, Var(c), and bias, Bias(c) = E(c) - C, of c among all possible samples. When considering the combined effects of randomness from sampling and other sources (e.g., measurement error, nonresponse, etc.), the formula for MSE(c) becomes more complicated, although it is still dependent on any variances and biases linked to these random sources (23).
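The decomposition in Equation 1 can be checked numerically for a small population. The sketch below is an illustration only and is not part of the original analysis: it enumerates every simple random sample of size 2 from a hypothetical five-member population and compares the mean squared error with the sum of variance and squared bias, first for the sample mean and then for a deliberately biased, hypothetical "shrunken" mean. All function names and data values are assumptions made for the example.

```python
from itertools import combinations


def mse_decomposition(population, n, estimator):
    """Return (mse, variance, bias) of an estimator of the population mean,
    taken over all equally likely simple random samples of size n."""
    true_value = sum(population) / len(population)  # C
    estimates = [estimator(s) for s in combinations(population, n)]  # every c
    expected = sum(estimates) / len(estimates)  # E(c)
    variance = sum((c - expected) ** 2 for c in estimates) / len(estimates)
    bias = expected - true_value
    mse = sum((c - true_value) ** 2 for c in estimates) / len(estimates)
    return mse, variance, bias


def sample_mean(sample):
    return sum(sample) / len(sample)


def shrunken_mean(sample):
    # Hypothetical biased estimator, used only to make Bias(c) nonzero.
    return 0.9 * sample_mean(sample)


population = [2, 4, 6, 8, 10]  # hypothetical measurements

for estimator in (sample_mean, shrunken_mean):
    mse, variance, bias = mse_decomposition(population, 2, estimator)
    print(estimator.__name__, round(mse, 4), round(variance, 4), round(bias, 4))
```

In this toy case the shrunken mean has a smaller variance than the sample mean but a nonzero bias, so both components have to be weighed when comparing estimation strategies.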

Implicit in the formulation of MSE(c) are two fundamental statistical qualities of the sample design one considers in evaluating the ability of the resulting sample to produce estimates of C. One is validity. A valid sample is achieved by using randomization in a way that each member of the study population has some calculable chance of being chosen, provided the known probabilities of sample selection are properly used in making estimates of C. The impact of a valid sample on the MSE(c) is the potential to avoid biases contributing to Bias(c), such as those arising out of mistakes in the estimation process (e.g., failure to use sample weights; see ISSUE 4), or due to incompleteness in the population lists used for sample selection (e.g., see coverage error under ISSUE 3). Another factor affecting the MSE(c) is the formulation chosen to estimate C, which through the use of ratio estimation, regression estimation, and other methods employing ancillary information can improve the quality of estimates (4; pp. 150-185).

Unfortunately, a valid sample design is not necessarily a "good" sample design in the broadest sense of the term. For instance, a sample of n=2 public school students in the U.S., chosen by randomly selecting two schools from a complete national list of public schools and then randomly choosing one student from each of the two school rosters, would be valid though clearly deficient in supplying information about the Nation’s students. This example suggests that validity is a necessary but not sufficient quality of a sample design. One also needs efficiency, measured by the stability of estimates among samples the design can produce; i.e., by the Var(c) component of the MSE(c), which is determined by sample size and by how effectively various selection strategies (e.g., stratification) are used. The use of probability sampling, combined with the inclusion of appropriate selection strategies to increase the statistical efficiency of estimates from the resulting sample, thus characterizes a "good" sample design.

Model-Based Approach

During the past 30 years various estimation strategies under a model-based approach to statistical inference have emerged as an alternative to the design-based perspective (3,38). As suggested by its name, this class of analysis methods depends on statistical models whose main purpose is to explain the origin of each set of key outcome measurements in the sampled population (Figure 1b). For each measurement set, the study population is seen as an outcome from an underlying random process (circled I), which is portrayed by an assumed statistical model. Explained another way, members of the study population, with their associated measurements, are viewed as a random sample from a more abstractly defined, infinite set of measurements often called a "superpopulation." Making statements about the study population involves first learning about the superpopulation using data from the observed sample to fit the assumed model (circled II), and then, if specific knowledge is sought for the study population, relying on the fitted model to predict data for sample nonmembers in making statements about the study population as a whole (circled III). Indicators of statistical quality like the mean squared error of estimates are also used in evaluating model-based estimates, although the "uncertainty" reflected in these MSEs is tied to the random process defined by the assumed model or other sources of randomness, rather than how randomization was used to choose the sample. Significantly, different underlying models may apply to each set of outcome measurements. Also, a characteristic of the superpopulation may be seen by some as the object of inference for some types of analysis (e.g., as the coefficient for an independent variable in an underlying multivariate regression model).
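The fit-then-predict logic of the model-based approach can be illustrated with a small sketch. It assumes a hypothetical working model, not one taken from this paper, in which the outcome y is proportional to a known auxiliary measure x, with error variance proportional to x. Under that assumption the fitted slope is the ratio of the sample totals, and the population total is estimated as the observed sample total plus the model predictions for population members not in the sample. The function name and data are illustrative.

```python
def ratio_model_total(sample, nonsample_x):
    """Prediction estimate of a population total under the working model
    y = beta * x + error (an assumed model, chosen only for illustration).

    sample       -- list of (x, y) pairs observed in the sample
    nonsample_x  -- x values for population members not in the sample
    """
    sum_y = sum(y for _, y in sample)
    sum_x = sum(x for x, _ in sample)
    beta_hat = sum_y / sum_x  # slope fitted from the sample data
    predicted_nonsample = beta_hat * sum(nonsample_x)  # predictions for nonmembers
    return sum_y + predicted_nonsample


# Hypothetical data: x might be clinic size, y the number of cases seen.
sample = [(10, 21), (20, 39), (30, 60)]
nonsample_x = [15, 25]
estimated_total = ratio_model_total(sample, nonsample_x)
print(estimated_total)
```

Nothing in the calculation depends on how the three sampled units were selected, so the quality of the estimate rests on whether the assumed model holds for the nonsampled units.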

It is important to note that methods of randomized sampling are used in connection with both design- and model-based approaches, but for somewhat different purposes. Features of the selection process in a well-conceived sample under a design-based perspective are chosen to generate the best possible representation of the study population given available resources, since these same features must be explicitly accounted for in learning about the population from the sample.

A well-represented sample is also valued under the model-based approach, although specific aspects of the design are not as directly relevant to the inference process, since a key use of sample data is to fit the underlying model to learn about the superpopulation, and possibly the study population. For this reason, one might view the emergence of model-based methods as a return to the statistical inference based on nonprobability sampling that was in common use prior to the advent of probability sampling in the early 1900s. Now, however, estimation methods based on highly sophisticated models in conjunction with the classical theory of statistics can be applied using high speed computers, thus making their use more plausible. The reality of statistical practice at present is that the widespread use of model-based methods remains somewhat elusive to mainstream practitioners because most software packages do not implement them.

Also important in choosing between inference approaches is one’s level of comfort in using models in the design and analysis of a sample, since models are used under both approaches, although somewhat differently (15). Practitioners following a design-based approach rely on models primarily as a vehicle to guide attempts to improve the efficiency of the sample design, although this use of models rarely affects the validity of the sample design. For instance, cost and MSE models are used in "Neyman allocation" to decide how large the sample sizes should be for various population groups; yet allocation results from this model are robust to modest departures from the model (4; pp. 115-117). Moreover, modeling by design-based analysts is done to adjust sample weights to at least partially offset the biasing effects of sample imbalance due to nonresponse (see, for instance, the weighting class adjustment in ISSUE 3, where propensity of response for a member of the sample is estimated from the response experience of other "similar" members of the sample). Failure in these models can compromise the bias reduction goal of these adjustments (16). In contrast, models used in conjunction with model-based methods are central to the validity of estimation results, thus making estimation vulnerable to model misspecification (12). Thus, if there are questions about the basic assumptions, the model-based methods may be inappropriate.
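As a rough illustration of the weighting class idea mentioned above, the following sketch estimates a response propensity within each class as the weighted share of sample members who responded, and divides each respondent's base weight by that propensity. This is one common variant, shown under assumed names and made-up data; the adjustment described later in the paper may differ in detail.

```python
from collections import defaultdict


def weighting_class_adjust(sample):
    """Return nonresponse-adjusted weights, one per sample member.

    sample -- list of dicts with keys 'weight' (base weight), 'cls'
              (weighting class) and 'responded' (bool). Key names are
              hypothetical. Nonrespondents receive an adjusted weight of 0.
    """
    total_weight = defaultdict(float)
    respondent_weight = defaultdict(float)
    for unit in sample:
        total_weight[unit["cls"]] += unit["weight"]
        if unit["responded"]:
            respondent_weight[unit["cls"]] += unit["weight"]

    adjusted = []
    for unit in sample:
        if unit["responded"]:
            # Estimated response propensity for "similar" sample members.
            propensity = respondent_weight[unit["cls"]] / total_weight[unit["cls"]]
            adjusted.append(unit["weight"] / propensity)
        else:
            adjusted.append(0.0)
    return adjusted


# Hypothetical sample: class A has a 50% response rate, class B 50% as well.
sample = [
    {"weight": 10.0, "cls": "A", "responded": True},
    {"weight": 10.0, "cls": "A", "responded": True},
    {"weight": 10.0, "cls": "A", "responded": False},
    {"weight": 10.0, "cls": "A", "responded": False},
    {"weight": 5.0, "cls": "B", "responded": True},
    {"weight": 5.0, "cls": "B", "responded": False},
]
adjusted_weights = weighting_class_adjust(sample)
print(adjusted_weights)
```

When every class contains at least one respondent, the adjusted weights sum to the same total as the base weights, so respondents stand in for the nonrespondents in their own class.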

ISSUE 2: HOW SHOULD THE SAMPLE BE CHOSEN?

As noted previously, an effective sample design is key to the success of a population-based epidemiological study, regardless of the inference approach one follows. However, development of the sample design is typically given somewhat greater priority when the design-based perspective is followed, since the statistical quality of findings depends more heavily on how randomization is applied in choosing the sample. For this reason we largely adopt this perspective in this section of the paper, while recognizing that some of what we present also applies to the development of the sampling plan from the model-based perspective (e.g., the use of stratification).

Developing the sample design (i.e., the specific plan of action followed in choosing the sample) for an epidemiological study is largely a sequence of decisions that involves the study’s information goals and a variety of statistical "tools" that may be used to address specific statistical needs linked to those goals. This decision process is typically subject to a variety of constraints that are almost always fiscal in nature, but may also be institutional, logistical, or temporal. The goal of the design development process therefore is to find that configuration of sampling tools that will meet the scientific needs of the study, subject to any constraints. To do this well, the sample architect must be able to uncover the study’s needs and thoroughly understand the sampling tools that might be used, while successfully engaging the study’s decision makers in the design development process.

The design decision process in sampling is far from self-evident. First of all, choices in the decision process are rarely obvious. Thus, the science and art of decision making must converge to produce statistical optima with a measure of common sense. Implications (both positive and negative) must often be balanced in weighing the relative merits of alternatives in design decision making. Therefore, this process should not be seen so much as the search for a single "best" design, but for one among possibly several equally reasonable approaches. Gaps between the theoretical and actual effects of many design features make complete agreement on the statistical implications of many design alternatives virtually impossible (e.g., the effect of variable sample weights on the precision of estimates). Finally, while our presentation of how a sample design is developed might suggest that this process is purely sequential, some parts of the decision process may be iterative or overlap with other parts. For instance, the initial sampling plan may need to be revisited if it becomes apparent that the sampling frame called for in this plan is inadequate.

Two basic questions guide the development of a sampling plan. One asks what is to be learned about the study population from the sample, and the other (arising out of the first) asks which sampling tools are best used to meet the study’s information goals.

Information Gathering

The process of developing a sampling plan thus first requires information gathering about the study. This phase usually begins by defining the study’s scientific objectives. These typically dictate the type of research design to be used, and understanding these objectives allows the designer to identify population measures appropriate to the study goals and associated measurements that will figure into later design decisions. For example, investigating the efficacy of a new community-level nutrition intervention program in a field trial may imply several key outcome measurements like body mass, nutrient content, and portion size. The designer may then create or identify sources of existing information on these measurements, such as pretests or pilot studies done prior to the main part of the planned study. Data from prior research studies involving these or similar measurements may also be sought.

Once the study’s goals have been established, the next required design element is the definition of the study population. In design-based studies, this population is the same as the population to be sampled and the population to which statements from the sample are to be inferred. In model-based design development the sampled population and inference population may differ. Defining the study population usually requires a set of eligibility criteria that must be met. These criteria typically refer to location and duration of residency, health status, as well as other socio-demographic characteristics of relevance to the study. As a byproduct of information on the objectives and population, one determines the units of observation (i.e., establishing whose data will be collected in the study).

Various types of population data may be useful in developing the sample design. In addition to population-wide descriptive data to help the designer understand the size and variation of these measurements among members of the population at large, it is often helpful to profile subgroup differences in these measurements to identify potential correlates of the measurements. Measurement-specific information on intraclass correlation, indicating how internally similar various cluster groupings of the population (e.g., adults living in the same county) are relative to the population as a whole, may also help with later decisions on if and how best to sample these clusters (4).

Another important item of background information on the study is the definition of subgroups, or domains, of the study population that may serve as a particular focus during analysis. For instance, changes in the demographic profile of the study population may dictate the need for assurances that estimates from a planned cross-sectional study will be of sufficient statistical quality for a subgroup. If in addition to learning about the population as a whole there is scientific utility in learning about these population subgroups as well, one aims to know what percent of the population is in these domains and how estimates from these domains contribute to meeting study objectives.

Analysis of population domains may be used to examine differences among policy relevant groups (e.g., by poverty level, health insurance status, or geographic region). It may also provide an initial search for predictors of key outcome measurements. This information on important population domains often impacts design decisions related to sample size, to ensure that adequate sample sizes will be available in the analysis phase to learn about both the overall population and key domains.

Several other nonstatistical pieces of information are relevant to the information gathering phase of design development. Information on resources available to the study may help the sample designer to decide on the level of complexity that can be tolerated in the eventual sample design. Information on resources and timeframe together guides the designer in assessing the feasibility of various design options. By way of illustration, one might decide against using strata for stratified sampling for a study if the required information must be obtained by in-person interviews, requiring extensive training of study personnel. Finally, one would also find out about possible resources that could be used to construct the sampling frame from which the study sample is drawn, such as administrative records or population lists. Planning sample selection for a cohort study of first grade students, for example, might explore the suitability of lists of elementary schools if these students were to be selected through the schools they attend. Information to be gathered and tested through pilot studies would include the completeness, currency, content, and accessibility of the list, parental consent, migration and occupational mobility, among others.

The next general step in the design development process is to determine which sampling tools best contribute to meeting the information needs of the study. Several sampling texts provide excellent accounts of the theory underlying these tools (11,18,24,26,35,38). Our purpose here is to briefly examine the statistical utility and implications of several tools connected to three main features of a sample design: cluster sampling, stratification, and the device applying some form of randomization to select the sample. In so doing, we will also note how these tools might be configured into the sampling plan for epidemiological studies.

The decision making phase in the development of the sample design for a study begins by determining what mode is to be used in gathering data from the sample. The most common modes of data collection in epidemiological studies are mail, telephone, and in-person (28). Deciding on mode is done early, since resource and frame information often dictate it. For example, limited resources and the availability of a reasonably complete list of study population members often imply the need to do a mail survey, whereas the absence of a population list and a more generous set of resources needed to obtain more complex population measurements usually point to data gathering by telephone or in-person since higher response rates can be expected.

Cluster Sampling?

Once the mode of data collection is established, one considers if, and if so how, cluster sampling is to be used. A cluster sample is one in which a sample of groups of population members is chosen as a stage of the sampling process. Clusters are typically defined by levels of some type of socio-political hierarchy (e.g., levels for a statewide population of first graders consisting of: first grade students, within classrooms, within elementary schools, within counties, within regions, and within the state).

Cluster sampling involves randomized sampling within one or more levels of a hierarchy of the study population. Each level sampled corresponds to a stage of the sample. For example, a two-stage statewide sample of first graders might involve sampling schools in the first stage, thus designating the school to be the first stage or primary sampling unit (PSU). The second stage might then consist of separately choosing a sample of first graders within each sample school, thus making the student its secondary sampling unit (SSU).
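The two-stage example can be sketched in a few lines. The code below is a minimal illustration that assumes simple random sampling at both stages and uses hypothetical school rosters; actual designs often select PSUs with unequal probabilities. Alongside each selected student it records the overall selection probability, the product of the stage-specific probabilities, since that quantity is what sample weights are later built from.

```python
import random


def two_stage_sample(rosters, n_schools, n_students, seed=0):
    """Select schools (PSUs) by simple random sampling, then students (SSUs)
    by simple random sampling within each selected school.

    Returns a list of (school, student, selection_probability) tuples.
    """
    rng = random.Random(seed)
    school_ids = sorted(rosters)
    selected_schools = rng.sample(school_ids, n_schools)  # stage 1: PSUs
    p_school = n_schools / len(school_ids)

    sample = []
    for school in selected_schools:
        roster = rosters[school]
        k = min(n_students, len(roster))  # cannot take more than the roster holds
        p_student = k / len(roster)
        for student in rng.sample(roster, k):  # stage 2: SSUs
            sample.append((school, student, p_school * p_student))
    return sample


# Hypothetical rosters of first graders, keyed by school.
rosters = {
    "school_a": ["a1", "a2", "a3", "a4"],
    "school_b": ["b1", "b2", "b3", "b4", "b5", "b6"],
    "school_c": ["c1", "c2"],
    "school_d": ["d1", "d2", "d3"],
}
sample = two_stage_sample(rosters, n_schools=2, n_students=2)
for row in sample:
    print(row)
```

Because the rosters differ in size, students in small schools have a higher chance of selection than students in large ones under this scheme.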

Sampling clusters can substantially reduce study costs if data gathering requires face-to-face contact over a geographically expansive study area. It also eliminates the need for a sampling frame consisting of a complete population list. In many populations such a list is expensive or impossible to create. Accompanying these practical advantages is an important statistical disadvantage of cluster sampling. This limitation is manifest as an increase in Var(c) due mainly to the tendency for members of the same cluster to be relatively more alike than members of the population at large. The extent of this within-cluster homogeneity is commonly measured using the intraclass correlation coefficient ($\rho$), which for most measurements and hierarchies is between 0.00 and 0.15.

Relative to a comparable unclustered sample of the same sample size, $\rho$ contributes to a multiplicative increase in Var(c). In a design with n respondents chosen from m sample PSUs, this effect on the statistical quality of an estimate (c) is known as its design effect, which is commonly modeled as $\mathrm{Deff}(c) = 1 + \rho(\bar{n} - 1)$, where $\bar{n} = n/m$ is called the average sample cluster size. Thus, as long as clusters are relatively homogeneous with respect to the measurement corresponding to C (i.e., $\rho$ is positive), cluster sampling will always be less statistically efficient than a simple unclustered sample, with the amount of relative inefficiency directly related to the level of intra-cluster homogeneity. More intuitively, consider the case where $\rho = 1$ for the population measurement (i.e., where variation in the study measurement exists between clusters, but all members of the same cluster have exactly the same value). In this instance, relatively little information about the measurement in the population is obtained by sampling a few clusters, and increasing the number of population members selected within sample clusters adds no new information about the measurement. While most naturally occurring clusters are less homogeneous than this extreme case, the same principle applies. To the extent that resources will allow, the best cluster sample is one with a larger number of clusters and small sample cluster sizes (11; Vol. 1, p. 286).
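A short calculation makes the trade-off concrete. The sketch below applies the common design effect model, one plus the intraclass correlation times the average sample cluster size minus one, and converts it into an "effective" sample size, meaning the size of an unclustered sample with roughly the same variance. The function names and the numerical inputs (an intraclass correlation of 0.05 and 1,000 respondents) are assumptions chosen for illustration.

```python
def design_effect(rho, n, m):
    """Design effect for n respondents spread over m sample PSUs,
    using the model 1 + rho * (average cluster size - 1)."""
    average_cluster_size = n / m
    return 1 + rho * (average_cluster_size - 1)


def effective_sample_size(rho, n, m):
    """Size of an unclustered sample with approximately the same variance."""
    return n / design_effect(rho, n, m)


# Same total sample size, different numbers of clusters (hypothetical values).
for m in (50, 100, 200):
    deff = design_effect(0.05, 1000, m)
    n_eff = effective_sample_size(0.05, 1000, m)
    print(m, round(deff, 2), round(n_eff, 1))

# Extreme case: perfectly homogeneous clusters.
print(effective_sample_size(1.0, 1000, 50))
```

With the intraclass correlation set to 1, the effective sample size equals the number of clusters, which matches the observation that additional members of a perfectly homogeneous cluster add no new information.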

Stratified Sampling?

The second general feature that may be used in a sample design is stratification, the process of dividing a group of population members into non-overlapping subgroups called strata for the purpose of improving the efficiency of the sample design. Stratified sampling then means that the sample design has incorporated stratification somewhere in the selection process. Stratification is conceptually similar to cluster sampling in that it is applied to one or more levels of a population hierarchy. It can therefore be applied to sampling at any stage of a multi-stage cluster sample, although in multi-stage cluster sampling it is almost always used in choosing PSUs for the first stage of selection, since when properly applied its use offsets some of the losses in statistical efficiency caused by sampling clusters. It can also be applied to the sampling of individual population members in an unclustered sample.

The main difference between cluster sampling and stratified sampling is that groups are not sampled at a stratification level of the population hierarchy; every stratum is included, and sampling is carried out separately within each. For example, if stratification were applied at the "region" level of the statewide hierarchy of first grade students previously presented, that would imply that the two-stage sample of first graders would be separately chosen in each of the state’s regions.

Sample stratification is used to improve the statistical efficiency of certain study estimates (c). It may be used to reduce $\mathrm{Var}(c)$ for total population estimates by assuring adequate representation of all strata in the sample. To best achieve improvement in the efficiency of total population estimates, one hopes for key study measurements to be statistically correlated with the population characteristics used to define the strata. Improved efficiency is achieved in this case since the sample is more likely to reflect the full spectrum of individual measurements tied to the population characteristic (C), thus tending to minimize the sampling error of estimates (c). Stratification may also be used to improve the efficiency of estimates for relatively small but important population subgroups by "oversampling" (i.e., designating disproportionately large sampling rates to) the strata that define or contain large percentages of them. In some designs stratification is used for both purposes.

In addition to deciding how to stratify the design, one must determine how the overall sample is to be allocated among strata. Four types of stratum allocation are commonly seen. When benefiting total population estimates is the most important use of stratification, either proportionate or optimum allocation is commonly used. All stratum-specific sampling rates (i.e., the proportion of population members in the sample) in proportionate allocation are the same, thus producing a sample with proportionately the same representation among strata as the population. This is a safe but not necessarily the best allocation for total population estimates: the efficiency is no worse than that of a comparable unstratified sample, but it may not be the best attainable either.

Sampling rates in a design with optimum allocation among strata are the most cost-efficient ones based on models for $\mathrm{Var}(c)$ and the cost of doing the study. For a simple descriptive population characteristic (C) to be estimated from a stratified sample, this means that one must apply the largest sampling rates in the strata with the greatest diversity in the measurements of relevance to the definition of C and the lowest cost of adding another member to the sample. Conversely, strata with the least diversity and highest unit costs are assigned the lowest sampling rates. In theory, then, optimum allocation leads to more efficient total population estimates than proportionate allocation, although in practice the difference in quality may not be great if stratum costs and diversity are relatively similar among strata.
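To make the contrast between proportionate and optimum allocation concrete, the sketch below uses the standard textbook form of optimum allocation, in which the stratum sample size is proportional to the stratum size times the stratum standard deviation, divided by the square root of the unit cost. This is an assumption about the cost and variance models, which the discussion above does not spell out, and the stratum sizes, standard deviations, and costs are invented for illustration.

```python
from math import sqrt


def proportionate_allocation(n, sizes):
    """Same sampling rate in every stratum: n_h proportional to N_h."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]


def optimum_allocation(n, sizes, std_devs, unit_costs):
    """Textbook optimum allocation for a fixed total sample size n:
    n_h proportional to N_h * S_h / sqrt(c_h)."""
    shares = [N_h * S_h / sqrt(c_h)
              for N_h, S_h, c_h in zip(sizes, std_devs, unit_costs)]
    total = sum(shares)
    return [n * share / total for share in shares]


# Hypothetical strata: sizes, within-stratum standard deviations, unit costs.
sizes = [5000, 3000, 2000]
std_devs = [10.0, 20.0, 40.0]
unit_costs = [1.0, 1.0, 4.0]

print(proportionate_allocation(300, sizes))            # [150.0, 90.0, 60.0]
print(optimum_allocation(300, sizes, std_devs, unit_costs))
```

When the standard deviations and unit costs are the same in every stratum, the two allocations coincide, which is consistent with the remark that the practical gain from optimum allocation may be small when strata are similar.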

While one can achieve close to the best precision of total population estimates if stratum-specific cost and diversity data are reasonably good, there is an element of statistical risk in the use of optimum allocation if the stratum data are substantially incorrect. The use of stratification in this worst-case situation can produce worse efficiency in total population estimates than not using stratification.

Another use of disproportionate allocation among strata is to facilitate an "oversampling" of one or more relatively small but important population domains. For instance, if the two-stage sample of first graders above is used for a cohort study to evaluate the long-term health effects of immunization one might need to focus on students with disabilities. If the percentage of students with disabilities is small, the sample size in this important subgroup may be too small to adequately achieve study goals. Hence, students with disabilities might be oversampled by stratifying students by disability status in sampling within selected schools, and applying relatively higher sampling rates in the disability stratum of each school. Oversampling in a design setting like this is likely to achieve the sample size increases that are sought since the group to be oversampled can be fully isolated in the strata that are formed. It may be much less effective in achieving dramatic sample size increases when the trait cannot be as effectively isolated (14).

Balanced allocation in stratified sampling involves designating the same sample size for each stratum. This allocation is generally used in designs for strata of unequal size and where the main use of the sample data is to prepare stratum-specific estimates or to compare estimates among strata. This allocation is disproportionate in populations with unequal-sized strata, and thus may somewhat limit the efficiency of estimates for the total population, to the extent that the size and composition of the strata in reference to the main study measurements are not correlated. This loss in precision is tied to the effect of variation in sample weights that this allocation yields.

Which Randomization Device?

Decisions concerning where in the population hierarchy cluster sampling and stratification are to be applied determine the structure of the sample to be generated. Yet to be determined for each stage of sampling is precisely how randomization will be used in sampling units at the corresponding level of the hierarchy. Several options are available to the designer, although all require a list (or frame) of units to be sampled to implement the selection process. One selection device is simple random sampling with replacement (SRSWR), in which each selection at random from the list is replaced in the list before the next selection is made. Repeat selections are therefore possible in SRSWR. Unclustered samples chosen in this way most closely resemble the iid (independent and identically distributed) random samples assumed in much of classical statistical theory.

Another commonly used selection device is simple random sampling without replacement (SRSWOR), which is similar to SRSWR except that selections are not replaced on the list and the resulting sample thus has no repeat selections. The advantage of SRSWOR over SRSWR is higher statistical efficiency for study estimates due primarily to the so-called finite population correction, which for simple unclustered SRSWOR is 1-f, where f=n/N is the sampling rate when a sample of size n is chosen from a population of size N.
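The two simple random sampling devices and the finite population correction can be sketched with Python's standard `random` module. The frame of 1,000 unit identifiers is hypothetical, and the variance function uses the usual textbook expression for the variance of a sample mean under SRSWOR, which is an assumption added here for illustration.

```python
import random


def srswr(frame, n, rng):
    """Simple random sampling with replacement: repeats are possible."""
    return rng.choices(frame, k=n)


def srswor(frame, n, rng):
    """Simple random sampling without replacement: no repeat selections."""
    return rng.sample(frame, n)


def finite_population_correction(n, N):
    """1 - f, where f = n / N is the sampling rate."""
    return 1.0 - n / N


def variance_of_mean_srswor(s2, n, N):
    """Textbook variance of a sample mean under SRSWOR: (1 - f) * S^2 / n."""
    return finite_population_correction(n, N) * s2 / n


rng = random.Random(2000)          # fixed seed only for reproducibility
frame = list(range(1000))          # hypothetical list of unit identifiers
with_repl = srswr(frame, 100, rng)
without_repl = srswor(frame, 100, rng)
print(len(set(with_repl)), len(set(without_repl)))   # second value is 100
print(finite_population_correction(100, 1000))
```

With a sampling rate of 10 percent the correction is 0.9, so the efficiency gain from sampling without replacement is modest unless the sampling rate is large.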

Another class of selection devices is based on selection of clusters with probabilities proportional to size (PPS), where "size" refers to the number of population members in each cluster. Since actual size measures are usually unknown, hopefully accurate "measures of size" (Mos) are used instead. Several with- and without-replacement PPS sampling methods have been proposed (4). PPS sampling is generally used to select clusters in all but the last stage of multi-stage samples, particularly when the number of population members varies considerably among clusters at all levels. One uses PPS sampling mainly to offset reductions in estimate efficiency that can result from applying SRS devices to clusters of unequal size. Common byproducts of PPS sampling in a multi-stage design are equal selection probabilities for all chosen population members (a statistical advantage), and roughly equal sample sizes in each sample cluster (a practical advantage).
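The equal-probability byproduct of PPS selection noted above can be checked numerically. The sketch below assumes a two-stage design in which m clusters are chosen with probabilities proportional to their measures of size and a fixed number k of members is then chosen by simple random sampling within each selected cluster; it also assumes that each measure of size equals the actual cluster size. The cluster sizes are hypothetical.

```python
def two_stage_pps_probabilities(measures_of_size, m, k):
    """Overall selection probability of a member of each cluster when
    m clusters are drawn with PPS and k members are drawn within each."""
    total = sum(measures_of_size)
    overall = []
    for size in measures_of_size:
        p_cluster = m * size / total      # first-stage (PPS) probability
        if p_cluster > 1 or k > size:
            raise ValueError("cluster too large for PPS or too small for k")
        p_within = k / size               # second-stage (SRS) probability
        overall.append(p_cluster * p_within)
    return overall


sizes = [50, 100, 150, 200]               # hypothetical cluster sizes
print(two_stage_pps_probabilities(sizes, m=2, k=10))
```

Every member has the same overall probability, m times k divided by the total measure of size, and each selected cluster contributes the same sample size k.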

Finally, systematic sampling is a relatively simple selection device that is often used when other devices are too complicated to use. Selection involves choosing a random start, finding the corresponding sampling unit on the frame, and then selecting other sampling units on the frame by sequentially applying an interval of constant length (i.e., roughly the inverse of the intended sampling rate) after the random start until the entire frame has been traversed. Unlike most other randomization devices in sampling, where the order of the sampling frame is irrelevant, frame order is important and can be beneficial in samples using systematic selection. Indeed, estimates from a sample employing systematic selection from a frame that is ordered by some set of criteria have roughly the same statistical efficiency as a proportionately allocated stratified sample using strata defined by the ordering criteria. A systematic sample chosen in this way is said to be implicitly stratified by those criteria. The primary drawback to systematic sampling is that unbiased estimates of $\mathrm{Var}(c)$ are unavailable, thus necessitating the use of approximate methods which tend to understate the actual efficiency of estimates (18). Since these variance estimation methods follow the order of selection, this process information must be retained with the sample data.
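A minimal sketch of systematic selection follows, using a fractional sampling interval so that exactly n units are chosen. The frame, sorted by a hypothetical region code, is invented to show the implicit stratification described above; the function name and the fractional-interval variant are illustrative choices rather than a prescribed method.

```python
import random
from collections import Counter


def systematic_sample(frame, n, rng=None):
    """Select n units by applying a constant interval after a random start."""
    rng = rng or random.Random()
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("sample size must be between 1 and the frame size")
    interval = N / n                       # inverse of the sampling rate
    start = rng.random() * interval        # random start in [0, interval)
    return [frame[min(int(start + j * interval), N - 1)] for j in range(n)]


# Hypothetical frame ordered by region: 40 units in A, 30 in B, 30 in C.
frame = ([("A", i) for i in range(40)]
         + [("B", i) for i in range(30)]
         + [("C", i) for i in range(30)])
sample = systematic_sample(frame, 10, random.Random(1))
counts = Counter(region for region, _ in sample)
print(dict(counts))
```

Because the frame is sorted by region, the sample falls in each region in proportion to that region's share of the frame, whatever the random start.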

How Large a Sample?

Deciding on sample size is a design issue with either a statistical or practical solution, but with statistical implications in any event. This decision is usually made after the background information has been collected and a degree of closure has been reached on the issues of cluster sampling and stratification. The statistical solution requires first that standards of statistical efficiency be established in collaboration with study leaders for each important estimate to be generated and test to be run during analysis.

Various measures of statistical efficiency may be considered, including the variance, standard error, margin of error, and power to detect significance, although most are mathematically related to the variance of study estimates, $\mathrm{Var}(c)$, which in turn can be written as a function of the desired sample size (n) for most designs. Solving for n in the assumed efficiency model leads to the result. For example, one might require a margin of error of 0.05 on the overall estimated rate (c) of immunization coverage in the baseline of the statewide cohort study mentioned earlier. If a stratified two-stage cluster sample is to be used in that study, a reasonable model for the margin of error would be $ME = t\sqrt{Deff \cdot C(1-C)/n}$, where t is the standard normal multiplier for the confidence level of the margin of error and C is the conjectured actual coverage rate in the population. Solving for n here we have $n = t^2 \, Deff \, C(1-C)/ME^2 = (1.96)^2(1.50)(0.70)(0.30)/(0.05)^2 \approx 484$, assuming t=1.96 for 95% confidence, C=0.70, and Deff=1.50. If the quality standards are modified slightly to require the same margin of error, but for a region that comprises 25% of the state’s population, the recommended total sample size must be increased to $n \approx 484/0.25 \approx 1{,}936$.

When resources determine how large a sample a study can afford, the usual strategy is to create a simple formula for the total cost of the study as a function of n, and then solve for n as above. Even when the proposed sample size is based on the study’s budget, it is wise to project the level of statistical quality on important study estimates, to be assured that the level of efficiency one can afford is sufficient. For example, if we find that the budget will only support n=300 baseline respondents, this implies that $ME = 1.96\sqrt{(1.50)(0.70)(0.30)/300} \approx 0.064$, which may still be acceptable statistical efficiency for the study.
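The sample size and margin of error calculations in this subsection can be reproduced with a few lines of Python. The sketch assumes the margin of error model for an estimated proportion with a design effect multiplier, and it assumes that the statewide sample is spread across regions in proportion to population when scaling up for the regional requirement; the function names are illustrative.

```python
from math import sqrt


def required_sample_size(margin, C, deff, t=1.96):
    """Solve margin = t * sqrt(deff * C * (1 - C) / n) for n."""
    return t ** 2 * deff * C * (1.0 - C) / margin ** 2


def margin_of_error(n, C, deff, t=1.96):
    """Projected margin of error for a given sample size n."""
    return t * sqrt(deff * C * (1.0 - C) / n)


n_state = required_sample_size(margin=0.05, C=0.70, deff=1.50)
n_total_for_region = n_state / 0.25    # region holds 25% of the population
me_budget = margin_of_error(n=300, C=0.70, deff=1.50)

print(round(n_state, 1), round(n_total_for_region, 1), round(me_budget, 4))
```

In practice the computed sample size would be rounded up and then inflated for anticipated ineligibility and nonresponse.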

While the sample design in its final form consists of details on how to choose the study sample, the actual selection process may depart somewhat from this plan. Several facets of the study implementation may cause this departure. For example, attrition in the selected sample may exceed that which was expected, thus requiring midcourse adjustments in sampling rates or adoption of a random substitution plan to replace nonrespondents. The original plan for constructing a PSU sampling frame may be revised to deal with an unexpected number of virtually empty clusters, or the definitions of strata used in selecting the first stage sample may be changed to improve the variance reduction effect of the sampling strata. These modifications can challenge one’s ability to effectively use data from the sample in learning about the population, thus demanding ways to deal with their implications. We next describe some ways to deal with these practicalities.

ISSUE 3: WHAT ABOUT SAMPLE INTEGRITY?

Some parts of the intersection of sample design and the study implementation impact the efficiency of study estimates, while others strike at the essential validity of the resulting sample-generated estimates. We turn our attention now to the effects of the sample selection process on the statistical integrity of the sample. In so doing we identify the main sources of lost integrity and some common remedies.

Statistical error in study estimates not due to sampling can be grouped into three main categories: error due to coverage problems with the sampling frame, error arising from attrition in the sample, and error due to problems in the study measurement instruments (10,23). While measurement error is an important source of survey error, it does not involve problems that uniquely influence the ability of the sample to function as the population in miniature. On the other hand, frame problems can influence sample representation if the sample is selected from a universe other than the study population, while nonresponse can affect representation by creating selective imbalance in the sample on traits for which differential rates of response occur.

Frame Problems

Several troublesome problems can arise with the frame that is used to choose a sample. All reflect a lack of correspondence between entries on the frame and the individual members of the study population. Each addresses the ability of the frame and thus the sample to "cover" the population, which is why coverage error is another way of referring to problems with the frame in epidemiological studies.

Most frame problems fall into one of three categories. One type of problem is undercoverage, where some members of the population are not linked to any entry on the frame. Undercoverage is usually the most serious problem and thus most widely recognized, since it can contribute to significant increases in both $\mathrm{Bias}(c)$ and $\mathrm{Var}(c)$, particularly the former. It is a well-known problem in cross-sectional and cohort studies that gather data by telephone, where persons without a telephone, or without a phone directory listing, will be excluded. The primary manifestation of undercoverage is imbalance in the sample due to differential rates of coverage by the frame. For instance, sample coverage in telephone samples aimed at the general population tends to be correlated with household income, race/ethnicity, education, and employment status (9). Bias due to undercoverage is inversely related to the sample coverage rate (i.e., the proportion of the population that is linked to the frame) and directly related to the aggregate difference (in the study measurement corresponding to C) between those who are and are not covered by the frame.

A second frame problem is overcoverage, where some entries on the frame are linked to nonmembers of the population. These "ineligibles" in the selected sample are usually recognized and become a source of sample attrition, though the statistical effect of their presence is mainly to reduce the sample size and thus increase $\mathrm{Var}(c)$. They do not contribute to $\mathrm{Bias}(c)$, and thus do not invalidate an otherwise valid sample. Movers and decedents are examples of ineligibles in many cohort studies.

The third type of frame difficulty is multiplicity, in which a member of the population is linked to more than one entry on the frame, thus giving it multiple chances to be chosen. Patients sampled through health care providers in case-control studies represent one type of sample with multiplicity present. Like undercoverage, frames with multiplicity present can lead to increases in both $\mathrm{Bias}(c)$ and $\mathrm{Var}(c)$. Bias is increased in estimates if those with and without multiple links differ in the aggregate with respect to the study measurement tied to C, and nothing is done to compensate for the multiplicity. Variance may increase due to variable weights if the analyst uses a so-called multiplicity estimator, in which the data are specifically weighted to account for increases in the sample selection probabilities for those with multiple links on the frame (41).

Nonresponse

Nonresponse is another practical aspect of a study that can lead to loss in sample integrity by creating imbalance in the outcome of an otherwise well-conceived sample design. This loss occurs mainly because response rates tend to differ by certain types of study characteristics such as race/ethnicity, age, income, education, population density, etc. (10). The primary statistical manifestation of nonresponse is an increase in $\mathrm{Bias}(c)$, although $\mathrm{Var}(c)$ can also increase if steps are not taken to offset resulting reductions in the sample size and if nonresponse is viewed as a somewhat random phenomenon (23, pp. 134-137). If one views population members as certain to respond or not respond, bias due to nonresponse is inversely related to the sample response rate (i.e., the proportion of the study-eligible members of the sample who respond) and directly related to the aggregate population difference (in the study measurement corresponding to C) between respondents and nonrespondents.
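The relationship between bias, response rate, and the respondent-nonrespondent difference can be written out explicitly under the deterministic view of response. The sketch below uses the familiar identity that the bias of the respondent mean equals the nonresponse rate times the difference between respondent and nonrespondent means; the response rate and group means are hypothetical. The same form applies to undercoverage if the coverage rate and the covered and uncovered groups are substituted.

```python
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    """Bias of the respondent mean as an estimate of the population mean,
    assuming each population member is certain to respond or not respond."""
    return (1.0 - response_rate) * (mean_respondents - mean_nonrespondents)


# Hypothetical population: 80% respond; coverage is 0.75 among respondents
# and 0.55 among nonrespondents.
rate, y_resp, y_nonresp = 0.80, 0.75, 0.55
population_mean = rate * y_resp + (1.0 - rate) * y_nonresp

print(nonresponse_bias(rate, y_resp, y_nonresp))   # about 0.04
print(y_resp - population_mean)                    # same value, computed directly
```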

Frame and nonresponse problems are handled somewhat similarly. Preventing each type is important and possible, but prevention is more commonly applied to nonresponse. Correcting mistakes on large frames can be relatively costly, while efforts to improve response rates are often feasible through the use of a variety of generally effective preventive strategies (e.g., use of incentives, endorsements, and additional attempts to gain participation). Coverage and nonresponse problems can also be remedied by special efforts to measure the bias effects (e.g., through more intensive solicitation to gather study data from a sample of initial nonrespondents to the baseline round of a cohort study). The last category of compensatory strategies involves making statistical adjustments to the sample data as part of the process of generating sample weights.

Sample Weights

A sample weight is a number tied to a member of a sample that is intended to reflect the inverse of the member’s selection probability, which is calculable in any probability sample. The weight ($w_i$) for each sample member (i.e., the i-th individual) can also be interpreted as the number of population members represented by that member. A single set of these weights is prepared for analyses involving data gathered for the sample to whom the weights apply.

The use of weights in preparing estimates from samples traces back nearly 50 years to the work of Horvitz and Thompson (13), who first noted that unbiased estimates of population totals could be obtained by weighting the data in performing the analysis (i.e., in effect multiplying each sample measurement by its corresponding weight in aggregating the data to estimate C). For example, the sum of the weights ($w_i$) amongst all sample members is an unbiased estimate of the size of the study population (N), and the weighted sum of a population measurement (i.e., the product of measurement times weight summed over all sample members) is an unbiased estimate of the population total for that measurement.
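A small numerical sketch may help fix the idea of weighted estimation. It assumes a hypothetical stratified sample in which a small stratum has been oversampled, with each weight equal to the inverse of the member's selection probability; the data values are invented.

```python
def weighted_total(values, weights):
    """Weighted sum of the measurement: an estimate of the population total."""
    return sum(w * y for y, w in zip(values, weights))


def estimated_population_size(weights):
    """Sum of the weights: an estimate of the population size N."""
    return sum(weights)


def weighted_mean(values, weights):
    """Ratio of the estimated total to the estimated population size."""
    return weighted_total(values, weights) / estimated_population_size(weights)


# Hypothetical sample: 3 of 900 members from stratum A (weight 300 each)
# and 2 of 100 members from an oversampled stratum B (weight 50 each).
values = [1, 0, 1, 1, 1]            # e.g., 1 = immunized, 0 = not immunized
weights = [300, 300, 300, 50, 50]

print(estimated_population_size(weights))    # 1000
print(weighted_total(values, weights))       # 700
print(weighted_mean(values, weights))        # 0.7
print(sum(values) / len(values))             # 0.8 when weights are ignored
```

The unweighted mean is pulled toward the oversampled stratum, which is the kind of distortion that weighting is meant to remove.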

The process of calculating sample weights becomes a part of the strategy of dealing with frame and nonresponse problems when, as is frequently the case, the final set of weights is adjusted for the imbalance resulting from these problems. A probability of selection ($\pi_i$) is first calculated for each sample member, based specifically on how randomization was used to choose the sample. For example, this probability would be the product of the stage-specific selection probabilities in the two-stage cluster samples used in the cohort study on childhood immunization. The reciprocal of $\pi_i$ becomes a provisional weight for the sample member; i.e., $w_i^{(0)} = 1/\pi_i$.

The first of two multiplicative adjustments is then made using estimates of the likelihood, or propensity, of response ($\hat{p}_i$) for each sample member. Intended as at least partial compensation for nonresponse bias exclusively, the first adjusted weight is calculated as $w_i^{(1)} = w_i^{(0)}/\hat{p}_i$. Estimates of response propensity may be response rates in strategically formed subgroups of which members of the original sample are a part (16), or predicted response outcomes from a multivariate regression model (8). Ideally, one hopes to form internally homogeneous adjustment subgroups so that respondent and nonrespondent portions have similar values for the population characteristic of interest, since reduction in estimation bias occurs to the extent that this type of homogeneity is achieved (16).

To further compensate for any remaining imbalance due to nonresponse and other imbalance arising out of random selection and any other sampling problems attributable to the frame, the adjusted weights ($w_i^{(1)}$) are post-stratified to the best available distribution of population counts by a joint classification of the study population according to one or more characteristics that are known to be correlated with key study measurements (e.g., age, race/ethnicity, and gender). This step amounts to calculating the final adjusted weight as $w_i = w_i^{(1)}\,[N_g/\hat{N}_g]$, where the numerator of the post-stratification adjustment (in brackets) is the external population count ($N_g$) for the group (g) of which the i-th sample member is a part, and the denominator ($\hat{N}_g$) is the estimate of $N_g$ obtained by summing $w_i^{(1)}$ over all sample members in that group. Calculated values of $w_i$ are added to the sample data file for analysis, thereby completing the process.
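The three weighting steps just described (provisional weight, nonresponse adjustment, post-stratification) can be sketched as follows. The sketch assumes that response propensities are estimated by weighted response rates within adjustment cells, which is only one of the options mentioned above, and that nonrespondents receive a weight of zero. All function names, cell and group labels, selection probabilities, and population counts are hypothetical.

```python
from collections import defaultdict


def provisional_weights(selection_probs):
    """Reciprocal of each member's selection probability."""
    return [1.0 / p for p in selection_probs]


def nonresponse_adjusted(w0, responded, cells):
    """Divide respondents' weights by the weighted response rate of their cell."""
    selected, respondents = defaultdict(float), defaultdict(float)
    for w, r, c in zip(w0, responded, cells):
        selected[c] += w
        if r:
            respondents[c] += w
    return [w / (respondents[c] / selected[c]) if r else 0.0
            for w, r, c in zip(w0, responded, cells)]


def post_stratified(w1, groups, population_counts):
    """Scale weights so each group's weight sum equals its external count."""
    estimated = defaultdict(float)
    for w, g in zip(w1, groups):
        estimated[g] += w
    return [w * population_counts[g] / estimated[g] if w > 0 else 0.0
            for w, g in zip(w1, groups)]


probs = [0.01, 0.01, 0.01, 0.01, 0.02, 0.02]
responded = [True, True, False, True, True, False]
cells = ["a", "a", "a", "b", "b", "b"]          # nonresponse adjustment cells
groups = ["f", "m", "f", "m", "f", "m"]         # post-strata
counts = {"f": 260, "m": 240}                   # external population counts

w0 = provisional_weights(probs)
w1 = nonresponse_adjusted(w0, responded, cells)
w_final = post_stratified(w1, groups, counts)
print([round(w, 1) for w in w_final])
```

After the first adjustment the respondents in each cell carry the full weight of that cell, and after the second the weights in each post-stratum sum to the external count.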

It is important to note that the two weight adjustments contribute to reduce estimation biases that may occur because of imbalance in the sample, but they rarely eliminate these biases altogether. Thus, the investigator may rely on other methods for dealing with imperfect frames and nonresponse, such as sampling a portion of the nonrespondents and applying extraordinary means to gather data from them (23, pp. 177-181).

One final sample integrity issue has to do with the retention of structural identifiers of the sample design on the data files that are used for analysis. Since cluster selection, the use of stratification, and variation in selection probabilities may all be important in making statements about the study population from the sample, retaining complete selection files and sample identifiers for clusters sampled and strata defined at each selection stage is essential to facilitate the accommodation of these features in subsequent analysis. Failure to do so can make learning about the population more difficult, especially when analysis methods following a design-based approach are used. Given the presence of this sampling information, which of it is essential in successfully building the bridge between sample and population?

ISSUE 4: WHICH SAMPLING FEATURES ARE IMPORTANT IN ANALYSIS?

As we have seen, sample designs for several types of epidemiological studies may include any of several features that impact the statistical quality of estimates from the sample, including cluster sampling, stratification, and varying selection probabilities among sample members (leading to the computation of sample weights). Employing these features, however, yields samples whose data are neither independent nor identically distributed. The analyst of such data must then establish to what extent these features should be accommodated in analysis.

Much of the design- and model-based theoretical work in survey statistics during the past two decades has focused on analysis of data from samples with complex designs. Several helpful reviews have been written on this topic (15, 31, 32, 36). While earlier work during this period displayed a somewhat dramatic divergence in opinion concerning the importance of design features (design-based advocates said all features were important; model-based proponents said all features can be ignored), more recent results seem to suggest a convergence of views. Model-based statisticians now link weighted estimates with model-based estimation, and recognizing that error residuals in regression models may differ among clusters and strata has also led modelers to seek ways to account for these design features in analysis. Most, in fact, advocate the use of stratification in sample selection. During this same period, design-based analysts have widely used models in detailing their use of stratification and other design features. They have also recognized that their estimates of coefficients in regression modeling may not address the underlying interrelationship among variables in their models, especially when the population size is not large. Another important part of this convergence is a growing consensus concerning the importance of incorporating features of the sample design in the analyses. Research and debate continue, however, as to precisely which features should be incorporated and how.

The design-based perspective considers the use of randomization in choosing the sample as the primary basis for estimating the characteristics of the study population. In this view, the way one formulates the estimate (c) of the population characteristic (C) depends on how the sample is chosen. Thus, the size of $\mathrm{Var}(c)$ and $\mathrm{Bias}(c)$ in $\mathrm{MSE}(c)$ will similarly depend on the nature of the sample design, as will appropriate estimates ($\mathrm{var}(c)$) of the variance of c from the sample. Since learning about the population by means of confidence intervals and tests of hypothesis requires both c and $\mathrm{var}(c)$, it stands to reason that analysis from the design-based perspective must consider cluster sampling, stratification, and sample weights. Several empirical comparisons of the statistical effects of design specification have been reported (1,20,42). Incorporating all design features has stimulated the development of a number of widely available analysis software packages following several different approaches (46) to obtaining $\mathrm{var}(c)$ from a design-based perspective. A number of recent reviews of these packages have been published (e.g., see 2, 6).

Incorporating design features is especially important in descriptive profiles and simple comparative analyses, such as those found in cross-sectional studies, field trials, cohort studies, and others where results of the study are most applicable to the sampled population. In the design-based view, ignoring cluster sampling tends to understate $\mathrm{var}(c)$ and overstate significance levels in tests of hypothesis, whereas ignoring stratification has the opposite effect. Weights are also needed in computing both c and $\mathrm{var}(c)$, although they are generally less important for the former, except when distinctive segments of the population are oversampled. In this case ignoring weights (i.e., weights are considered constant for all sample members) contributes to biased estimates in the direction of study measurements in the oversampled subgroup. Descriptive analysis from a model-based perspective may be relevant for epidemiological studies with a less explicitly defined study population, such as those that might occur in case-control studies and clinical trials, where the sample is not seen as representing any particular group.

The size of $\mathrm{Var}(c)$ may increase depending on the amount that weights vary. When weights and study measurements are largely uncorrelated, one simple model for the multiplicative effect of variable weights on $\mathrm{Var}(c)$ is $1 + s_w^2/\bar{w}^2$, where $s_w^2$ and $\bar{w}$ are, respectively, the variance and mean of the sample weights (18, pp. 427-429). To reduce this adverse effect on study estimates, widely variable weights are sometimes "trimmed" by censoring and redistributing the original set of weights. Unfortunately, weight trimming can also increase $\mathrm{Bias}(c)$, so the approach of trimming is often dictated by minimizing impact on $\mathrm{MSE}(c)$ (34).
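The variance inflation model for variable weights, and the effect of trimming on it, can be illustrated with a short sketch. The variance of the weights is computed here with divisor n, and the trimming rule (cap the weights, then rescale all of them to preserve the original weight total) is a deliberately crude, hypothetical stand-in for the procedures used in practice; after rescaling, some weights may again exceed the cap.

```python
def unequal_weighting_effect(weights):
    """1 + (variance of weights) / (mean of weights)^2, variance with divisor n."""
    n = len(weights)
    mean = sum(weights) / n
    variance = sum((w - mean) ** 2 for w in weights) / n
    return 1.0 + variance / mean ** 2


def trim_weights(weights, cap):
    """Censor weights at cap, then rescale so the weight total is unchanged."""
    capped = [min(w, cap) for w in weights]
    scale = sum(weights) / sum(capped)
    return [w * scale for w in capped]


weights = [1.0, 1.0, 1.0, 1.0, 6.0]          # hypothetical weights
trimmed = trim_weights(weights, cap=3.0)

print(unequal_weighting_effect(weights))     # 2.0
print(round(unequal_weighting_effect(trimmed), 4))
```

Trimming lowers the variance inflation factor, but only a comparison of bias and variance together can show whether the trimmed weights give a smaller mean squared error.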

For regression modeling one must consider whether coefficients are attributes of the study population or the underlying population. In design-based model fitting with inference to the study population, all three key features (i.e., weights, cluster sampling, and stratification) are needed, since failing to incorporate them, particularly cluster sampling, tends to understate $\mathrm{var}(c)$ and overstate significance from 0 in tests of coefficients (40). On the other hand, in regression analysis taking a purely model-based path to the underlying population, all three features can be ignored, provided one can justify the assumption that error residuals associated with the assumed model do not depend on the size of weights, or the cluster or stratum of membership. This assumption may work in less tightly defined study populations (e.g., in case-control studies) but is often unjustified in studies with more definitive inference destinations, thus necessitating some form of design accommodation. Some results suggest that at the very least weights should be used as a guard against model misspecification (7, 30). Others suggest that using cluster and stratum identifiers as control variables in the model may be useful in fitting the underlying model (22, 32).

CONCLUDING REMARKS

Regardless of the philosophical approach one takes to building the bridge between sample and study population, how the sample is designed, selected, and accounted for in analysis are important elements of any population-based epidemiological study. While theoretically based principles of sample design are well-established, we continue to examine the foundations of inference from the resulting samples.

In this paper we have described the basic elements of a sample design and how one goes about combining these elements into an effective sampling plan. In the process, we have seen that many of the decisions one makes in producing and dealing with the sample depend on the path of statistical inference and are therefore less clear-cut. While consensus on precisely how one reflects the design in analysis has not been reached, it is generally agreed that certain features of the sample design are relevant to the task, the sample weights in particular. As more is sought from the results of population-based studies, new insights will be needed on how best to learn from the complex sample designs used in epidemiology.

Literature Cited

  1. Brogan D. 1998. Pitfalls of using standard statistical software packages for sample survey data. In Encyclopedia of Biostatistics. New York: Wiley
  2. Carlson BL. 1998. Software for statistical analysis of sample survey data. Encyclopedia of Biostatistics. New York: Wiley
  3. Cassel CM, Sarndal CE, Wretman JH. 1977. Foundations of Inference in Survey Sampling. New York: Wiley
  4. Cochran WG. 1977. Sampling Techniques. New York: Wiley. 3d ed
  5. Cochran WG. 1953. Sampling Techniques. New York: Wiley
  6. Cohen SB. 1997. An evaluation of alternative PC-based packages for the analysis of complex survey data. Am. Stat. 51:285-292
  7. DuMouchel WH, Duncan GJ. 1983. Using sample survey weights in multiple regression analysis of stratified samples. J. Am. Stat. Assoc. 78: 535-543
  8. Folsom RE, Witt MB. 1994. Testing a new attrition nonresponse adjustment method for SIPP. Proc. Sec. Surv. Res. Methods, Am. Stat. Assoc., Vol. 1. Toronto, Canada
  9. Groves RM, Biemer PP, Lyberg LE, Massey JT, Nicholls WL, Waksberg J, eds. 1989. Telephone Survey Methodology. New York: Wiley
  10. Groves RM. 1989. Survey Errors and Survey Costs. New York: Wiley
  11. Hansen MH, Hurwitz WN, Madow WG. 1953. Sample Survey Methods and Theory, Vols. 1, 2. New York: Wiley
  12. Hansen MH, Madow WG, Tepping BJ. 1983. An evaluation of model-dependent and probability-sampling inferences in sample surveys (with discussion). J. Am. Stat. Assoc. 776-807
  13. Horvitz DG, Thompson DJ. 1952. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47: 663-685
  14. Kalsbeek WD, Cohen SB. 1978. Disproportionate Sampling in the National Medical Expenditure Survey. Proc. Soc. Stat. Sec. pp. 276-281. Am. Stat. Assoc.
  15. Kalton G. 1983. Models in the practice of survey sampling. Int. Stat. Rev. 51:175-188
  16. Kalton G. 1983. Compensating for Missing Survey Data. Ann Arbor, MI: University of Michigan
  17. Kelsey JL, Whittemore AS, Evans AS, eds. Methods in Observational Epidemiology. New York: Oxford University Press. 2nd ed.
  18. Kish L. 1965. Survey Sampling. New York: Wiley
  19. Kish L. 1987. Statistical Design for Research. New York: Wiley
  20. Korn EL, Graubard BI. 1991. Epidemiologic studies utilizing surveys: accounting for the sampling design. Am. J. Public Health. 81:1166-1173
  1. Brogan D. 1998. Pitfalls of using standard statistical software packages for sample survey data. In Encyclopedia of Biostatistics. New York: Wiley
  1. Last JM, ed. 1998. A Dictionary of Epidemiology. New York: Oxford University Press
  2. Lee ES, Forthofer RN, Lorimer RJ. 1989. Analyzing Complex Survey Data. London: Sage Publications
  3. Lessler JT, Kalsbeek WD. 1992. Nonsampling Errors in Surveys. New York: Wiley
  4. Levy PS, Lemeshow S. 1999. Sampling of Populations - Methods and Applications. New York: Wiley. 3rd ed.
  5. Little RA. 1991. Inference with survey weights. J. Official Stat. 7:405-424
  6. Lohr SL. 1999. Sampling Design and Analysis. Pacific Grove, CA: Duxbury Press
  7. Moheau. 1778. Recherches et considérations sur la population de la France. Republished 1912. Paris: Librairie Paul Geuthner
  8. Moser CA, Kalton G. 1972. Survey Methods in Social Investigation. New York: Basic Books. 2nd ed.
  9. Nathan G. 1988. Inference based on data from complex sample designs. In Handbook of Statistics, ed. PR Krishnaiah, CR Rao, Vol. 6. Amsterdam: North-Holland
  10. Pfeffermann D, Holmes DJ. 1985. Robustness considerations in the choice of method of inference for the regression analysis of survey data. J. R. Stat. Soc. 148:268-278
  11. Pfeffermann D, Smith TM. 1985. Regression models for grouped populations in cross-section surveys. Int. Stat. Rev. 76:681-689
  12. Pfeffermann D. 1996. The use of sampling weights for survey data analysis. Stat. Methods Med. Res. 5:239-261
  13. Pocock SJ. 1983. Clinical Trials: A Practical Approach. New York: Wiley
  14. Potter FJ. 1990. A study of procedures to identify and trim extreme sampling weights. Proc. Sec. Surv. Res. Methods, Am. Stat. Assoc., Washington, DC, pp. 225-230
  15. Raj D. 1968. Sampling Theory. New York: McGraw-Hill
  16. Rao JN, Bellhouse DR. 1990. History and development of the theoretical foundations of survey based estimation and analysis. Surv. Methodology 16:3-29
  17. Rothman KJ, Greenland S. 1998. Modern Epidemiology. Philadelphia, PA: Lippincott-Raven
  18. Särndal CE, Swensson B, Wretman J. 1992. Model Assisted Survey Sampling. New York: Springer-Verlag
  19. Scheaffer RL, Mendenhall W, Ott L. 1996. Elementary Survey Sampling. Belmont, CA: Duxbury Press. 5th ed.
  20. Scott AJ, Holt D. 1982. The effect of two-stage sampling on ordinary least squares methods. J. Am. Stat. Assoc. 77:848-854
  21. Sirken MG, Levy PS. 1974. Multiplicity estimation of proportions based on ratios of random variables. J. Am. Stat. Assoc. 69:68-73
  22. Skinner CJ, Holt D, Smith TM, eds. 1989. Analysis of Complex Surveys. New York: Wiley
  23. Stephan FF. 1948. History of the uses of modern sampling procedures. J. Am. Stat. Assoc. 43:12-39
  24. Timmreck TC. 1998. Introduction to Epidemiology. Sudbury, MA: Jones and Bartlett. 2nd ed.
  25. US Bureau of the Census. 1978. The Current Population Survey: Design and Methodology. Technical Paper 40. Suitland, MD: Bureau of the Census, US Department of Commerce
  26. Wolter KM. 1985. Introduction to Variance Estimation. New York: Springer-Verlag