For example, we sample food by taking a bite or two and then deciding if it is too hot or too cold or if it needs seasoning. Without sampling, we would need to eat all of the food before being able to decide what to do. Reading and viewing provide another example. You may decide to switch channels after watching a program for a few minutes. You may decide not to continue reading a novel after the first thirty pages or so.
The purpose of sampling is to make generalizations about the whole [the population] which are valid [accurate] and which allow prediction. If this spoonful needs salt, then it is likely that the other spoonfuls need it as well.
The sample, if it's a good one, must be representative of the whole. In order to achieve this, the sample should:
The nature of sampling will vary depending on whether the population is homogeneous or heterogeneous. For example, our bowl of soup may be heterogeneous if all of the solid stuff is at the bottom of the bowl and the liquid is at the top. Graduate students in library science could be quite homogeneous in some respects -- white, middle-class females -- or heterogeneous in others -- age.
We begin with a thoughtful and careful characterization of the population. The population is what we wish to be able to generalize to. Let's say that we wish to make generalizations about library science students. What is the population? Would SIS students be a reasonable sample? For example, the student population of SIS consists of graduate and undergraduate students. Graduate students may be local or distant. Is this a heterogeneous or a homogeneous population? Asking questions is helpful:
Larger populations will usually require a larger [more expensive] sample. However, larger populations with good samples will allow for larger and more powerful generalizations. The researcher determines what population is appropriate and then what sample is appropriate. Here is an example:
If the population is small and not difficult to capture, a census, which collects data on every unit in the population, may be best. In most cases, this is not feasible. However, there are obvious variables to consider:
Probability selection uses the fishbowl approach, where all the units in a population are thoroughly mixed and then units are drawn so that every unit has an equal chance of being drawn. This is the soundest method and the one recommended by most research methods texts. The key is that every unit in the population has an equal and known probability of being selected. There is no opportunity for bias or for selecting items according to some external purpose. A random number generator will easily create the list, or you could actually use the fishbowl.
There are problems. Creating a sampling frame -- a listing of all of the units in the population -- can be quite a challenge. A biased sampling frame will yield biased results. For example, using the telephone book will exclude everyone who is not listed in it, such as people with unlisted numbers. Drawing the sample can also be a problem, especially if some who are selected decide not to participate in the study. The time and money involved in getting a good sample are usually a problem, especially with human subjects.
With simple random sampling all units in the population are numbered and random numbers are used to select the sample.
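A minimal sketch of simple random sampling as described above may help. The function name `simple_random_sample`, the numbered frame of 1450 units, and the sample size of 100 are assumptions made for illustration only.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    frame = list(frame)
    if n > len(frame):
        raise ValueError("sample size cannot exceed the size of the sampling frame")
    rng = random.Random(seed)  # passing a seed makes the draw reproducible
    return rng.sample(frame, n)  # sampling without replacement


# Hypothetical sampling frame: units numbered 1 through 1450.
frame = list(range(1, 1451))
sample = simple_random_sample(frame, 100, seed=42)
```

Because the draw is made without replacement, no unit can appear in the sample twice. This is the electronic version of the fishbowl.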
With stratified random sampling, the population is divided into relatively homogeneous strata and then a simple random sample is drawn from each stratum. This should ensure that important subpopulations are included. It may also help to control for extraneous variables.
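Stratified random sampling might be sketched as follows. The sketch assumes proportionate allocation (the same sampling fraction in every stratum), which is only one possible choice. The function name, the library types, and the population counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # group the units by stratum
    sample = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population of 1000 librarians, stratified by type of library.
population = (
    [(i, "public") for i in range(600)]
    + [(i, "academic") for i in range(600, 900)]
    + [(i, "special") for i in range(900, 1000)]
)
drawn = stratified_sample(population, lambda unit: unit[1], 0.10, seed=1)
sizes = {name: len(members) for name, members in drawn.items()}
```

With a ten percent fraction, the sample contains 60 public, 30 academic, and 10 special librarians, so even the smallest stratum is represented.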
Disproportionate stratified random sampling draws an equal number of respondents from each stratum regardless of its percentage of the population. An example might be taking a sample of 50 respondents from each type of library regardless of the percentage of the population that each one represents.
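The disproportionate version can be sketched in the same way. The function name, the library types, and the stratum sizes below are hypothetical. Only the figure of 50 respondents per stratum comes from the example above.

```python
import random
from collections import defaultdict


def equal_allocation_sample(units, stratum_of, per_stratum, seed=None):
    """Draw the same number of units from every stratum, whatever its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    fractions = {}
    for name, members in strata.items():
        if len(members) < per_stratum:
            raise ValueError(f"stratum {name!r} has fewer than {per_stratum} units")
        sample[name] = rng.sample(members, per_stratum)
        fractions[name] = per_stratum / len(members)  # differs between strata
    return sample, fractions


# Hypothetical population: 600 public, 300 academic, and 100 special librarians.
population = (
    [(i, "public") for i in range(600)]
    + [(i, "academic") for i in range(600, 900)]
    + [(i, "special") for i in range(900, 1000)]
)
drawn, fractions = equal_allocation_sample(population, lambda unit: unit[1], 50, seed=1)
```

Here the 50 special librarians are half of their stratum while the 50 public librarians are about eight percent of theirs, which is what makes the sample disproportionate.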
Cluster sampling draws samples from natural groups of respondents [clusters]. A cluster could be a classroom, a city block, or even a country. This approach can save considerable time and effort by limiting the geographical area to be covered. Use cluster sampling when those to be sampled are available in clusters and where there is no list of individuals in the population, but there are lists of those associated with the cluster. Sick people and hospitals are a good example. Individuals within a cluster tend to be similar. You obtain less information if you interview the relatively similar five members of a single cluster than if you select one member from five different clusters.
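A rough sketch of one-stage cluster sampling follows, assuming that clusters are chosen at random and that every member of a chosen cluster is then included. The function name, hospital names, and patient identifiers are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each one.

    clusters maps a cluster name to the list of its members.
    """
    if n_clusters > len(clusters):
        raise ValueError("cannot select more clusters than exist")
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # a list of clusters is enough
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical clusters: patients listed by hospital.
hospitals = {
    "Hospital A": ["a1", "a2", "a3"],
    "Hospital B": ["b1", "b2"],
    "Hospital C": ["c1", "c2", "c3", "c4"],
    "Hospital D": ["d1"],
}
chosen = cluster_sample(hospitals, 2, seed=7)
```

Note that only the list of hospitals is needed to draw the sample. The patient lists are consulted only for the hospitals that are selected.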
Non-probability selection is any sort of sampling process which is not random [not all units of the population have an opportunity to be selected]. These methods are popular because they are more convenient and less expensive.
Systematic sampling is the best known of these methods. It requires a list of units. After the initial unit is selected randomly, the other units in the sample are selected by taking every nth unit on the list. It is easy and inexpensive to select units this way.
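Systematic selection can be sketched as shown here. The sketch assumes that the interval is the list length divided by the desired sample size, rounded down, and that the starting point is drawn at random from the first interval. The function name and the list of 1450 units are hypothetical.

```python
import random


def systematic_sample(units, n, seed=None):
    """Pick a random starting unit, then take every step-th unit on the list."""
    units = list(units)
    if not 0 < n <= len(units):
        raise ValueError("sample size must be between 1 and the length of the list")
    step = len(units) // n  # the sampling interval (the "nth" in every nth)
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start within the first interval
    return units[start::step][:n]


# Hypothetical list of 1450 numbered units, sampled down to 100.
units = list(range(1, 1451))
picked = systematic_sample(units, 100, seed=3)
```

With 1450 units and a sample of 100, the interval is 14, so the sample consists of a randomly chosen unit among the first 14 and every 14th unit after it.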
Convenience sampling involves selecting the units which are easiest to obtain. Typically, such samples do not represent the parent population because volunteers, friends, or those willing to participate are not typical. Selection error is the error that results from selecting the respondents who are most accessible and agreeable.
Bias is a systematic error in sampling that removes certain units from sample consideration. For example, interviewing women at home between 9:00 and 5:00 systematically excludes those who work outside the home during those hours.
If your research involves publication, you may be asked to justify your sampling method and the size of the sample. It is relatively easy to snipe at sample size decisions. No sample will mimic the population exactly. Large samples have less error than small ones [remember the law of large numbers]. There is usually some tension between the desire to have a large, persuasive sample and the cost associated with gathering that sample. Often, samples are between 200 and 1000 units although marketing research typically uses much smaller samples.
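The claim that large samples have less error than small ones can be checked with a small simulation. This is only an illustrative sketch: the population of 10,000 made-up scores, the sample sizes of 20 and 1000, and the function name are all assumptions.

```python
import random
import statistics


def mean_abs_error(population, n, repeats=200, seed=0):
    """Average gap between the sample mean and the population mean over many samples."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    errors = [
        abs(statistics.fmean(rng.sample(population, n)) - true_mean)
        for _ in range(repeats)
    ]
    return statistics.fmean(errors)


# Hypothetical population: 10,000 scores between 0 and 100.
rng = random.Random(1)
population = [rng.randint(0, 100) for _ in range(10_000)]

small = mean_abs_error(population, 20)    # typical error with samples of 20
large = mean_abs_error(population, 1000)  # typical error with samples of 1000
```

The typical error for samples of 1000 comes out well below the typical error for samples of 20, although neither is zero, since no sample mimics the population exactly.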
A variety of formulas may be used to guesstimate the number of units needed to have a reasonable sample. In a classic bibliometric study, Webb suggested scaling the sample by the size of the parent population. A ten percent sample would be needed for populations between 100 and 1000. A five percent sample would be needed for populations between 1001 and 2000. A one percent sample would be needed for populations greater than 2000. Yamane, in Statistics: an Introductory Analysis, suggests this formula: n = N / (1 + N * E^2), where n is the sample size, N is the size of the population, and E is the acceptable error rate. With a population of 1450 certified medical librarians and a one percent error rate, n = 1450 / (1 + 1450 x .0001) = 1450 / 1.145 = 1266. The same formula with a five percent error rate yields a sample size of 313 [22 percent of the population].
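Both rules of thumb are easy to compute. The sketch below is a hedged illustration: the function names are hypothetical, the Yamane result is rounded down to match the figures in the text, and rounding the Webb percentage up to a whole unit is an assumption.

```python
import math


def yamane_sample_size(population, error):
    """Yamane's formula n = N / (1 + N * E^2), rounded down as in the text."""
    return math.floor(population / (1 + population * error ** 2))


def webb_sample_size(population):
    """Webb's sliding percentage: 10%, 5%, or 1% depending on population size."""
    if population < 100:
        # The rule as quoted does not cover populations under 100.
        raise ValueError("rule is only stated for populations of 100 or more")
    if population <= 1000:
        percent = 10
    elif population <= 2000:
        percent = 5
    else:
        percent = 1
    return math.ceil(population * percent / 100)  # assumption: round up


one_percent = yamane_sample_size(1450, 0.01)   # 1266
five_percent = yamane_sample_size(1450, 0.05)  # 313
webb = webb_sample_size(1450)                  # 73
```

For the 1450 medical librarians, the two approaches give very different answers, which is one reason the choice of sample size needs to be justified rather than simply computed. Rounding the Yamane result up instead of down would be the more cautious choice.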
The more heterogeneous the population, the larger the sample needed. How do you know if the population is heterogeneous?
Too few people agree to participate in the study and you have a de facto convenience sample. It is important to consider how refusals and non-responses are handled.
There is no reasonable list of units or individuals to represent the population.
You do not know how those who responded differ from those who did not respond.
The population was not well defined. For example, are library users people who check out books? People who visit the library's website? Convenience sampling is often used and is heavily biased in favor of those who are current library users.