Probability sampling
In this practical we will practice carrying out three common probability sampling methods using Excel.
Rationale and theory
If you are confident with the theory and approaches to probability sampling then you can skip on to the exercises below. However, if you are not fully confident about these topics we strongly suggest reading the information below.
Key terminology: units of observation and analysis
Sampling unit: these are the entities or things that you sample for the study. These will often be individuals but could be anything, such as households or health facilities. The sampling unit should be fully defined in the description of the study’s sampling methods.
Unit of observation (also called study units or study subjects): these are the entities or things that you collect data from. Again, these will often be individuals but could be anything, such as households or health facilities. Note that you might sample “higher level” units, such as entire households, but collect data from “lower level” units, such as individuals within those households. The unit of observation should be fully defined by the study’s eligibility criteria.
Unit of analysis: the entity or thing that you are analysing in your study and therefore strictly aim to make statistical inferences about using the data collected. This will often be the same as the unit of observation, but it may differ depending on how you are analysing the data. For example, the unit of observation may be individuals, but the unit of analysis may be the household if you aggregate individual outcomes within each household (e.g. as a single mean or percentage value per household).
Overview of key concepts
Probability sampling
Broadly speaking, sampling is the process whereby we choose units of observation within our target population to collect data from. Therefore, the first stage of sampling, which should always occur very early on in the research and study design process, is when you clearly and explicitly define your target population. In summary, the only sampling methods that are guaranteed to produce unbiased statistical inferences, albeit only on average (or to use the more technical term “in expectation”, meaning in the long-run), are “probability sampling” methods. More loosely these are sometimes referred to as random sampling methods. This is because all frequentist statistical methods of analysis assume that the sample data have been randomly sampled from a given target population using some specific probability sampling method. Therefore, the only way to robustly satisfy this assumption is clearly to use some form of probability sampling method.
To take a probability sample you require a sampling frame, which is a list of all units of observation in the target population that you are aiming to generalise your results to (e.g. all individuals in a region, all primary care health facilities in a country etc). Therefore, the second stage of any sampling process is to obtain or create a sampling frame for your target population. Note: this excludes multi-stage cluster sampling methods, where you require a sampling frame for the first stage clusters, but at the second/subsequent stages you usually create the sampling frame via a mapping process, as none usually exist.
Note: for a probability sampling method to be valid it must be possible to calculate each unit of observation’s probability of being selected, which may or may not be equal for all units, and should not be zero for any unit. For example, if the target population contained 100 individuals and we randomly select 10 individuals via simple random sampling then the probability of being sampled would be 10/100 = 0.1 for each individual.
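As a quick check of this arithmetic, the following minimal Python sketch computes the selection probability for the example above and then approximates it by simulation. The function name, the seed and the number of repetitions are illustrative choices, not part of the practical.

```python
import random
from fractions import Fraction


def srs_selection_probability(population_size, sample_size):
    """Selection probability for each unit under simple random sampling
    without replacement: sample size divided by population size."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    return Fraction(sample_size, population_size)


exact = srs_selection_probability(100, 10)
print(exact, float(exact))  # 1/10 0.1

# Empirical check: how often is unit 0 selected across many repeated samples?
rng = random.Random(2024)  # arbitrary seed so the run is repeatable
repeats = 20_000
hits = sum(0 in rng.sample(range(100), 10) for _ in range(repeats))
estimated = hits / repeats
print(round(estimated, 2))
```

The simulated proportion should sit close to 0.1, which is what "in the long-run" means here: the probability describes what happens over many hypothetical repeated samples.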
Non-probability sampling
While desirable it is clearly often not feasible to use probability sampling methods, usually because of the lack of a suitable sampling frame. For example, if you are studying patients attending a healthcare facility over a given period and you need to collect data on sampled patients on the day they attend the facility it’s typically impossible to construct a sampling frame because you won’t know who will be visiting each day! Therefore, you may often need to resort to non-probability sampling methods such as consecutive sampling, while minimising the opportunity for any researcher sampling bias, and hope that your samples are representative of your target population.
Practice
Exercise 1: Simple random sampling
A brief overview of the method
Simple random sampling involves taking a random sample of a given size from a sampling frame, which results in each unit of observation having an equal probability of being selected. You can take a simple random sample from a sampling frame in various ways, such as by generating random numbers that correspond to IDs that are pre-allocated to all units of observation, or you can do it by simply randomly sorting the list and just selecting the first n units of observation, where n is your sample size. We will use this second approach as it’s easy to implement in Excel.
Advantages and disadvantages
Advantages
Easy to implement and explain.
Requires minimal data on your target population units of observation: just a list of all units of observation, plus usually some way of identifying/contacting them once selected.
Statistically efficient: you maximise the precision/power of your inferential analyses when you use simple random sampling.
Disadvantages
Although simple random sampling guarantees that you will select a sample that is representative (i.e. unbiased) of your target population on average/in the long-run, i.e. over many hypothetically repeated samples, it is not the best approach at achieving this goal for any given sample, and obviously in practice you typically only ever take one sample! This is especially the case when the population is heterogeneous (highly varied in the characteristics/relationships of interest) and/or if sample size is small, say in the tens or low hundreds rather than the high hundreds or thousands (as the sample size increases the chances of getting an unrepresentative sample decrease). As these two situations are often true it is often better to use stratified random sampling instead, if possible. Other approaches to deal with this issue also exist but are beyond the scope of this module.
If your sampling units cover a large geographical area simple random sampling can produce a very logistically inefficient, costly, and geographically spread-out sample. See cluster sampling for a possible solution to this problem.
Scenario
You work for a district governmental health department and you have been tasked with assessing the capacity and service delivery characteristics of the public primary care facilities across the whole district. Specifically, you need to report the typical staffing levels, resources and equipment levels, and service delivery levels for all public primary care facilities (e.g. the mean no. drs and nurses per facility).
However, while there are 246 such facilities within your district you only have resources to survey 50 facilities (if you could survey all 246 it would be a “census” not a sample). You run a sample size calculation that indicates a sample size of 50 will be sufficient to provide usefully precise, albeit fairly rough, estimates of these characteristics at the district level (note a sample size of 50 is typically going to be too small for a real study, but we are only practicing sampling here so it doesn’t matter). The target population to which you want to be able to generalise your results is therefore all 246 facilities.
Luckily there is an existing comprehensive list (i.e. sampling frame) of all 246 existing public primary care facilities in your district. You have a copy of this list, in the form of an Excel spreadsheet, that includes facility names, addresses and telephone numbers, plus additional information that we will use in subsequent exercises.
Our aim is therefore to take a simple random sample of 50 facilities from the list.
Exercise: taking a simple random sample using Excel
- Go into the “Datasets” folder that you should have moved to a suitable folder on your computer and load the “Health facilities list - simple random sample.xlsx” Excel spreadsheet. If you haven’t downloaded the datasets for the computer practical sessions and moved them to a suitable folder go to Datasets.
Video instructions: taking a simple random sample using Excel
Written instructions: taking a simple random sample using Excel
In the “Health facilities list - simple random sample.xlsx” Excel spreadsheet you will see it has various self-explanatory columns/variables including “facility_name”, “facility_address”, and “facility_tel”.
In column A (the blank column immediately to the left of the “facility_name” column) enter the following (or similar) word as a heading: random_no
Immediately under this new column heading, click on cell A2 (the first row where the facility data starts), type =rand() and press enter. This Excel function generates a random number greater than or equal to 0 and less than 1 (how many decimal places you see depends on the cell’s formatting).
Then simply ensure that cell A2 is selected (i.e. you’ve clicked on it) and then just double click on the small solid square at the very bottom right of this cell. This should copy and paste the function all the way down to the end of the facility data.
Next click on the “random_no” column heading (cell A1) and then in the menu “ribbon” click on Data and click the Filter tool. You should see little “drop-down” menu buttons appear in the right of each column heading. Click on the drop-down menu button in the “random_no” column heading (cell A1) and select Sort Smallest to Largest.
This will immediately sort all rows of data by their random number, from the row with the smallest random number to the row with the largest. Therefore, the list will now be randomly sorted! Note: the random numbers will all change immediately after sorting so they will no longer appear to be in order. This is because they are functions and get re-calculated each time you change anything. However, this doesn’t matter because once the data have been sorted based on the original random numbers we don’t need those original values anymore.
Next, look at the bottom left of the window and you should see a tab titled Sheet1. This is the current worksheet. Click on the little + symbol to the right of this sheet to create a new worksheet. This will automatically be given the name “Sheet2”.
Now click back on the Sheet1 worksheet. You can now simply copy the details of the first 50 facilities in the newly randomly sorted list and paste them into the Sheet2 worksheet. This is now your simple random sample of health facilities. If this was a real study you could then use the contact information to recruit and plan your data collection.
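For readers who want to see the same "random sort, then take the first n" logic outside Excel, here is a minimal Python sketch. It is not part of the practical: the facility names are hypothetical placeholders and the seed is only there to make the run repeatable.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Attach a random number to every unit, sort by it, keep the first n."""
    rng = random.Random(seed)
    # Equivalent of filling the random_no column with =RAND()
    keyed = [(rng.random(), unit) for unit in frame]
    # Equivalent of Sort Smallest to Largest on the random_no column
    keyed.sort(key=lambda pair: pair[0])
    # Equivalent of copying the first n rows to a new worksheet
    return [unit for _, unit in keyed[:n]]


# Hypothetical sampling frame of 246 facilities
frame = [f"Facility {i}" for i in range(1, 247)]
sample = simple_random_sample(frame, 50, seed=1)
print(len(sample))  # 50
```

Because every ordering of the list is equally likely, each facility has the same 50/246 chance of ending up in the first 50 rows.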
Exercise 2: Stratified random sampling
A brief overview of the method
In summary, stratified random sampling involves the following three main steps.
- Define your strata. Strata are simply a set of two or more mutually exclusive and comprehensive groups that cover all your sampling frame’s units of observation. This just means that every unit of observation in your sampling frame is a member of one and only one stratum. For example, if we were sampling individuals and had data on their ages we could stratify them (i.e. the sampling frame) based on age, most simply by splitting all individuals into two age groups, say those aged <18 and those aged 18 years or more. We will see below what to consider when selecting strata.
Note: the singular of strata is stratum, e.g. “we create many strata but sample each stratum separately”. Note: you can define n strata for any single stratification variable, and your total strata will be the product of the number of strata created for each variable. For example, if you stratify based on age, grouped into <18s and ≥18s, and sex, grouped into male or female, you have 2 x 2 = 4 strata in total. As you can see the total number of strata therefore increases rapidly with every extra stratification variable and/or group added! Note also: you cannot create a stratum that contains no units of observation. For example, if there were no <18 men in your sampling frame you could not create an <18-male stratum as the analysis would not work.
- Decide on the sample size for each stratum. There are two different versions of stratified random sampling: one that uses “proportionate stratification” and one that uses “disproportionate stratification”.
Proportionate stratification means that the sizes of your sample’s strata are proportional to the size of the strata in the target population. For example, using the example above of stratifying by age with two groups of <18 and ≥18: if 25% of the target population were aged <18 (and therefore 75% are aged ≥18) whatever your sample size was 25% of the sample size would come from your <18 stratum and 75% from your ≥18 stratum. This would result in a representative distribution of ages in your sample and preserve the equal probability of selection for all units of observation in your sampling frame.
Disproportionate stratification is when the size of your sample’s strata is not proportional to their size in the target population. If this is the case then the units of observation in your sampling frame no longer have an equal probability of selection. This means you would have to calculate sampling weights to “map” the sample back to the target population and avoid biased results when analysing the data.
- Take a simple random sample (or less commonly a systematic random sample) of the relevant size in each stratum.
So we can use the same approach as we took above for the simple random sample and just repeat it within each stratum the appropriate number of times.
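The first two steps can be sketched in a few lines of Python. This is only an illustration of the age and sex examples used above: the population counts are hypothetical, chosen so that 25% of the population is aged <18.

```python
from fractions import Fraction
from itertools import product

# Step 1: strata are every combination of the stratification groups.
age_groups = ["<18", ">=18"]
sexes = ["male", "female"]
strata = list(product(age_groups, sexes))
print(len(strata))  # 4

# Step 2: proportionate allocation for the age-only example.
# Hypothetical population counts (25% aged <18, 75% aged >=18).
population = {"<18": 250, ">=18": 750}
sample_size = 100
total = sum(population.values())
allocation = {s: round(sample_size * n / total) for s, n in population.items()}
print(allocation)  # {'<18': 25, '>=18': 75}

# With proportionate allocation every unit has the same selection probability.
selection_probability = {s: Fraction(allocation[s], population[s]) for s in population}
print(selection_probability["<18"] == selection_probability[">=18"])  # True
```

The equal selection probabilities in the last line are the reason a proportionately stratified sample does not need weighting.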
Advantages and disadvantages
Advantages
The advantages and uses of stratified random sampling differ somewhat depending on whether you are using a proportionate or disproportionate stratification, and the reason for doing either depends on your goals and skills.
- For studies aiming to describe the characteristics of a target population, where you have no particular interest in any specific sub-populations (compare to scenario 2 below), stratified random sampling with proportionate stratification can, compared to simple random sampling, help you to: a) reduce the chances of obtaining an unrepresentative sample, at least in terms of the characteristics represented by your chosen strata, and b) increase the precision of your estimates for a given sample size. Similarly, for studies aiming to estimate relationships within a target population, where you have no particular interest in any specific sub-populations, stratified random sampling can, compared to simple random sampling, help you to increase the precision with which you can estimate your relationships of interest for a given sample size, giving you a more “statistically efficient” sample.
How well you achieve these goals depends on how well your chosen strata capture characteristics or variables that account for variation in your outcomes of interest. That is, you want units of observation to be as similar (homogeneous) as possible within strata and, on average, as dissimilar (heterogeneous) as possible between strata. For example, if we are interested in estimating rates of cardiovascular events then stratifying by age makes a lot of sense, because age is one of if not the biggest “causes” of cardiovascular events, i.e. the likelihood of having experienced a cardiovascular event will be quite similar for individuals within a young-age stratum and very different for those individuals compared to individuals in a geriatric-age stratum.
Note: using proportionate stratification is not actually necessary to achieve these goals, but it results in a sample that does not need reweighting during analysis to avoid biased results, and calculating weights is complicated. In addition, there is typically no good reason to use disproportionate stratification in this case (again see scenario 2 below).
- In other situations however you may be particularly interested in specific sub-populations. In this case you can use stratification to “oversample” those sub-populations to ensure you have enough sample size to estimate characteristics/relationships for those sub-populations (strata) with sufficient precision/power. For example, you may wish to ensure you can estimate certain characteristics or relationships within a certain small, ethnic minority group with sufficient precision. With a simple random sample you would, on average, take a sample from the ethnic minority group that was proportional to its population size. For example, if the ethnic minority group were just 1% of the target population and you took a sample of 1000 individuals from the target population then on average you would only sample 10 individuals from the ethnic minority group! Hardly much use. Instead you could create strata for each ethnic group and take a fixed, larger (disproportionate) sample from the relevant ethnic minority group than you would take if you were using simple random sampling. This would be using disproportionate stratification.
- However, as noted earlier if you do this the added complication is that the relative sizes for one or more strata will then, by design, not match their relative sizes in the target population. This means there is not an equal probability of selection for all sampling units, and you would have to calculate and use sampling weights to “map” the sample back onto the target population and avoid obtaining biased results. As this is a more complicated process and this type of stratified sampling is not commonly used (although a form of it is commonly used in multi-stage cluster sampling) we will not look at it further. Note: when using disproportionate stratification you would still be able to gain the advantages mentioned for scenario 1 above if your strata, either those that are disproportionately sampled or indeed other strata, capture important sources of variation within your outcomes of interest.
In the following exercise we will just look at how to implement the first approach discussed above.
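Although disproportionate stratification is not practised here, the oversampling example above and the idea of sampling weights can be illustrated with a short Python sketch. It assumes the usual definition of a sampling weight as the inverse of a unit's selection probability; the population size of 100,000 and the 200/800 allocation are hypothetical numbers chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical target population in which 1% belong to the minority group.
population = {"minority": 1_000, "majority": 99_000}
total_population = sum(population.values())
sample_size = 1_000

# Expected minority count under simple random sampling.
expected_minority = sample_size * population["minority"] / total_population
print(expected_minority)  # 10.0

# Hypothetical disproportionate allocation that oversamples the minority stratum.
allocation = {"minority": 200, "majority": 800}
selection_probability = {s: Fraction(allocation[s], population[s]) for s in population}

# Assumed weight definition: inverse of the selection probability.
weights = {s: 1 / p for s, p in selection_probability.items()}
print(float(weights["minority"]), float(weights["majority"]))  # 5.0 123.75

# Each sampled unit "stands for" weight-many population units, so the
# weighted sample sizes recover the stratum population sizes.
weighted_totals = {s: allocation[s] * weights[s] for s in population}
print(weighted_totals["minority"], weighted_totals["majority"])  # 1000 99000
```

The unequal weights show why an unweighted analysis of such a sample would over-represent the minority group in any overall estimate.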
Disadvantages of stratified random sampling with proportionate stratification
You require data on the characteristics needed to define your strata for all members of your target population, and it is often impossible, difficult and/or costly to obtain this data.
Compared to simple random sampling it is a somewhat more complicated and time consuming process (although this is typically a minor limitation).
When analysing data from a stratified random sample you need to use non-standard analytical methods (or non-standard versions of typical analytical methods) to account for and obtain the benefits of your stratified sample, in terms of increased precision. However, this is actually very straightforward to do with modern software and we will see how to do this in the complex survey practical sessions.
Scenario
We will use the same basic scenario as for the simple random sampling exercise previously, where we are aiming to conduct a survey of n = 50 public primary care health facilities. However, in this exercise we will take a stratified random sample. We will assume that there is likely to be substantial variation in at least some of our outcomes of interest between facilities in different sub-districts. For example, maybe some sub-districts have typically larger populations and so the facilities in those sub-districts experience a greater burden of patients leading to typically poorer outcomes of interest. To increase our chances of getting a representative sample for the target population we can therefore stratify the sampling by sub-district to ensure that no sub-district is over or under represented in the sample relative to its size. As we have the sub-district location of each facility in our sampling frame we can easily stratify our sampling by sub-district.
In the scenario/data there are just three sub-districts and they each contain quite different numbers of health facilities. We will assume that we are only interested in computing overall, district-level results for any outcomes. We will therefore need to split our sample size into three (i.e. so we have separate sample sizes for each sub-district) such that each sub-district’s sample size is proportional to the number of facilities in that sub-district. This will ensure that our results apply to the overall district level (if we used an equal sample size for each sub-district then the sub-districts with fewer health facilities in would be over-represented in the overall results, and vice versa).
Note: in a real study if you had data on additional characteristics that you thought were likely to be related to variation in the outcomes of interest you would probably further increase your chances of getting a representative and more statistically efficient sample by creating additional strata using those data.
Exercise: taking a proportionate stratified random sample using Excel
- Load the “Health facilities list - stratified random sample.xlsx” Excel spreadsheet.
Video instructions: taking a proportionate stratified random sample using Excel
Written instructions: taking a proportionate stratified random sample using Excel
If we know there is substantial variation in health facility characteristics on average between sub-districts we may want to reduce the chances of getting an unrepresentative number of sampled facilities for one or more sub-districts, which is likely to happen with a simple random sample when the sample size is very small (like n = 50), and to increase the precision of our estimates (for a given sample size). As mentioned previously in reality we would probably want to create strata based on additional variables related to our outcomes, maybe things like some measure of facility size, staffing, or resources etc, but we will just keep things simple here.
Therefore, instead of risking getting an unrepresentative distribution (number) of health facilities within each sub-district and less precise estimates, as would be likely with a simple random sample, we can take a stratified random sample. Here we will see how to take a stratified sample such that the strata sample sizes are proportional to their relative population sizes. This ensures that the sample represents the distribution of strata seen in the population, which means we don’t need to weight our analyses.
In the “Health facilities list - stratified random sample.xlsx” Excel spreadsheet look at the Full facility list worksheet if it’s not already in view (you can change worksheet by clicking on the tabs at the bottom left of the window). Now look at the table titled Sub-district frequency table at the upper right of the spreadsheet. First we need to work out the proportion of facilities in each sub-district by dividing the number of facilities in each sub-district by the total number of facilities. Under the Population proportion heading click on the first empty cell for the Hills sub-district. Now enter = and then click on the Population frequency value for Hills (22), then type /, then click on the Population frequency total (at the bottom of the column: 245). The resulting formula should be: =G3/G6. Press enter to tell Excel to run the computation. The resulting value should be 0.089…
Repeat this for the other sub-districts. If you’ve done it correctly the total value under the Population proportion column should equal 1.
Next we need to use these proportions to compute the sample size required for each stratum to keep the strata sample sizes proportional to the strata population sizes. Under the Strata sample size column click on the first empty cell for the Hills sub-district. Now enter = then type ROUND(50*, then click on the cell containing the Population proportion value for Hills that you previously computed (0.089…), then type , 0). The final full function should be =ROUND(50*H3, 0). Press enter to tell Excel to run the computation. The resulting value should be 4 (50 x 22/245 is approximately 4.49, which rounds down to 4).
Let’s explain the function you just used =ROUND(50*H3, 0). ROUND works by rounding the first value in the brackets to the number of decimal places listed after the comma. So we’ve told Excel to round our computed sub-district sample size (50 x H3 - where H3 is the proportion of health facilities in the population in that sub-district) to 0 decimal places - i.e. to the nearest integer.
Now repeat this computation for the other two sub-districts. If you’ve done it correctly the total value under the Strata sample size column should equal 50.
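The same allocation arithmetic can be sketched in Python as a cross-check. Only the Hills figures (22 of 245 facilities) come from the worksheet described above; the other two sub-district names and counts are hypothetical placeholders chosen to sum to 245, so your own spreadsheet values will differ. The helper mimics how Excel's ROUND treats positive numbers, because Python's built-in round() handles exact halves differently.

```python
import math


def excel_round_to_integer(value):
    """Round a non-negative number to the nearest integer with halves rounded
    up, which is how Excel's ROUND(value, 0) treats positive numbers.
    (Python's built-in round() rounds exact halves to the nearest even integer.)"""
    return math.floor(value + 0.5)


# Hills is taken from the exercise; the other two entries are hypothetical.
population_frequency = {"Hills": 22, "Sub-district B": 98, "Sub-district C": 125}
total_sample_size = 50

total = sum(population_frequency.values())
proportions = {name: count / total for name, count in population_frequency.items()}
strata_sample_sizes = {
    name: excel_round_to_integer(total_sample_size * p)
    for name, p in proportions.items()
}
print(strata_sample_sizes)  # {'Hills': 4, 'Sub-district B': 20, 'Sub-district C': 26}
print(sum(strata_sample_sizes.values()))  # 50
```

Rounding each stratum separately does not always add up to the intended total, which is why checking the column total against 50 is a worthwhile step.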
Next we can take our stratified random sample. As a stratified random sample is just a repeated simple random sample within each stratum all you now need to do is take a simple random sample of the relevant size for each sub-district, i.e. a simple random sample of 4 health facilities in the Hills sub-district and so on. In Excel it’s probably easiest to just create separate worksheets for each stratum. To save you time, because it’s not something you probably need to practice, I’ve already created these three additional worksheets (the tabs along the bottom of the Excel spreadsheet): one containing the data for each sub-district.
Therefore, create a new worksheet by clicking on the little + symbol to the right of the Hills worksheet. Then repeat the simple random sample process learned in the last exercise for each of the sub-district lists in turn. After randomising each list select the required number of facilities for that sub-district (starting from the top of the randomised list) and copy their details into your sample worksheet. Refer back to that exercise for the steps if needed. Note: you could also achieve a stratified random sample with proportional strata sizes by just randomising the total list of all health facilities and then selecting the first n health facilities from each sub-district as they appear in the randomised order, where n = the required sample size for each stratum. However, that would be more time consuming.
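The per-stratum procedure can also be sketched in Python. This is a minimal illustration rather than the method used in the practical: the function name is made up, and apart from the Hills count of 22 the sub-district names, counts and sample sizes are hypothetical placeholders.

```python
import random
from collections import defaultdict


def stratified_random_sample(frame, stratum_of, sample_sizes, seed=None):
    """Take a simple random sample of the required size within each stratum.

    frame        -- list of units (the sampling frame)
    stratum_of   -- function returning the stratum label of a unit
    sample_sizes -- dict mapping each stratum label to its sample size
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit in frame:
        by_stratum[stratum_of(unit)].append(unit)

    sample = []
    for label, n in sample_sizes.items():
        # rng.sample draws n distinct units; every subset of size n is equally likely
        sample.extend(rng.sample(by_stratum[label], n))
    return sample


# Hypothetical frame of (facility name, sub-district) pairs.
counts = {"Hills": 22, "Sub-district B": 98, "Sub-district C": 125}
frame = [
    (f"{name} facility {i}", name)
    for name, count in counts.items()
    for i in range(1, count + 1)
]
sample_sizes = {"Hills": 4, "Sub-district B": 20, "Sub-district C": 26}

sample = stratified_random_sample(frame, lambda unit: unit[1], sample_sizes, seed=7)
print(len(sample))  # 50
```

Grouping the frame by stratum plays the role of the separate sub-district worksheets, and the draw within each group is the repeated simple random sample.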
Exercise 3: Systematic random sampling
A brief overview of the method
If you can reasonably assume that your sampling frame is randomly ordered with respect to your outcomes of interest and any causal factors that are related to your outcomes of interest then systematic random sampling should produce a similarly unstructured random selection as simple random sampling, and you can use the same “versions” of typical analytical methods that can be used when analysing data from a simple random sample. If your sampling frame is grouped in relation to a categorical characteristic that is related to your outcomes of interest, or is ordered in relation to a continuous/discrete characteristic that is related to your outcomes of interest, then you can use it to obtain a random but similarly structured sample to that obtained via proportionate stratified sampling. To gain the increased precision that can come from analysing a stratified random sample you would have to use relevant stratified analysis methods.
Take your sampling frame and select a random starting point (i.e. random unit of observation).
Based on your desired sample size calculate a “skip pattern”. This is just a number which then determines how many units of observation are skipped after your starting point before sampling another unit.
Sample the unit at your random starting point and then, following your skip pattern, every successive unit of observation that the pattern “lands on” until you reach the end of your sampling frame, by which time you should have sampled your desired sample size (or just under - see the exercise).
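The three steps above can be sketched in Python as follows. Two details are assumptions on my part rather than something stated in the text: the skip pattern is taken as the frame size divided by the desired sample size, rounded up, and the random starting point is drawn from the first "skip pattern" units of the frame. The frame of 245 numbered units is hypothetical.

```python
import math
import random


def systematic_random_sample(frame, sample_size, seed=None):
    """Pick a random start, then take every interval-th unit to the end of the frame."""
    rng = random.Random(seed)
    # Assumed skip pattern: frame size / desired sample size, rounded up
    interval = math.ceil(len(frame) / sample_size)
    # Assumed starting rule: a random unit among the first `interval` units
    start = rng.randrange(interval)
    return frame[start::interval]


frame = list(range(1, 246))  # hypothetical frame of 245 numbered units
sample = systematic_random_sample(frame, 50, seed=3)
print(len(sample))  # 49
```

With 245 units and a skip pattern of 5 the sample contains 49 units whichever start is drawn, which illustrates how the achieved sample size can fall just under the target.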
Advantages and disadvantages
Advantages
Systematic random sampling can be useful when sampling physical units that are ordered when you have no sampling frame, but you know (at least approximately) how many units there are. For example, if you want to sample n refugee tents in a camp that are ordered in rows but there is no list of all tents, you could use a drone to take an aerial photo and estimate the total number of tents (e.g. count a few rows to get a row average and multiply by the total number of rows). You could then pre-plan a sampling route through the camp and apply the method, taking the first tent as your first sampling frame unit and so on.
Another advantage that systematic random sampling has over simple random sampling is that it can reduce the chances that you will obtain a non-representative sample in relation to a characteristic of interest (e.g. an outcome of interest or a characteristic that is known to be causally related to your outcome of interest). For categorical characteristics this has the same result as for proportionate stratified sampling, but if there are many (often small) category levels (i.e. strata), it can be quicker and easier to use systematic random sampling, because all you need to do is order by that characteristic and then take your systematic random sample, and you will get (at least approximately) the proportionate sample size for each category level.
This also works for numerical characteristics, where stratified sampling cannot necessarily be applied easily or would force you to crudely group the sampling frame based on cut points. Here, all you need to do is order the sampling frame in relation to the numerical characteristic that is either your outcome of interest, or is assumed to be related to your outcomes of interest, and then take your systematic random sample. You will then ensure that your sample has a representative distribution of that numerical characteristic. For example, if the number of doctors in a health facility is likely to be related to the health facility outcomes we are interested in and we have data on those doctor numbers we can use systematic random sampling to ensure that we get a sample that has a distribution of doctor numbers that matches the distribution in the sampling frame/target population.
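To illustrate the ordering idea, here is a minimal Python sketch using made-up data: 245 hypothetical facilities with between 1 and 20 doctors each, a skip pattern of 5, and a random start among the first 5 units. None of these numbers come from the exercise dataset.

```python
import random
from statistics import mean

rng = random.Random(11)  # arbitrary seed so the run is repeatable

# Hypothetical sampling frame: facility id and number of doctors.
frame = [{"id": i, "no_drs": rng.randint(1, 20)} for i in range(1, 246)]

# Order the frame by the numerical characteristic, then sample systematically.
ordered = sorted(frame, key=lambda facility: facility["no_drs"])
interval = 5  # assumed skip pattern
start = rng.randrange(interval)  # assumed random start among the first 5 units
sample = ordered[start::interval]

frame_mean = mean(facility["no_drs"] for facility in frame)
sample_mean = mean(facility["no_drs"] for facility in sample)
print(len(sample))  # 49
print(round(frame_mean, 2), round(sample_mean, 2))
```

Because one facility is taken from each consecutive block of five in the ordered list, the sample is spread across the whole range of doctor numbers and its mean stays close to the frame mean.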
Disadvantages
Systematic random sampling is a bit more complicated to implement than simple random sampling.
As with stratified random sampling you need additional data on important characteristics of your units of observation that are related to the characteristics or relationships of interest to obtain a clear benefit from this approach, and such data is often difficult, costly or impossible to obtain.
Most critically, if there is a repeating pattern to the ordering of your sampling frame in terms of the distribution of characteristics that affect your outcomes of interest then a systematic random sample may produce a seriously biased/unrepresentative sample. This is usually the case when the pattern repeats regularly. For example, if you are using systematic random sampling to sample from a list of health facilities that are ordered from small to large within each area, and your sample size is such that you’re only selecting one or a few per area, the pattern might ensure that you only select typically smaller/larger facilities in each area. Be careful!
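The danger is easy to demonstrate with a deliberately extreme, hypothetical frame: 10 areas that each list 5 facilities from smallest (size 1) to largest (size 5), sampled with a skip pattern of 5. This Python sketch is an illustration only.

```python
# Hypothetical frame: (area, size) pairs, ordered small to large within each area.
frame = [(area, size) for area in range(1, 11) for size in range(1, 6)]
interval = 5  # skip pattern equal to the number of facilities per area

sizes_by_start = {}
for start in range(interval):
    sample = frame[start::interval]
    # Which facility sizes appear in the sample for this starting point?
    sizes_by_start[start] = sorted({size for _, size in sample})

print(sizes_by_start)  # {0: [1], 1: [2], 2: [3], 3: [4], 4: [5]}
```

Whatever the random start, every sampled facility has the same size, so any single sample misrepresents the mix of small and large facilities.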
When to use systematic random sampling?
Clearly systematic random sampling can have some advantages over simple random sampling if used carefully, but what about in relation to stratified random sampling? Remember stratified random sampling can use either simple or systematic random sampling to take the samples within each stratum. Therefore, it depends on your aims and there’s no single answer for all circumstances. However, broadly speaking if you have good data on what are likely to be important strata for your characteristics or relationships of interest then a stratified random sample may be the better choice. This is because, compared to simple random sampling, stratified random sampling is more likely to produce a representative sample and to increase the precision of your estimates (i.e. increase your statistical efficiency), while, compared to systematic random sampling, it avoids the risk of producing a biased sample that can arise when there are unrecognised, typically small-scale, meaningful repeating patterns in the ordering of the sampling frame.
However, for completeness we will practice below how to take a systematic random sample when we have some additional data on important strata.
Lastly, note that one form of systematic random sampling is often used in the first stage of multi-stage cluster samples to take a non-stratified random sample of primary-stage clusters (often villages or city blocks) with probability proportional to the size of primary-stage clusters. This involves a slight modification of the approach we will see below, and as it’s typically only used in this specific circumstance we won’t look at it further.
Scenario
We will use the same basic scenario as for the simple random sampling and stratified random sampling exercises previously, where we are aiming to conduct a survey of 50 public primary care health facilities. However, in this exercise we will take a systematic random sample. We will use the stratified random sampling exercise sampling frame where the health facilities were ordered into sub-district groups. As long as there is no smaller scale pattern in the ordering of the list a systematic random sample should ensure a random sample of health facilities while also ensuring the sample is evenly distributed across the three sub-districts.
Exercise: taking a systematic random sample using Excel
- Load the “Health facility list - systematic random sample.xlsx” Excel spreadsheet.
Video instructions: taking a systematic random sample using Excel
Written instructions: taking a systematic random sample using Excel
Read/hide
To save time we have already created a numerical sequence in the first column from 1 to 245. This will be used to identify our sampled health facilities.
First, order the list by the variable “no_drs”: click anywhere on the values and from the top menu click Data then Sort. In the sort tool that appears under Column where it says Sort by click the drop-down menu and select no_drs. Then click OK. The list should now be ordered by the number of doctors per health facility, and our systematic random sample will now ensure we get a representative sample across the range of numbers of doctors per health facility.
Next, calculate the skip pattern k. This is calculated as N/n, where N = total sampling frame size and n = sample size. If the result is a whole number (integer) use this value, or if not round up to the nearest whole number. Therefore, in cell G1 enter skip and press enter. Then in cell G2 enter =245/50 and press enter. You should get the value 4.9. Therefore, we round this up so our skip pattern k = 5.
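For readers who want to check the arithmetic outside Excel, the same calculation can be sketched in Python. The values of N and n come from the exercise; the variable names are our own.

```python
import math

N = 245  # total sampling frame size (from the exercise)
n = 50   # target sample size (from the exercise)

raw_interval = N / n         # 4.9, not a whole number
k = math.ceil(raw_interval)  # round up to the next whole number

print(f"N/n = {raw_interval}, skip pattern k = {k}")
```

math.ceil leaves a whole-number result unchanged, so the same two lines cover both cases described above.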
Next we need to select a random starting point between the first unit of observation and that corresponding to our skip pattern k, i.e. a random number between 1 and 5. To do this use the Excel function randbetween. In cell H1 enter the heading facility_selection_id. Then click on cell H2 and enter =randbetween(1, 5) and press enter. This will create a random value between 1 and 5 for the random starting point. Next overwrite this value with the same value by clicking on the cell and typing the number corresponding to the random value (e.g. if the random value is 3 click on the cell and type 3). This ensures that Excel won’t create a new random value each time you update any other functions.
Then click on cell H3 and type =H2+5. Then press enter. Then click again on cell H3 and double click on the small solid square at the bottom right of the selection box that appears around cell H3 to extend the formula down to the bottom of the data. This new column then lists the IDs of the facilities that our systematic random sample has selected. Note that for this to work the original facility ID must be a count from 1 to N, where N is the total number of facilities in the sampling frame. Note also that the facility_selection_id values keep going beyond the maximum value of the original facility ID.
Therefore, click on the little + symbol to the right of the Sheet1 tab to create a new worksheet (Sheet2). Then click back to Sheet1. Now copy the values in the facility_selection_id column, from the first value to the greatest value that is ≤245, and paste them into Sheet2. This is your sample list of facility IDs, which should be both a random sample and implicitly stratified by number of doctors. To obtain the details of each sampled facility you would then have to either go back to the original list and copy them across, as per the sampled IDs, or use the MATCH function in Excel to match up values, but we do not look at that here.
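The selection steps so far (random start, adding the skip pattern repeatedly, and keeping only IDs up to 245) can be sketched in Python as follows. This is an illustrative equivalent of the spreadsheet steps, not part of the exercise, and the variable names are our own.

```python
import random

N = 245  # sampling frame size (from the exercise)
k = 5    # skip pattern, rounded up from 245/50

# random.randint includes both end points, like RANDBETWEEN(1, 5) in Excel.
start = random.randint(1, k)

# start, start + k, start + 2k, ... stopping at the greatest value <= N.
selected_ids = list(range(start, N + 1, k))

print(f"random start = {start}")
print(f"number of facilities selected = {len(selected_ids)}")
print(f"first five IDs = {selected_ids[:5]}")
```

Because the start is stored in a variable it does not change as the rest of the code runs, which plays the same role as overwriting the RANDBETWEEN result with a typed value in Excel.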
Finally, if you go to Sheet2, highlight all the IDs and look at the bottom right of the window, Excel tells us the count (i.e. number of selected rows) of IDs is 49. So we are missing one from our targeted sample size. This is just due to the rounding of the skip pattern: if the original skip pattern (before rounding) had been 5 it would have resulted in a sample size of 50. Therefore, to get our desired sample size we just need to select one more health facility. The usual approach is to keep counting, going back to the start of the list once you reach the end (e.g. for our skip pattern of 5 if your final selected facility was number 244 then we’d count 245, 1, 2, 3, and then number 4 would be the next facility). Note, however, that because 245 is exactly 49 × 5, in this exercise continuing the count always lands back on your random starting facility, which is already in the sample. In that case select the extra facility at random from those not yet sampled instead (e.g. using the simple random sampling approach from the earlier exercise).
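A hedged Python sketch of topping the sample up to the target size is given below. It continues the count past the end of the list, wrapping back to 1. As an assumption of ours rather than a step in the exercise, it falls back to a simple random draw from the facilities not yet sampled whenever the wrapped position has already been selected.

```python
import random

N, n, k = 245, 50, 5  # frame size, target sample size, skip pattern

start = random.randint(1, k)
selected_ids = list(range(start, N + 1, k))  # 49 IDs, as in the exercise

while len(selected_ids) < n:
    # Count k places on from the last selected ID, wrapping from N back to 1.
    candidate = (selected_ids[-1] + k - 1) % N + 1
    if candidate in selected_ids:
        # Assumption: if wrapping lands on a facility that is already sampled,
        # draw one at random from those not yet sampled instead.
        remaining = [i for i in range(1, N + 1) if i not in selected_ids]
        candidate = random.choice(remaining)
    selected_ids.append(candidate)

print(f"final sample size = {len(selected_ids)}")
print(f"extra facility selected = {selected_ids[-1]}")
```

With N = 245 and k = 5 the wrapped position always equals the random start, so the fallback draw is what supplies the 50th facility here.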
Single-stage cluster sampling and multi-stage cluster sampling
The following optional information is just for awareness and understanding purposes but we will not practice either of these methods.
Read/hide
Cluster random sampling
Cluster sampling involves sampling higher-level units of observation, such as households, schools, villages etc, which contain your lower-level units of observation, typically individuals. If you then also sample units within clusters (i.e. not every unit is selected) then you are using some form of multi-stage cluster sampling method, and that is too complicated for us to look into further, but see below for a bit more detail. However, if you are sampling clusters and then selecting all eligible units within each sampled cluster for data collection you can simply use any of the methods previously covered to sample your clusters. For example, if you wanted to survey community members in a number of communities and you could construct a sampling frame listing all the households in each community you wanted to survey, but not the household members, then you could take a simple random sample of households from your list and then select all eligible individuals within every sampled household for data collection. Or you may wish to stratify the sampling if you also collected data on, say, the total size of each community etc.
Either way this would result in a representative sample (on average) with no need to re-weight the data, unlike if you had also sampled individuals within households. However, when you take a cluster sample you need to account for the clustering in your analysis. This is because standard methods of analysis assume observations (i.e. data points) are statistically independent from one another, but clearly individuals within the same household etc are not independent, and they therefore do not provide the same amount of statistical information about a population as fully independent observations would. Therefore, ignoring clustering in analyses results in falsely high levels of precision/power. We will see one way that you can account for clustering in the complex survey practical sessions.
Multi-stage clustered random sample
Multi-stage clustered random sampling is far too complicated to go into for this module, but we will see how to analyse data from multi-stage clustered samples in the complex survey practical sessions. It is typically only used by large-scale household surveys, such as the Demographic and Health Surveys by USAID and its in-country partners. Such projects almost always use professional statisticians or survey methodologists. In brief though, this approach is actually some combination of the earlier methods covered above. It usually combines a first-stage systematic random sample with probability proportional to size, to select primary sampling units (usually census units, often called “enumeration areas”, that correspond to villages or city blocks), with a second stage of sampling of households, which uses either systematic random sampling again, typically with equal probabilities of selection, or a simple random sample. However, there are many variations, possibly with additional levels of sampling. This means that the probability of selection of the ultimate sampling units is generally not equal with such methods, and complicated sampling weights need to be calculated to ensure analyses produce unbiased results.