Guide

Reda Belmejdoub

September 12, 2019

Everything you need to know about picking the right sample size.

First, let’s define a few key concepts in statistics that we will be seeing in this article. In statistics, the term *survey* is used to define any investigation done on a collection of *records* from a given population in order to estimate its characteristics. So, you could say that populations are large numbers of records.

A population in this context doesn’t necessarily have to be a group of people; it can also pertain to clicks on a website, hotdog sales invoices, or medical logs!

When conducting a survey, it can be inconvenient, or even impossible, to obtain responses from the entire population (imagine having to survey every person in the US!). On the other hand, you can’t draw conclusions about a population if you’re surveying too small a group. That’s why we have samples. *A sample is a portion of the population being studied that is large enough to be considered representative.*

For example, if the total population of a city is our study subject, a sample could be a select group of inhabitants from every district.

Now, with that out of the way, let’s get to the meat of the question:

How important is choosing the correct sample size?

Conducting your study with an incorrect sample size comes with the risk of obtaining misleading results, and therefore drawing incorrect conclusions. That being said, no sample is a perfect reflection of the population; that is why we’re also going to learn how to determine how far off we are from the truth (what we refer to as the “error”).

First, there is the statistician’s purely mathematical way of calculating a sample size based on a few important variables. If you want the more down to earth, simplified method of determining a sample size for a small population, just head to the next section (*The Simple, Down to Earth Method*) of this article.

Population Size

This is the total number of elements in your population, i.e., the population of a city, the number of employees in a company, etc.

Confidence Level

How confident do you want to be that the actual mean falls within your confidence interval? The most common confidence levels are 90%, 95%, and 99%. Simply put, if your confidence level is 95%, it means that if you conduct your survey 100 times on random samples of the same size, the resulting interval should contain the actual value about 95 out of 100 times. Each confidence level corresponds to a Z-score (1.645 for 90%, 1.96 for 95%, and 2.576 for 99%). For common confidence levels, look at Table 1.1.
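If you would rather compute a Z-score than look it up, here is a minimal Python sketch. It assumes a two-sided interval based on the normal distribution; the function name `z_score` is our own and not part of any library.

```python
from statistics import NormalDist


def z_score(confidence_level: float) -> float:
    """Two-sided critical value of the standard normal distribution."""
    if not 0 < confidence_level < 1:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    # Split the leftover probability evenly between the two tails.
    upper_tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - upper_tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} -> z = {z_score(level):.3f}")
# 90% -> z = 1.645
# 95% -> z = 1.960
# 99% -> z = 2.576
```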

Standard Deviation

The standard deviation of a population tells us how much the records vary from one another. If the variable we’re measuring does not differ a lot between records, it would be safe to assume that the standard deviation is low. On the other hand, the higher the difference across records, the higher the standard deviation.

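To calculate the standard deviation of a population, we use the formula:

σ = √( Σ(xᵢ − μ)² / N )

where xᵢ is the value of each record, μ is the population mean, and N is the population size.

Or, in the case of a study of the proportion of a qualitative variable, the standard deviation follows from the proportion p of a given response in the population: σ = √(p × (1 − p)).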
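Both cases can be sketched in a few lines of Python. The `ages` list below is made-up illustrative data, and `proportion_std_dev` is a helper name of our own; for a yes/no variable with proportion p, the standard deviation is the square root of p × (1 − p).

```python
from math import sqrt
from statistics import pstdev


def proportion_std_dev(p: float) -> float:
    """Standard deviation of a yes/no variable with proportion p."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    return sqrt(p * (1 - p))


# Made-up records for a quantitative variable (ages of five people).
ages = [23, 25, 31, 38, 43]

# pstdev divides by N, i.e. it treats the list as the whole population.
print(pstdev(ages))

# A 50/50 split gives the largest possible value for a proportion.
print(proportion_std_dev(0.5))  # 0.5
```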

Margin of Error

Remember when we talked about samples not being perfect? The margin of error is, simply put, a measurement of how closely your sample’s results can be expected to match those of the overall population. In political surveys, for example, you will see a number next to the results, such as (+/-5%), indicating that if 45% of your sample’s respondents stated that they would vote for candidate A, then you can reasonably expect that surveying the total population would give a result between 40% and 50%.

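The general formula to find the margin of error for the sample mean is:

ME = z × σ / √SS

where z is the Z-score of the chosen confidence level, σ is the standard deviation, and SS is the sample size.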
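In code, the margin of error is the Z-score times the standard deviation, divided by the square root of the sample size. Here is a minimal Python sketch; the function and parameter names are our own, and the inputs are illustrative.

```python
from math import sqrt


def margin_of_error(z: float, std_dev: float, sample_size: int) -> float:
    """Margin of error for a sample mean, assuming a normal approximation."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * std_dev / sqrt(sample_size)


# Illustrative inputs: 95% confidence (z = 1.96), standard deviation 0.5,
# and 400 respondents.
print(round(margin_of_error(1.96, 0.5, 400), 3))  # 0.049
```

Note that quadrupling the sample size only halves the margin of error, since the sample size sits under a square root.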

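Now that we have defined these parameters, we can determine the necessary sample size by using the formula:

SS = z² × σ² / ME²

where SS is the sample size, z is the Z-score of the chosen confidence level, σ is the standard deviation, and ME is the margin of error. For a proportion, σ² = p × (1 − p), so the formula becomes SS = z² × p × (1 − p) / ME².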

When using this formula, we’re supposing that certain conditions are met. The original population needs to have a normal distribution (bell-shaped curve), or the sample size needs to be large enough that the normal distribution can be used as an approximation. This method is generally not applicable to small sample sizes (typically below 30).

We are also assuming that the population’s standard deviation is known. This is not always true, especially if no prior survey data is available. In such cases, we can either try to find a reference value from other studies, or use an estimate.
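The sample size calculation can be sketched in Python as follows. This is a rough illustration under the assumptions just listed (normal approximation, known or estimated standard deviation); the function name and the numbers in the usage line are hypothetical.

```python
from math import ceil


def sample_size(z: float, std_dev: float, margin_of_error: float) -> int:
    """Smallest sample size that reaches the requested margin of error."""
    if margin_of_error <= 0:
        raise ValueError("margin_of_error must be positive")
    raw = (z * std_dev / margin_of_error) ** 2
    # You cannot survey a fraction of a record, so always round up.
    return ceil(raw)


# Hypothetical study: estimated standard deviation of 15 units,
# desired margin of error of 2 units, 95% confidence (z = 1.96).
print(sample_size(1.96, 15, 2))  # 217
```

The standard deviation and the margin of error must be expressed in the same unit, since only their ratio enters the calculation.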

Let’s imagine we wanted to conduct a poll on the next presidential election’s results. Our entire population would be the total number of adults in the US, who are of legal voting age. Now the question becomes, how many people do we need to actually conduct the survey on?

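First, we need to know our population’s size, which is 250,056,000, according to 2016 census figures. With a population this large, its exact size has practically no effect on the required sample size, which is why it does not appear in the calculation below. Next, we need to choose the desired confidence level for our study’s results: we’ll take 95% as it is one of the most commonly used values, which makes our Z-score 1.96.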

Note that the choice of confidence level will influence the resulting margin of error and sample size. If you find that your sample size is too large, you can always lower the confidence level or accept a wider margin of error to reduce it, but at the expense of precision.
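To get a feel for this trade-off, the short Python sketch below prints the required sample size for a few combinations of confidence level and margin of error. It assumes a yes/no question with a proportion of 0.5 and a normal approximation; `required_sample` is a name of our own.

```python
from math import ceil
from statistics import NormalDist


def required_sample(confidence_level: float, margin_of_error: float,
                    p: float = 0.5) -> int:
    """Sample size for a proportion at a given confidence level."""
    # Two-sided Z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


margins = (0.10, 0.05, 0.03)
for level in (0.90, 0.95, 0.99):
    sizes = [required_sample(level, moe) for moe in margins]
    print(f"{level:.0%} confidence, margins {margins}: {sizes}")
```

Each row grows as the margin of error shrinks, and each column grows as the confidence level rises.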

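We don’t know our standard deviation yet, since we haven’t sent out our poll, so it’s best to choose a conservative value for the proportion p, which would be 0.5 (50%). This is the value that maximizes p × (1 − p), and therefore the required sample size.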

Assuming we want to obtain results within a 5% margin of error, our sample size is:

SS = 1.96² × 0.5 × (1 − 0.5) / (0.05)² = 384.16

This means we only need to survey about 385 people (rounding up) to get results with a 5% margin of error and a 95% confidence level.
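The calculation can be checked with a few lines of Python. The second function applies the finite population correction, a standard adjustment that this article does not otherwise cover; it is included only to illustrate that a population of 250,056,000 leaves the answer unchanged, while a hypothetical population of 1,000 would not. All function names are our own.

```python
from math import ceil


def sample_size_for_proportion(z: float, p: float,
                               margin_of_error: float) -> float:
    """Unrounded sample size for a proportion (normal approximation)."""
    return z ** 2 * p * (1 - p) / margin_of_error ** 2


def finite_population_correction(n0: float, population_size: int) -> float:
    """Adjust an unrounded sample size n0 for a finite population.

    Assumption: this is the usual textbook correction, not a formula
    taken from this article.
    """
    return n0 / (1 + (n0 - 1) / population_size)


n0 = sample_size_for_proportion(z=1.96, p=0.5, margin_of_error=0.05)
print(round(n0, 2))  # 384.16
print(ceil(n0))      # 385

# The US voting-age population used in the example barely changes the result.
print(ceil(finite_population_correction(n0, 250_056_000)))  # 385

# A hypothetical population of 1,000 needs noticeably fewer respondents.
print(ceil(finite_population_correction(n0, 1_000)))  # 278
```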

Simple enough, isn’t it? Well, not really. The problem above represents a simple example of a “votes A or B” study. Studying continuous variables and more intricate data is not as easy as applying the formula above. There’s also the concern of the availability of prior data, such as the standard deviation. You may simply not have access to the data, resulting in the need to make estimates of your own or refer to other resources to find other ways around it. So, now is a good time to go over the reality of what you will probably be doing.

The Simple, Down to Earth Method

As a manager or team lead, you will usually be confronted with relatively small populations. As such, you will have a better grasp on the data you’re dealing with and have an easier time conducting your study on the whole population. In cases where this is not possible, you can still ballpark your population size to get results that will be serviceable enough to use.

When choosing your sample size, you’ll need to keep the following in mind:

- What are the constraints you’re dealing with?
- How much time and effort can you put into the surveying process?
- What kind of information are you trying to extract?
- Do the results you’re obtaining go against your intuition?

If your population size is less than 100, it would be better to conduct a survey on the whole population. As the size of a population decreases, outliers become more prominent, and if you don’t include the population as a whole, you could miss some very important trends.

Do not restrict yourself to an arbitrary number: listen to your intuition, adapt your choices, and try different things. If you see that getting a larger sample would provide you with useful insights, consider it. If you’re fairly confident that a smaller sample is enough to paint the picture you need, use it. You’ll have limited time and resources, so be sure to make the most of them.