Sampling Methods and Bias with Surveys: Crash Course Statistics #10 - YouTube

Channel: CrashCourse

Hi, I’m Adriene Hill and welcome back to Crash Course Statistics. In our last episode we talked about how we use experiments to imitate having two parallel universes to test things. But sometimes you can’t do certain experiments without becoming an all-powerful and evil dictator, and since it’s statistically unlikely that any of you are evil dictators, today we’ll explore some other methods.
Like we mentioned at the beginning of the series, you’re not always able to answer the questions you really want to answer using statistics. For example, it would be great to experimentally test whether getting married increases your lifespan, but you can’t randomly assign some people to be married and force another group to be single. Not only would that be difficult to enforce, it would also be pretty unethical, though I suppose you being evil takes care of that particular concern.
Similarly, we can’t assign someone to be a twin, or a Democrat, or a smoker. But that doesn’t mean we should just give up and stop trying to find out more about these topics. Not at all. Instead we just need a different method to collect data. Enter non-experimental methods.

INTRO
One of the most common non-experimental methods is the survey. From user experience surveys on websites, to political polls, to health questionnaires at the doctor’s office, you’ve probably taken hundreds of surveys in your lifetime. There are two things that can make or break a survey: the questions, and who the researcher gives the questions to. The goal of a survey is to get specific information.
Say you’re walking your dog in a local park, and someone approaches you and asks you to take a survey on local businesses in your town. When you look at the questions you notice that none of them are about local businesses; instead you find yourself answering questions about your politics and religious beliefs. Unless the surveyor was lying to you about their purposes, this is not a very good survey. It’s also not a very good lie. A survey should measure what it claims to measure.
It might seem obvious that having only unrelated questions on your survey is problematic, but there are even more subtle ways a question can be biased. Let’s take a look at a few questions from a health survey you might take at a doctor’s office.

The first question asks how often you exercise: never, less than 30 minutes a week, or 30 minutes a day. So what do you answer if you exercise for half an hour twice a week? Or if you’re on the swim team and exercise for at least an hour a day? And does walking count as exercise? Multiple choice questions that don’t offer all possible options and/or an “Other” option can cause respondents to either skip the question or feel forced to choose an answer that isn’t accurate. Claims made using these questions aren’t as strong as they could be if people were offered a full range of choices.
The next question asks you to answer yes or no: “I don’t smoke because I know it’s damaging to my health.” This is a leading question, since the wording leads you toward the quote-unquote “desired” answer. This is especially effective when a question deals with sensitive issues like smoking, politics, or religion. People answering the questions want to be seen in a positive light, and so they tend to give the answer they think is “appropriate”. While having people fill surveys out anonymously by themselves can help, it can sometimes be the case that respondents don’t want to admit things--even to themselves--that are socially undesirable.
In general terms, good survey questions are worded in a neutral way, such as asking “how often do you exercise” or “describe your smoking habits”, instead of using wording or options that push survey takers in a certain direction. And while your doctor wouldn’t...or shouldn’t...do this, sometimes groups purposely use biased questions in their surveys to get the results that they want.
Apparently, back in 1972, Virginia Slims conducted a poll asking respondents if they would agree with the statement: “There won’t be a woman President of the United States for a long time and that’s probably just as well.” Not a well-written question. Biased questions can be more subtle...and can lead to skewed reports of very serious things like sexual assault, or mental health conditions. It’s important to always look for biased questions in surveys, especially when the people giving the survey stand to benefit from a certain response.
Even when researchers have created a non-biased survey, they still need to get it into the right hands. Ideally, a survey should go to a random sample of the population that they’re interested in. Usually this means using a random number generator to pick who gets the survey. We do Simple Random Sampling so that there’s no pattern or system for selecting respondents, and each respondent has an equal chance of being selected. For example, telephone surveys often use Random Digit Dialing, which selects 7 random digits and dials them. When someone picks up, they’re asked to take a survey.
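A simple random sample like this can be sketched in a few lines of code. This is a minimal illustration, not any pollster’s actual system; the population size, sample size, and seed are all made-up numbers.

```python
import random

# Hypothetical sampling frame: 10,000 numbered residents.
population = list(range(10_000))

# Simple random sample of 100: no pattern or system for selecting
# respondents, and everyone has an equal chance of being picked.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
sample = rng.sample(population, k=100)

# Random digit dialing, sketched the same way: draw 7 random digits
# and join them into a number to call.
number = "".join(str(rng.randint(0, 9)) for _ in range(7))
print(len(sample), number)
```

`random.sample` draws without replacement, which matches how a survey would pick distinct respondents rather than calling the same person twice.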
But here’s where we hit our first issue. If people aren’t forced to respond to the survey, we might experience something called Non-Response Bias, in which the people who are most likely to complete a survey are systematically different from those who don’t. For example, people with non-traditional working schedules, like retirees, stay-at-home parents, or people who work from home, might be more likely to answer a middle-of-the-day phone survey. This is a huge problem if those groups are different from the population as a whole. If your survey was on health insurance plans, or political opinions, it’s likely that these three groups would have different opinions than the population, but they represent the majority of survey responses, which means your data won’t represent the total population very well.
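You can see this effect in a toy simulation. All the numbers here are hypothetical: suppose people who are home during the day support some policy at a higher rate than everyone else, and a midday phone survey only reaches them.

```python
import random

# Toy population (hypothetical rates): 20% are home during the day
# and support a policy at 70%; the other 80% support it at 40%.
rng = random.Random(1)
population = (
    [{"home": True,  "supports": rng.random() < 0.70} for _ in range(2_000)]
  + [{"home": False, "supports": rng.random() < 0.40} for _ in range(8_000)]
)

true_rate = sum(p["supports"] for p in population) / len(population)

# A midday phone survey only reaches people who are home, so the
# respondents are systematically different from the non-respondents.
respondents = [p for p in population if p["home"]]
survey_rate = sum(p["supports"] for p in respondents) / len(respondents)

print(round(true_rate, 2), round(survey_rate, 2))
```

The survey estimate lands near 70% even though true support in this toy population is closer to 46%, purely because of who picked up the phone.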
This is also related to Voluntary Response Bias, in which people who choose to respond to voluntary surveys they see on Facebook...or Twitter...are, again, different from the broad population. This is especially true with things like customer service surveys. People who respond tend to have either very positive or very negative opinions. See the comment section below. The majority of customers with an average experience tend not to respond, because the service wasn’t noteworthy. Wait. Does that mean I’m not noteworthy?
Another source of bias is just plain underrepresentation. If a group of interest is a minority in the population, random sampling paired with response biases might mean that that minority isn’t represented at all in the sample. Let’s say there’s a city where 5% of the population is single mothers; it’s entirely possible that the sample will contain no single moms.

To overcome these issues, we have a couple of options. We could weight people’s responses so that they match the population (like counting the few single mothers who do respond multiple times, so that they count for 5% of the total sample). But this can be problematic for the same reasons that response bias is problematic. If the few single mothers who respond don’t represent all single mothers, our data is still biased.
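Here’s a sketch of that kind of weighting with hypothetical numbers: single mothers are 5% of the city but only 1% of our respondents, so each of their answers is made to count five times as much.

```python
# Hypothetical responses: 10 single mothers (all answering 1.0 on some
# question) out of 1,000 respondents; everyone else answers 0.0.
responses = [("single_mother", 1.0)] * 10 + [("other", 0.0)] * 990

# Known population shares vs. observed sample shares.
pop_share = {"single_mother": 0.05, "other": 0.95}
sample_share = {
    g: sum(1 for grp, _ in responses if grp == g) / len(responses)
    for g in pop_share
}

# Each group's weight scales its responses up (or down) to its
# share of the population: 0.05 / 0.01 = 5x for single mothers.
weights = {g: pop_share[g] / sample_share[g] for g in pop_share}

weighted_mean = (
    sum(weights[grp] * ans for grp, ans in responses)
    / sum(weights[grp] for grp, _ in responses)
)
print(round(weighted_mean, 3))
```

After weighting, single mothers contribute 5% of the total instead of the 1% they made up in the raw sample. Notice the caveat from above still applies: the weights amplify whatever those 10 respondents happened to say.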
In a 2016 LA Times/USC political tracking poll, a 19-year-old black man was one of 3,000 panelists who were interviewed week after week about the upcoming presidential election. Because he was a member of more than one group that was underrepresented in this poll, his response was weighted 30x more than the average respondent’s. According to the New York Times, his responses boosted his candidate’s margins by an entire percentage point.
Stratified Random Sampling is another option. It splits the population into groups of interest and randomly selects people from each of the “strata”, so that each group in the overall sample is represented appropriately. Researchers have used stratified sampling to study differences in the way same-sex and different-sex couples parent their kids. They randomly select people from the same-sex parenting group, and randomly select people from the different-sex parenting group, to make sure that both are well represented in the sample.
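Stratified sampling is easy to sketch: split the population into strata first, then run a simple random sample inside each stratum. The group names and sizes below are made up for illustration.

```python
import random

# Hypothetical population: 200 same-sex parents, 1,800 different-sex
# parents, tagged with their stratum.
population = ([("same_sex", i) for i in range(200)]
              + [("different_sex", i) for i in range(1_800)])

# Split the population into strata (groups of interest).
strata = {}
for group, person in population:
    strata.setdefault(group, []).append(person)

# Draw 50 parents from EACH stratum, so the smaller group can't be
# drowned out the way it could be in one big random draw.
rng = random.Random(0)
sample = {group: rng.sample(members, k=50)
          for group, members in strata.items()}
print({g: len(s) for g, s in sample.items()})
```

Compare this with a single simple random sample of 100, which would give only about 10 same-sex parents on average and could easily give far fewer.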
Another issue is that getting surveys to people can be expensive. If a cereal company wants to see how families react to their new cereal, it would be costly to send some cereal to a random sample of all families in the country. Instead they use Cluster Sampling, which takes naturally occurring clusters (not Honey Nut Clusters), like schools or cities, and randomly selects a few clusters to survey, instead of randomly selecting individuals. For this to work, clusters can’t be systematically different from the population as a whole, and they should each roughly represent all groups.
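In code, the only randomness in cluster sampling is over the clusters themselves; everyone inside a chosen cluster gets surveyed. The schools and family counts here are hypothetical.

```python
import random

# Hypothetical clusters: 100 schools, each a naturally occurring
# group of 30 families.
schools = {f"school_{i}": [f"family_{i}_{j}" for j in range(30)]
           for i in range(100)}

# Randomly select a few whole clusters instead of individuals.
rng = random.Random(7)
chosen = rng.sample(sorted(schools), k=5)

# Every family in a chosen cluster gets the survey (and the cereal).
surveyed = [family for school in chosen for family in schools[school]]
print(len(chosen), len(surveyed))
```

Five clusters of 30 yields 150 surveyed families, but the cereal only has to be shipped to 5 places instead of 150 random addresses, which is the cost saving the method is after.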
Issues can also arise when the population being surveyed is very small or difficult to reach, like children with rare genetic disorders, or people addicted to certain drugs. In this case, surveyors may choose not to use randomness at all, and instead use Snowball Sampling. That’s when current respondents are asked to help recruit people they know from the population of interest, since people tend to know others in their communities and can help researchers get more responses.
And note that these sampling techniques can be, and are, used in experiments as well as surveys.

There are also non-experimental data collection methods like a Census. A Census is a survey that samples an ENTIRE population. The United States conducts a Census every 10 years, with the next one scheduled to be done in 2020. It attempts to collect data from every. single. resident of the United States (even undocumented residents, and homeless residents).
As you can imagine, this is hard, and it is not without error. In Medieval Europe, William I of England conducted a census in order to properly tax the people he had conquered. In fact, a lot of rulers tended to use censuses to know just how much money they should be demanding. Until the widespread availability of computers, US census data took almost 10 years to collect and analyze, meaning that the data from the last census wasn’t even available until right before the next census. The length of time it took to complete the census is part of the reason we even have computers...check out our CompSci series for more on that.
So why collect census data instead of just sampling the population, especially when, in the US, the Census could cost more than 15 billion dollars in 2020? There are a lot of reasons. The constitution says we have to, but the census also provides the truest measure of the population we can get. It minimizes sampling error. It also functions as a benchmark for future studies. And a census can give researchers really specific information about small groups of the population--information that might be hard to gather with regular sampling methods.
Doing statistics on Census data is different, because most statistical inference aims to take a small sample and use it to make guesses about the population. But with a census we already have data from the entire population; we don’t need to guess whether there are differences, we can just see them. Analysis of Census data is usually more concerned with whether the differences we see are large enough to make a difference in everyday life, rather than guessing IF there is a relationship. The census, as we said, can take years. And entire countries to fund.
That doesn’t discount the value of sampling. But we should be cautious...badly worded polls, fake polls, and biased polls are common. So are the results of those polls. The statistics-friendly website FiveThirtyEight put together a great list of advice on how not to fall for a fake poll. Among its advice: ask yourself if it seems professional. Check to see who conducted the poll--and if you trust them. See how the poll was conducted. Check out the questions they asked...and who they asked. If it seems fishy, it probably is fishy.
That said, well-done surveys are essential. They allow us to get information without all the trouble of doing an experiment, and since they’re comparatively easy, they’re popular ways for businesses, countries, and even YouTube channels to collect information. In fact, Crash Course Statistics has its own survey! The link is in the description. And it takes way less time than the Nerdfighteria one. I promise.

Thanks for watching. I’ll see you next time.