
The statistical investigation process begins with the need to solve a real-world problem and aims to reflect the way statisticians work. The data cycle gives us a structure to follow.
Recall that bivariate data is data that is represented by two variables. This data is typically numerical and can be displayed in a scatterplot.
We need to identify the variables we want to study in our data cycle. Identifying the independent variable and dependent variable is important for formulating our questions and for accurate data analysis.
While bivariate data often involves two numerical variables, we can also explore relationships where one variable (or even both) is categorical. For example, we might compare numerical data across different groups.
Categorical variables can be added to a scatterplot using color or different symbols.
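To see what color or symbols add to a scatterplot, we can think of each point as carrying a third, categorical label. The Python sketch below groups points by that label, which is roughly the step a plotting tool needs before it can give each group its own color. The cake records and the function name group_points are made up for illustration and do not come from a real data set.

```python
from collections import defaultdict

# Hypothetical, made-up records for illustration only:
# (hours to make the cake, price charged in dollars, type of cake)
cakes = [
    (2.0, 45, "birthday"),
    (3.5, 70, "birthday"),
    (6.0, 220, "wedding"),
    (1.5, 40, "birthday"),
    (8.0, 310, "wedding"),
]


def group_points(records):
    """Split (x, y, category) records into one list of (x, y) points per category."""
    groups = defaultdict(list)
    for x, y, category in records:
        groups[category].append((x, y))
    return dict(groups)


points_by_type = group_points(cakes)
for cake_type, points in points_by_type.items():
    # Each group would be drawn with its own color or symbol on the scatterplot.
    print(cake_type, points)
```

Each group keeps its own list of (x, y) points, so the two numerical variables can still be compared within a group or across groups.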
Once we know what variables we will be exploring, we need to formulate a question that requires the collection of data that can be analyzed using a data display.
When formulating questions, consider where the data will come from (the data source). We can consider:
What population do we want to make a conclusion about?
How can we find relevant data? Is it easy to acquire secondary data that already exists?
Who will be using the conclusions from the analysis?
Once the question has been formulated, we need to determine how to collect or acquire the necessary data. Here are some ways to collect data:
Research using secondary sources to find existing data.
Surveys can be done by asking each member of the representative sample two questions or giving them a questionnaire. Answers can be more open-ended than in a poll.
Observations can be made by watching members of the sample and noting particular characteristics.
Scientific experiments can be done by carefully selecting a sample and controlling as many other variables as possible, then changing the independent variable to see how the dependent variable responds.
There are many formulas and methods for estimating or predicting someone's adult height, such as doubling their height at age 2 or using a combination of their biological parents' heights. With a partner or in a small group, use the data cycle to explore relationships involving adult height.
Brainstorm potential statistical questions.
What would the variables be for each question? Are they all numerical or are some categorical?
Do you think there could be a single model that would be accurate across all demographics like race and gender? Explain.
Which data collection method might be best for exploring this relationship?
When doing a survey or using secondary sources, it is important that the data is collected from a sample that is representative of the population, so that our analysis of the data is valid.
Representative means that the characteristics of the sample should be similar to those of the population. Here are a few ways to collect a representative sample:
Having a sample that is large enough to represent the characteristics of the population. The larger the sample size, the closer the results tend to be to those of the population.
Having a sample that is selected without strategically choosing more people from a certain group.
Randomly selecting the sample helps ensure it is representative and avoids unintentionally introducing sampling bias.
In this course, we often use larger data sets because they usually give us a better picture of the whole population.
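The ideas above about random selection and sample size can be tried out with a small simulation. The Python sketch below is only an illustration under stated assumptions: the population is 10,000 made-up heights generated by the computer, and the function names sample_mean and typical_error are our own. It draws many random samples of different sizes and measures how far, on average, each sample mean lands from the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so repeated runs give the same numbers

# Assumption: a made-up population of 10,000 adult heights in centimeters.
population = [random.gauss(170, 10) for _ in range(10_000)]


def sample_mean(data, size):
    """Mean of a simple random sample of `size` values, drawn without replacement."""
    return statistics.fmean(random.sample(data, size))


def typical_error(data, size, trials=200):
    """Average gap between a sample mean and the population mean over many samples."""
    target = statistics.fmean(data)
    gaps = [abs(sample_mean(data, size) - target) for _ in range(trials)]
    return statistics.fmean(gaps)


for size in (10, 100, 1000):
    print(size, round(typical_error(population, size), 2))
```

With these made-up heights, the average gap should shrink as the sample size grows. Here random.sample does the random selection, so no group is chosen more often on purpose.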
Bias in the data cycle can lead to misleading or inaccurate conclusions.
Sampling bias can occur due to undercoverage (not enough people from a certain group) or exclusion (leaving out a group entirely). For example, surveying only smartphone users about internet habits would exclude people without smartphones.
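The smartphone example can also be simulated. In the Python sketch below, every number is an assumption chosen for illustration: we invent a population in which about 80% of people own a smartphone and owners spend more hours online per day. We then compare a random sample of everyone with a sample that leaves out people without smartphones.

```python
import random
import statistics

random.seed(2)  # fixed seed so repeated runs give the same numbers

# Assumption: a made-up population of (owns_smartphone, hours online per day).
population = []
for _ in range(5_000):
    owns_smartphone = random.random() < 0.8
    typical_hours = 4.0 if owns_smartphone else 1.5
    hours = max(0.0, random.gauss(typical_hours, 1.0))
    population.append((owns_smartphone, hours))

everyone = [hours for _, hours in population]
smartphone_only = [hours for owns, hours in population if owns]

true_mean = statistics.fmean(everyone)
# Random sample from the whole population.
random_sample_mean = statistics.fmean(random.sample(everyone, 500))
# Biased sample: people without smartphones are excluded entirely.
biased_sample_mean = statistics.fmean(random.sample(smartphone_only, 500))

print("population mean:", round(true_mean, 2))
print("random sample mean:", round(random_sample_mean, 2))
print("smartphone-only sample mean:", round(biased_sample_mean, 2))
```

Because the excluded group behaves differently in this made-up population, the smartphone-only sample overestimates the average, even though it is just as large as the random sample.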
First identify the variables, then write an investigative question related to each scenario.
The local souvenir shop has noticed that their sweatshirt sales seem to be related to the temperature outside. They want to investigate this relationship more closely.
For a school in Fairfax, VA, the principal noticed that the number of days missed by a student in September is a good predictor of the number of total days they will be absent throughout the year. She wants to investigate this relationship.
A baker wants to adjust his pricing model to be more competitive. He wants to look at the price he charged for a cake compared to the time it took to create. He is curious whether he would need separate models for wedding cakes and birthday cakes or whether the same model would be appropriate for all kinds of cakes.
For each investigative question, select which data collection technique would be best. Explain your answer.
For the intersection of Chain Bridge Road and Eaton Place, Louisa notices that sometimes she can walk through easily, but sometimes she gets stuck in a crowd.
She asks the question "For the intersection at Chain Bridge Road and Eaton Place, how can the relationship between pedestrian density (people per square yard) and walking speed (feet per second) be modeled?"
Polly loves attending concerts, but finds that she often can't see the stage because of the taller people in front of her. This leads her to ask the question:
"For those who attend concerts at the local venue, is there a relationship between height and amount spent on concerts in a year?"
Bivariate data is data that is represented by two variables. This data is typically numerical and can be displayed in a scatterplot.
To explore bivariate data, we first need to formulate an investigative question. Then we can determine how to collect or acquire the necessary data, for example:
Research using secondary sources to find existing data.
Surveys can be done by asking each member of the representative sample two questions or giving them a questionnaire.
Observations can be made by watching members of the sample and noting particular characteristics.
Scientific experiments can be done by controlling other variables, then varying the independent variable and measuring the dependent variable.
When collecting data, it is crucial to use a representative sample and avoid statistical bias to ensure the conclusions drawn are valid for the population being studied.