Algebra, Functions, and Data Analysis

5.04 Formulating investigative questions and collecting bivariate data

Formulate questions and collect bivariate data

The statistical investigation process begins with the need to solve a real-world problem and aims to reflect the way statisticians work. The data cycle gives us a structure to follow:

A data cycle with four stages. At the top, there is Formulate questions represented by a speech bubble with a question mark. To the right, Collect or acquire data is shown with an icon of a person and a magnifying glass. At the bottom, Organize and represent data is illustrated with a dot plot. To the left, Analyze and communicate results is indicated by a person with charts. Clockwise arrows are drawn from one stage to the next.

Recall that bivariate data is data described by two variables, where each observation gives a value for both. This data is typically numerical and can be displayed on a scatterplot.

We need to identify the variables we want to study in our data cycle. Identifying the independent variable and dependent variable is important for formulating our questions and for accurate data analysis.

While bivariate data often involves two numerical variables, we can also explore relationships where one variable (or even both) is categorical. For example, we might compare numerical data across different groups.

Categorical variables can be added to a scatterplot using color or different symbols.

A graph with age in years on the horizontal axis and height in centimeters on the vertical axis. Dots show the heights of different people at various ages. Green dots represent females and blue dots represent males.
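One way to picture bivariate data with an added categorical variable is as a list of records, each holding two numerical values and a group label. The sketch below is only an illustration: the ages, heights, and the function name `split_by_category` are made up, not taken from the graph. It splits the records by label, which is the step that lets each group be drawn in its own color or symbol.

```python
from collections import defaultdict

# Hypothetical records: (age in years, height in cm, group label).
# These values are invented purely to show the structure of the data.
records = [
    (5, 110, "female"),
    (5, 112, "male"),
    (10, 139, "female"),
    (10, 138, "male"),
    (15, 162, "female"),
    (15, 170, "male"),
]

def split_by_category(rows):
    """Group the (x, y) pairs by their categorical label."""
    groups = defaultdict(list)
    for x, y, label in rows:
        groups[label].append((x, y))
    return dict(groups)

groups = split_by_category(records)
for label, points in groups.items():
    # Each group could be plotted with its own color or symbol.
    print(label, points)
```

Each list of points still describes the same two numerical variables, so the groups can share one scatterplot or be shown on separate ones.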

Once we know what variables we will be exploring, we need to formulate a question that requires the collection of data that can be analyzed using a data display.

Statistical question

A question that can be answered by collecting data and whose answer may vary depending on the sample the data is collected from.

When dealing with bivariate data, we often use the term investigative question to specifically refer to a statistical question exploring the relationship between two variables.

When formulating questions, consider where the data will come from (the data source). We can consider:

  • What population do we want to make a conclusion about?

  • How can we find relevant data? Is it easy to acquire secondary data that already exists?

  • Who will be using the conclusions from the analysis?

Once the question has been formulated, we need to determine how to collect or acquire the necessary data. Here are some ways to collect data:

  • Research using secondary sources to find existing data.

  • Surveys can be done by asking each member of a representative sample two questions (one for each variable) or giving them a questionnaire. Answers can be more open-ended than in a poll.

  • Observations can be made by watching members of the sample and noting particular characteristics.

  • Scientific experiments can be done by carefully selecting a sample and controlling as many other variables as possible, then changing the independent variable to see how the dependent variable responds.

Exploration

There are many formulas and methods for estimating or predicting someone's adult height, such as doubling their height at age 2 or using a combination of their biological parents' heights. With a partner or in a small group, use the data cycle to explore relationships involving adult height.

  1. Brainstorm potential statistical questions.

  2. What would the variables be for each question? Are they all numerical or are some categorical?

  3. Do you think there could be a single model that would be accurate across all demographics like race and gender? Explain.

  4. Which data collection method might be best for exploring this relationship?

When doing a survey or using secondary sources, it is important that the data is collected from a sample that is representative of the population, so that our analysis of the data is valid.

Representative means that the characteristics of the sample should be similar to those of the population. Here are a few ways to collect a representative sample:

  • Having a sample that is large enough to represent the characteristics of the population. The larger the sample size, the closer the results tend to be to those of the population.

  • Having a sample that is selected without strategically choosing more people from a certain group.

  • Randomly selecting the sample helps ensure it is representative and avoids unintentionally introducing sampling bias.
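Random selection can be carried out with a random number generator. The following is a minimal sketch in Python, assuming the population can be listed in full. The student ID numbers, the sample size of 30, and the function name `simple_random_sample` are hypothetical choices made for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Pick n distinct members so that every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(population, n)

# Hypothetical population: student ID numbers 1 to 500.
student_ids = list(range(1, 501))

sample = simple_random_sample(student_ids, 30, seed=1)
print(sorted(sample))
```

Because the computer makes the selection, nobody can choose more people from a certain group, whether on purpose or by accident.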

A concept of sampling from a population. On the left is a large circle labeled Population containing many diverse cartoon faces representing individuals. On the right is a smaller circle labeled Sample with a subset of the faces from the population, connected by an arrow indicating selection from the larger group to the smaller.

In this course, we often use larger data sets because they usually give us a better picture of the whole population.
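A short simulation can show why larger samples usually give a better picture. The sketch below is an illustration under stated assumptions: the population is 10,000 simulated heights (not real data), and `average_sampling_error` is a hypothetical helper name. It draws many random samples of each size and reports how far, on average, the sample mean lands from the population mean.

```python
import random
import statistics

def average_sampling_error(population, n, trials, rng):
    """Average distance between the sample mean and the population mean."""
    true_mean = statistics.mean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, n)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

rng = random.Random(0)

# Hypothetical population: 10,000 simulated heights in cm (not real data).
population = [rng.gauss(170, 10) for _ in range(10_000)]

errors_by_size = {
    n: average_sampling_error(population, n, 200, rng)
    for n in (10, 100, 1000)
}
for n, err in errors_by_size.items():
    print("sample size", n, "average error", round(err, 2))
```

In this simulation the average error shrinks as the sample size grows. A single large sample can still be off, which is why the claim is "usually" rather than "always".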

Bias in the data cycle can lead to misleading or inaccurate conclusions.

Statistical bias

Anything in the data process that causes our findings from a sample to be systematically different from what is true for the whole population.

Examples:

Sampling bias, observer bias, measurement bias

Sampling bias can occur due to undercoverage (not enough people from a certain group) or exclusion (leaving out a group entirely). For example, surveying only smartphone users about internet habits would exclude people without smartphones.
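To see why this kind of bias is systematic rather than bad luck, here is a small Python simulation of the smartphone scenario. Everything in it is an assumption made for illustration: the group sizes, the simulated hours online, and the helper name `average_estimate` are invented, not real survey results.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population (simulated, not real data): daily hours online.
smartphone_users = [max(0.0, rng.gauss(4.0, 1.0)) for _ in range(800)]
non_users = [max(0.0, rng.gauss(0.5, 0.2)) for _ in range(200)]
everyone = smartphone_users + non_users
true_mean = statistics.mean(everyone)

def average_estimate(frame, n, trials):
    """Average of the sample means from repeated random samples of a list."""
    means = [statistics.mean(rng.sample(frame, n)) for _ in range(trials)]
    return statistics.mean(means)

# Sampling from everyone versus sampling from smartphone users only.
fair_estimate = average_estimate(everyone, 50, 200)
biased_estimate = average_estimate(smartphone_users, 50, 200)

print("population mean:", round(true_mean, 2))
print("sampling everyone:", round(fair_estimate, 2))
print("smartphone users only:", round(biased_estimate, 2))
```

Repeating the survey many times does not fix the smartphone-only estimate. It stays too high because the excluded group is never sampled, which is what "systematically different" means in the definition above.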

Examples

Example 1

First identify the variables, then write an investigative question related to each scenario.

a

The local souvenir shop has noticed that their sweatshirt sales seem to be related to the temperature outside. They want to investigate this relationship more closely.

Worked Solution
Create a strategy

When creating an investigative question, it should focus specifically on the relationship between the two variables involved, which in this case are temperature and sweatshirt sales.

Apply the idea

Independent variable: Temperature

Dependent variable: Sweatshirt sales

Some possible investigative questions might be:

  1. Does a decrease in temperature correlate with an increase in sweatshirt sales?

  2. How does temperature relate to the number of sweatshirts sold?

  3. Is there a specific temperature range that results in the highest sweatshirt sales?

Reflect and check

Each of these questions investigates a different aspect of the relationship, making them effective investigative questions.

If we wanted to be even more explicit, we could add "For the local souvenir shop" to the beginning of each question to make the population clear. However, this should be clear based on the scenario.

b

For a school in Fairfax, VA, the principal noticed that the number of days missed by a student in September is a good predictor of the number of total days they will be absent throughout the year. She wants to investigate this relationship.

Worked Solution
Create a strategy

An investigative question must require the collection of data to answer it. Once we have identified the variables, we can formulate the question.

Apply the idea

Independent variable: Number of days absent in September

Dependent variable: Number of total days absent throughout the year

Some possible investigative questions might be:

  1. How does the number of days missed in September relate to the number of total days absent throughout the year?

  2. Does a high number of missed days in September correlate to a high number of total days absent throughout the year?

  3. Is there a threshold number of days missed in September that would indicate a high number of total days absent throughout the year?

Reflect and check

After one iteration of the data cycle, we might come up with a different question that explores something we noticed in the analysis.

If we want to be more specific, we could add "For a school in Fairfax, VA" to the beginning of each question to make the population clear.

c

A baker wants to adjust his pricing model to be more competitive. He wants to look at the price he charged for a cake compared to the time it took to create. He is curious whether he would need separate models for wedding cakes and birthday cakes or whether the same model would be appropriate for all kinds of cakes.

Worked Solution
Create a strategy

The investigative question should focus on the relationship between the two numerical variables involved.

There is also a categorical variable that could be used for deeper exploration.

Apply the idea

Independent variable: Time it took to create a cake

Dependent variable: Price charged for a cake

Categorical variable: Type of cake

Some possible investigative questions might be:

  1. Does the price charged for a cake correspond to the time taken to create it?

  2. Is a higher price charged for a cake that took longer to create?

  3. Is there a specific time duration for creating a cake that would result in a higher price being charged?

Reflect and check

After one iteration of the data cycle, we could ask follow-up questions and explore them by splitting the scatterplot into two, or by using different symbols or colors for the points, to see if the relationship is the same for each type of cake:

  1. Does the price charged for a cake correspond to the time taken to create it? Is this relationship the same or different for wedding and birthday cakes?

  2. Is a higher price charged for a cake that took longer to create? Is this consistent for birthday and wedding cakes?

  3. Is there a specific time duration for creating a cake that would result in a higher price being charged? Is this range the same for wedding and birthday cakes?

Example 2

For each investigative question, select which data collection technique would be best. Explain your answer.

A
Observation
B
Polls
C
Research
D
Scientific experiment
E
Survey
a

For the intersection of Chain Bridge Road and Eaton Place, Louisa notices that sometimes she can walk through easily, but sometimes she gets stuck in a crowd.

She asks the question "For the intersection at Chain Bridge Road and Eaton Place, how can the relationship between pedestrian density (people per square yard) and walking speed (feet per second) be modeled?"

Worked Solution
Create a strategy

To identify the best data collection technique for the investigative question, we should first identify the variables and then select the most feasible approach to collect or measure data.

Apply the idea

The variables are:

  • Independent variable: Pedestrian density

  • Dependent variable: Average walking speed

Based on the variables, the best data collection technique for the investigative question is observation. We can measure pedestrian density and average walking speed through observation, for example by using video footage from overhead cameras or drones.

This method is also good because it doesn't disturb people ('not intrusive') and gives clear, accurate data we can measure ('quantifiable data').

The answer is option A.

Reflect and check

Let's examine other data collection techniques and their applicability to the given investigative question:

  • Polls - this technique may not give an accurate measurement of the variables as the data collected would rely on people's guesses or estimates, which might not be accurate.

  • Research - it is unlikely that we could find secondary data on pedestrian density and average walking speed for that specific intersection.

  • Scientific experiment - this technique requires selecting a sample and controlling other variables, which would be challenging to do at a public intersection.

  • Survey - this technique requires direct interaction with members of the sample and is not feasible for measuring pedestrian density and walking speed.

b

Polly loves attending concerts, but finds that she often can't see the stage because of the taller people in front of her. This leads her to ask the question:

"For those who attend concerts at the local venue, is there a relationship between height and amount spent on concerts in a year?"

Worked Solution
Create a strategy

To identify the best data collection technique, we should first identify the variables and then select the easiest and most accurate approach to collect or measure data.

Apply the idea

The variables are:

  • Independent variable: Height

  • Dependent variable: Amount spent on concerts in a year

Based on the variables, the best data collection technique for the investigative question is a survey. Surveys allow us to gather personal or private information such as height and spending habits.

To gather this data, she could survey concert attendees. To ensure the sample is representative, she should aim to reach a diverse group of people, perhaps through the venue's mailing list (if diverse) or by surveying attendees across different types of concerts held there, rather than just sampling her friends.

The answer is option E.

Reflect and check

If she wanted to do another iteration of the data cycle, she could explore this relationship for different venues with and without tiered seating.

Idea summary

Bivariate data is data described by two variables, where each observation gives a value for both. This data is typically numerical and can be displayed on a scatterplot.

To explore bivariate data, we first need to formulate an investigative question and then determine how to collect or acquire the necessary data. Some ways to collect data are:

  • Research using secondary sources to find existing data.

  • Surveys can be done by asking each member of the representative sample two questions or giving them a questionnaire.

  • Observations can be made by watching members of the sample and noting particular characteristics.

  • Scientific experiments can be done by controlling other variables, then varying the independent variable and measuring the dependent variable.

When collecting data, it is crucial to use a representative sample and avoid statistical bias to ensure the conclusions drawn are valid for the population being studied.

Outcomes

AFDA.DA.1a

Formulate investigative questions that require the collection or acquisition of bivariate data, where exactly two of the variables are quantitative.

AFDA.DA.1b

Collect or acquire bivariate data from a representative sample to answer an investigative question.
