AustraliaVIC
VCE 11 General 2023

1.06 Measures of spread

Lesson

Introduction

Measures of spread in a quantitative (numerical) data set seek to describe whether the scores in a data set are very similar and clustered together, or whether there is a lot of variation in the scores and they are very spread out.

There are several methods to describe the spread of data, which vary greatly in complexity. It is possible to simply look at the numerical range of the entire data set, or break the data into chunks. The spread of data can also be compared to the mean, which can then be normalised for a meaningful comparison to other data sets.

This section defines the range, interquartile range, and standard deviation as measures of spread. It also explores how to divide a data set of any size into quartiles.

Range

The range is the simplest measure of spread in a quantitative (numerical) data set. It is the difference between the maximum and minimum scores in a data set.

Subtract the lowest score in the set from the highest score in the set. That is: \text{Range }=\text{ Highest score}-\text{Lowest score}.

For example, at one school the ages of students in Year 7 vary between 11 and 14. So the range for this set is 14-11=3.

As a different example, if we looked at the ages of people waiting at a bus stop, the youngest person might be a 7 year old and the oldest person might be a 90 year old. The range of this set of data is 90-7=83, which is a much larger range of ages.

Remember, the range only changes if the highest or lowest score in a data set is changed. Otherwise, it will remain the same.
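The calculation can be sketched in a few lines of Python. This is only an illustration: the function name `data_range` and the individual ages are hypothetical, chosen so that the minimum and maximum match the two examples above.

```python
def data_range(scores):
    """Return the range: highest score minus lowest score."""
    if not scores:
        raise ValueError("need at least one score")
    return max(scores) - min(scores)


# Hypothetical ages; only the minimum and maximum come from the examples above.
year7_ages = [11, 12, 12, 13, 13, 14]
bus_stop_ages = [7, 23, 35, 41, 68, 90]

print(data_range(year7_ages))     # 3
print(data_range(bus_stop_ages))  # 83

# Changing a score that is neither the highest nor the lowest
# leaves the range unchanged.
bus_stop_ages[2] = 50
print(data_range(bus_stop_ages))  # 83
```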

Examples

Example 1

What is the lowest score in a set if the range is 8, and the highest score is 19?

Worked Solution
Create a strategy

Use the formula \text{Range }=\text{highest score}-\text{lowest score}.

Apply the idea
\text{Range} = \text{Highest score}-\text{Lowest score} (Write the formula)
8 = 19-\text{Lowest score} (Substitute the values)
\text{Lowest score} = 19-8 (Swap 8 and the lowest score)
\text{Lowest score} = 11 (Subtract 8 from 19)

Idea summary

\text{Range} = \text{Highest score}-\text{Lowest score}

The range is the difference between the highest and the lowest score.

Interquartile range

Whilst the range is very simple to calculate, it is based on the sparse information provided by the upper and lower limits of the data set. To get a better picture of the internal spread in a data set, it is often more useful to find the set's quartiles, from which the interquartile range (IQR) can be calculated.

Quartiles are scores at particular locations in the data set. They are similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters. Let's look at how we would divide up some data sets into quarters now.

Make sure the data set is ordered before finding the quartiles or the median.

Here is a data set with 8 scores:

A data set with 8 scores. The scores are 1, 3, 4, 7, 11, 12, 14, 19.

First locate the median, between the 4\text{th} and 5\text{th} scores:

A data set with 8 scores. The scores are 1, 3, 4, 7, 11, 12, 14, 19, the median is located between 7 and 11.

Now there are four scores in each half of the data set, so split each group of four scores in half to find the quartiles. We can see the first quartile, Q_{1}, is between the 2\text{nd} and 3\text{rd} scores, so there are two scores on either side of Q_{1} within the lower half. Similarly, the third quartile, Q_{3}, is between the 6\text{th} and 7\text{th} scores:

Scores 1, 3, 4, 7, 11, 12, 14, 19. Quartile 1 is between 3 and 4, quartile 3 is between 12 and 14.

Now let's look at a situation with 9 scores:

Scores 8, 8, 10, 11, 13, 14, 18, 22, 25. Quartile 1 is between 8 and 10, the median is 13, quartile 3 is between 18 and 22.

This time, the 5\text{th} score is the median. There are four scores on either side of the median, as in the set with eight scores. So Q_{1} is still between the 2\text{nd} and 3\text{rd} scores, but because the median now occupies the 5\text{th} position, Q_{3} is between the 7\text{th} and 8\text{th} scores.

Finally, let's look at a set with 10 scores:

A data set scores 12, 13, 14, 19, 19, 21, 22, 22, 28, 30. Quartile 1 is 14, the median is between 19 and 21, quartile 3 is 22

For this set, the median is between the 5\text{th} and 6\text{th} scores. This time, however, there are 5 scores on either side of the median. So Q_{1} is the 3\text{rd} term and Q_{3} is the 8\text{th} term.
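The three cases above follow one procedure: order the scores, find the median, then take the median of each half, leaving the median itself out of both halves when the number of scores is odd. A minimal Python sketch of that procedure follows. The function names are hypothetical, and the sketch assumes that a quartile lying between two scores is taken as their midpoint, as for a median between two scores. Calculators and software may use other quartile conventions and give slightly different values.

```python
def median(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(scores):
    """Return (Q1, median, Q3) using the median-of-halves method.

    When the number of scores is odd, the median is excluded
    from both the lower and the upper half.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([1, 3, 4, 7, 11, 12, 14, 19]))             # (3.5, 9.0, 13.0)
print(quartiles([8, 8, 10, 11, 13, 14, 18, 22, 25]))       # (9.0, 13, 20.0)
print(quartiles([12, 13, 14, 19, 19, 21, 22, 22, 28, 30])) # (14, 20.0, 22)
```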

Each quarter represents 25\% of the data set. The lowest score to the first quartile is approximately 25\% of the data, the first quartile to the median is another 25\%, the median to the third quartile is another 25\%, and the third quartile to the highest score represents the last 25\% of the data. We can combine these sections together. For example, 50\% of the scores in a data set lie between the first and third quartiles.

These quartiles are sometimes referred to as percentiles. A percentile is a percentage that indicates the value below which a given percentage of observations in a group of observations fall. For example, if a score is in the 75\text{th} percentile in a statistical test, it is higher than 75\% of all other scores. The median represents the 50\text{th} percentile, or the halfway point in a data set.

  • Q_{1} is the first quartile (sometimes called the lower quartile). It is the middle score in the bottom half of data and it represents the 25\text{th} percentile.

  • Q_{2} is the second quartile, and is usually called the median, which we have already learned about. It represents the 50\text{th} percentile of the data set.

  • Q_{3} is the third quartile (sometimes called the upper quartile). It is the middle score in the top half of the data set, and represents the 75\text{th} percentile.

The interquartile range (IQR) is the difference between the third quartile and the first quartile. 50\% of scores lie within the IQR because it contains the data set between the first quartile and the median, as well as the median and the third quartile. Since it focuses on the middle 50\% of the data set, the interquartile range often gives a better indication of the internal spread than the range does, and it is less affected by individual scores that are unusually high or low (called outliers).

Subtract the first quartile from the third quartile. That is, \text{IQR} = Q_{3}-Q_{1}
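As a quick illustration, the formula can be applied to the eight-score data set 1, 3, 4, 7, 11, 12, 14, 19 from earlier in this section. The working below assumes each quartile is taken as the midpoint of the two scores it lies between, the same convention used for a median that falls between two scores.

```latex
\begin{aligned}
Q_{1} &= \dfrac{3+4}{2} = 3.5 \\
Q_{3} &= \dfrac{12+14}{2} = 13 \\
\text{IQR} &= Q_{3}-Q_{1} = 13-3.5 = 9.5 \\
\text{Range} &= 19-1 = 18
\end{aligned}
```

For this set, the middle 50\% of the scores span 9.5 units, roughly half of the full range of 18.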

Examples

Example 2

Consider the following set of scores: 33,\,38,\,50,\,12,\,33,\,48,\,41

a

Sort the scores in ascending order.

Worked Solution
Create a strategy

Arrange the scores from smallest to largest.

Apply the idea

12,\,33,\,33,\,38,\,41,\,48,\,50

b

Find the number of scores.

Worked Solution
Create a strategy

Count the total number of scores.

Apply the idea

\text{Number of scores} = 7

c

Find the median.

Worked Solution
Create a strategy

Choose the middle score in the ordered list.

Apply the idea

The ordered scores are: 12,\,33,\,33,\,38,\,41,\,48,\,50

We can see that the middle score is 38, so this is the median.

d

Find the first quartile of the set of scores.

Worked Solution
Create a strategy

Use the first half of the scores excluding the median.

Apply the idea

The first half of the scores are: 12,\,33,\,33

The median of this set is 33.

So, the first quartile of the original set of scores is 33.

e

Find the third quartile of the set of scores.

Worked Solution
Create a strategy

Use the second half of the scores excluding the median.

Apply the idea

The second half of the scores are: 41,\,48,\,50

The median of this set is 48.

So, the third quartile of the original set of scores is 48.

f

Find the interquartile range.

Worked Solution
Create a strategy

We can use the interquartile range formula: \text{IQR} = Q_{3} - Q_{1}

Apply the idea
\text{IQR} = 48-33 (Substitute the quartiles)
\text{IQR} = 15 (Evaluate)

Example 3

Consider the set of scores shown in the following bar chart:

A bar chart showing scores from 30 to 70. Ask your teacher for more information.
a

Input the data in the following distribution table:

Score (x) | Freq (f) | fx | Cumulative Freq (cf)
30 | | |
40 | | |
50 | | |
60 | | |
70 | | |
Total | | |
Worked Solution
Create a strategy

Read the frequency of each score from the bar chart.

The fx column is the product of the first and second columns.

To find the cumulative frequency, add the frequency of each row to the cumulative frequency of the previous row.

Apply the idea

Here's the complete distribution table:

The last row shows the totals of the second and third columns.

Score (x) | Freq (f) | fx | Cumulative Freq (cf)
30 | 5 | 30 \times 5 = 150 | 5
40 | 5 | 40 \times 5 = 200 | 5+5=10
50 | 5 | 50 \times 5 = 250 | 5+10=15
60 | 1 | 60 \times 1 = 60 | 1+15=16
70 | 3 | 70 \times 3 = 210 | 3+16=19
Total | 19 | 870 |
b

Find the median score using the distribution table above.

Worked Solution
Create a strategy

Use the cumulative frequency column in part (a) to determine the middle score.

Apply the idea

Since there are 19 scores in total, the median will be the 10th score.

Looking at the cumulative frequency column in part (a), the cumulative frequency first reaches 10 at the score of 40, so the 10th score is 40. So, \text{Median}=40
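The table columns and the cumulative frequency lookup can be sketched in Python as follows. This is a minimal illustration rather than a required method: the helper name `kth_score` is hypothetical, and the frequencies are those read from the completed table in part (a). The same lookup works for any other position in the ordered data.

```python
from itertools import accumulate

# Scores and frequencies from the completed table in part (a).
scores = [30, 40, 50, 60, 70]
freqs = [5, 5, 5, 1, 3]

fx = [x * f for x, f in zip(scores, freqs)]
cumulative = list(accumulate(freqs))  # running total of the frequencies


def kth_score(k):
    """Return the k-th score (counting from 1) in the ordered data."""
    if k < 1:
        raise ValueError("k must be at least 1")
    for score, cf in zip(scores, cumulative):
        # The first row whose cumulative frequency reaches k contains the k-th score.
        if k <= cf:
            return score
    raise ValueError("k is larger than the number of scores")


n = cumulative[-1]               # total number of scores
median_position = (n + 1) // 2   # valid here because n is odd

print(fx, sum(fx))                                  # [150, 200, 250, 60, 210] 870
print(cumulative)                                   # [5, 10, 15, 16, 19]
print(median_position, kth_score(median_position))  # 10 40
```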

c

Find the first quartile score.

Worked Solution
Create a strategy

Determine the middle value of the lower half of the scores, excluding the median, in the distribution table in part (a).

Apply the idea

The median is the 10th score, so the lower half consists of the first 9 scores. The first quartile will be the middle of these 9 scores, which is the \dfrac{9+1}{2}=5th score. We can see from the table that the 5th score is 30.

Q_1=30

d

Find the third quartile score.

Worked Solution
Create a strategy

Determine the middle value of the upper half of the scores, excluding the median, in the distribution table in part (a).

Apply the idea

The median in part (b) is the 10th score, so the upper half consists of the last 9 scores. The third quartile will be the middle of these 9 scores, which is the 10+\dfrac{9+1}{2}=15th score. We can see from the table that the 15th score is 50.

Q_3=50

e

Find the interquartile range.

Worked Solution
Create a strategy

Use the interquartile range formula: \text{IQR} = Q_{3} - Q_{1}

Apply the idea

We can use Q_1=30 in part (c) and Q_3=50 in part (d).

\text{IQR} = 50-30 (Substitute the quartiles)
\text{IQR} = 20 (Evaluate)

Idea summary

\text{IQR} = Q_{3} - Q_{1}

The IQR is the interquartile range, which is the difference between the third quartile and the first quartile.

Standard Deviation

Standard deviation is a measure of spread, which helps give a meaningful estimate of the variability in a data set. While the quartiles gave us a measure of spread about the median, the standard deviation gives us a measure of spread with respect to the mean. It can be thought of as a typical distance of the data points from the mean (more precisely, the square root of the average squared distance). A small standard deviation indicates that most scores are close to the mean, while a large standard deviation indicates that the scores are more spread out away from the mean value.

The standard deviation can be calculated for a population or a sample.

The symbols used are:

\text{Population standard deviation} = \sigma \text{ (lowercase sigma)}
\text{Sample standard deviation} = s

In Statistics mode on a calculator, the following symbols might be used:

\text{Population standard deviation} = \sigma _n
\text{Sample standard deviation} = \sigma _{n-1}

Note: In this course, standard deviation only needs to be calculated using the statistics mode of a calculator. The formula that follows is included to show what the calculator is doing, not for calculation by hand.

The standard deviation is found by calculating the square root of the variance.

Variance is the average of the squared differences from the mean. Here is its formula. \sigma ^2=\dfrac{1}{n}\Sigma\left(x_i-\mu \right)^2

This is the formula by which a calculator calculates the standard deviation of a data set from a full population. That is, it is the formula used for census data rather than sample data. \sigma =\sqrt{\dfrac{1}{n}\Sigma\left(x_i-\mu \right)^2}

In this formula,

  • The numbers x_i are the values in the data set. There is one value for each subscript i.

  • There are n numbers x_i in the data set. So, i goes from 1 to n in the summation.

  • The symbol \mu (Greek letter 'mu') is the population mean.

  • The Greek letter \sigma (sigma) is used for the population standard deviation.

  • The symbol \Sigma (upper case sigma) is the summation symbol.
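The population formula above can be followed step by step in a short Python sketch. The function names and the eight-value data set are hypothetical, chosen only so that the arithmetic comes out to whole numbers.

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mu = sum(values) / n                      # population mean
    return sum((x - mu) ** 2 for x in values) / n


def population_sd(values):
    """Population standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 4.0
print(population_sd(data))        # 2.0
```

Python's standard library offers `statistics.pstdev` for the same population calculation, while `statistics.stdev` gives the sample version that divides by n-1.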

Simply put, standard deviation describes the spread of data by comparing the distance of each score to the mean. It is complicated to calculate, but it gives a lot of information about the spread of data because it takes into account every data point in the set.

Standard deviation is also a very powerful way of comparing different data sets, particularly if they have different means and population sizes.

Examples

Example 4

Find the population standard deviation of the following set of scores by using the Statistics mode on the calculator: 8,\,20,\,16,\,9,\,9,\,15,\,5,\,17,\,19,\,6

Round your answer to two decimal places.

Worked Solution
Create a strategy

Enter all the scores into your calculator using the Statistics function to find the population standard deviation.

Apply the idea

\text{Standard deviation}=5.30
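Although only the calculator result is required, the value can be checked against the population formula given earlier. The intermediate values below were worked out by hand as an illustration and are not part of the original solution.

```latex
\begin{aligned}
\mu &= \dfrac{8+20+16+9+9+15+5+17+19+6}{10} = \dfrac{124}{10} = 12.4 \\
\Sigma\left(x_i-\mu\right)^2 &= 280.4 \\
\sigma &= \sqrt{\dfrac{280.4}{10}} = \sqrt{28.04} \approx 5.30
\end{aligned}
```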

Example 5

Fill in the table and answer the questions below.

a

Complete the table given below.

Class | Class Centre | Frequency | fx
1-9 | | 8 |
10-18 | | 6 |
19-27 | | 4 |
28-36 | | 6 |
37-45 | | 8 |
Total | | |
Worked Solution
Create a strategy

To find the class centre, determine the average of the upper and lower bounds of each class.

To find fx, multiply each class centre by its corresponding frequency.

Apply the idea

Here's the complete table:

Class | Class Centre | Frequency | fx
1-9 | \dfrac{1+9}{2}=5 | 8 | 5 \times 8 = 40
10-18 | \dfrac{10+18}{2}=14 | 6 | 14 \times 6 = 84
19-27 | \dfrac{19+27}{2}=23 | 4 | 23 \times 4 = 92
28-36 | \dfrac{28+36}{2}=32 | 6 | 32 \times 6 = 192
37-45 | \dfrac{37+45}{2}=41 | 8 | 41 \times 8 = 328
Total | | 32 | 736
b

Use the class centres to estimate the mean of the data set, correct to two decimal places.

Worked Solution
Create a strategy

Use the formula: \text{Mean}=\dfrac{\Sigma fx}{\Sigma f}

Apply the idea

Looking at the table in part (a), the sum of the fx values is at the bottom of the fourth column: 40+84+92+192+328 gives \Sigma fx=736. The total number of scores, or the sum of the frequencies, is at the bottom of the third column: \Sigma f=32.

\text{Mean} = \dfrac{736}{32} (Substitute the values)
\text{Mean} = 23 (Evaluate using a calculator)
c

Use the class centres to estimate the population standard deviation, correct to two decimal places.

Worked Solution
Create a strategy

Use the Statistics mode on your calculator to find the population standard deviation.

Apply the idea

Enter the class centres and their corresponding frequencies. \sigma _n=13.87
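For readers without a calculator to hand, the grouped estimate can be sketched in Python. This treats every score in a class as if it sat at the class centre, which is what entering the class centres with frequencies into a calculator does. The function name `grouped_mean_and_sd` is hypothetical.

```python
import math


def grouped_mean_and_sd(centres, freqs):
    """Estimate the mean and population standard deviation of grouped data.

    Each class centre is weighted by its frequency.
    """
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(centres, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(centres, freqs)) / n
    return mean, math.sqrt(variance)


# Class centres and frequencies from the table in part (a).
class_centres = [5, 14, 23, 32, 41]
frequencies = [8, 6, 4, 6, 8]

mean, sigma = grouped_mean_and_sd(class_centres, frequencies)
print(mean)             # 23.0
print(round(sigma, 2))  # 13.87
```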

d

If we used the original ungrouped data to calculate standard deviation, do you expect that the ungrouped data would have a higher or lower standard deviation?

Worked Solution
Create a strategy

Recall the definition of the standard deviation.

Apply the idea

A small standard deviation indicates that most scores are close to the mean, while a large standard deviation indicates that the scores are more spread out away from the mean value.

So, we would expect the original ungrouped data to have a higher standard deviation. The grouped calculation treats every score in a class as if it were at the class centre, which ignores the variation of the scores within each class.

Idea summary

Standard deviation is a measure of how far the data values typically lie from the mean. The standard deviation can be calculated for a population (\sigma) or a sample (s).

The standard deviation is a more complex calculation but takes every data point into account. The standard deviation is significantly impacted by outliers.

For each measure of spread:

  • A larger value indicates a wider spread (more variable) data set.

  • A smaller value indicates a more tightly packed (less variable) data set.

Outcomes

U1.AoS1.4

mean \overline{x} and sample standard deviation s
