When describing the shape of data sets, it is often useful to focus on how the data is distributed and whether the shape is symmetrical or not. Recall that the measures of centre previously explored were the median, mean and mode. Skew is considered relative to a central measure.
Data may be described as symmetrical or asymmetrical.
Symmetrical data tends to sit around a central value with no bias to the left or right. In such a case, roughly 50 \% of scores will be above the mean and 50 \% of scores will be below the mean. In other words, the mean and median roughly coincide.
If a data set is asymmetrical instead, it may be described as skewed.
A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values above the peak of the graph, such that more than half of the scores are above the peak. This means the mean is greater than the median, which is greater than the mode: \text{mode} < \text{median} < \text{mean}
A positively skewed graph looks something like this:
A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values below the peak of the graph, such that more than half of the scores are below the peak. This means the mode is greater than the median, which is greater than the mean: \text{mean} < \text{median} < \text{mode}
A negatively skewed graph looks something like this:
State whether the scores in each histogram are positively skewed, negatively skewed or symmetrical (approximately).
A distribution is said to be symmetric if its left and right sides are mirror images of one another.
A uniform distribution is a symmetrical distribution where each outcome is equally likely, so the frequency should be the same for each outcome.
A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values to the right of the data set. The mass of the distribution is concentrated on the left of the figure. Most of the scores are relatively low.
A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values to the left of the data set. The mass of the distribution is concentrated on the right of the figure. Most of the scores are relatively high.
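The descriptions above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the three score lists and the helper name `centre_measures` are invented for this illustration and do not come from the lesson's graphs.

```python
from statistics import mean, median, mode

def centre_measures(scores):
    """Return (mode, median, mean) for a list of scores with a single mode."""
    return mode(scores), median(scores), mean(scores)

# Hypothetical scores, invented for this illustration.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
positive_skew = [1, 1, 1, 2, 2, 3, 4, 5, 9]   # long tail to the right
negative_skew = [1, 5, 6, 7, 8, 8, 9, 9, 9]   # long tail to the left

for name, scores in [("symmetric", symmetric),
                     ("positive skew", positive_skew),
                     ("negative skew", negative_skew)]:
    mo, med, avg = centre_measures(scores)
    print(f"{name}: mode={mo}, median={med}, mean={avg:.2f}")
```

For these particular scores, the symmetric list gives the same value for all three measures, the positively skewed list gives mode < median < mean, and the negatively skewed list gives mean < median < mode.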
In a set of data, a cluster occurs when a large number of the scores are grouped together within a very small range.
The shape of data also shows us whether there are any outliers, that is, unusually high or low values, in a data set.
For example, in the dot plot below, do you see how all the ages range between 12 and 14 except one? This means that 24 is an outlier.
In this case, the outlier is clearly far outside the range of the rest of the data set.
The formal definition of an outlier is a score that is more than 1.5 \times \text{IQR} above the upper quartile, or more than 1.5 \times \text{IQR} below the lower quartile. This will be discussed further in a later section.
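As a rough sketch of how this rule can be applied, the code below computes the quartiles and flags any score beyond the two cut-off values. The `ages` list is hypothetical (the exact dot plot values are not given here), the function names are made up for this example, and the quartiles are assumed to be the medians of the lower and upper halves of the ordered data; other quartile conventions can give slightly different cut-offs.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two scores. For an odd number of scores,
    the middle score is left out of both halves.
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + len(ordered) % 2:]
    return median(lower), median(upper)

def find_outliers(scores):
    """Return the scores lying more than 1.5 x IQR outside the quartiles."""
    q1, q3 = quartiles(scores)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in scores if x < low_fence or x > high_fence]

# Hypothetical ages, similar in spirit to the dot plot example.
ages = [12, 12, 12, 13, 13, 13, 13, 14, 14, 24]
print(find_outliers(ages))  # prints [24]
```

Here the lower quartile is 12 and the upper quartile is 14, so the IQR is 2 and any age below 9 or above 17 is flagged.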
Modality describes the prevalence of local peaks in a data set. The peaks don't necessarily need to be the mode of the whole data set, but rather a local cluster of data that is more frequent and stands out from the surrounding data. When looking at the modality of a data set, it is usually useful to examine a graph of the data. Modality is described by the number of peaks.
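One simple way to count peaks from a list of column heights is sketched below. It is only an illustration: the function name `count_peaks` and the frequency lists are invented, and the method counts every local bump, so with real data a small random bump would be counted as a peak even though a person reading the graph might ignore it.

```python
def count_peaks(frequencies):
    """Count local peaks in a list of column heights (frequencies).

    A peak is a column, or a flat run of equal columns, that is taller
    than the columns on both sides. The ends are treated as if they
    had empty columns beside them.
    """
    # Collapse runs of equal heights so a flat-topped peak counts once.
    collapsed = [f for i, f in enumerate(frequencies)
                 if i == 0 or f != frequencies[i - 1]]
    padded = [0] + collapsed + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Hypothetical frequency lists.
print(count_peaks([1, 3, 6, 3, 1]))        # one peak
print(count_peaks([2, 6, 2, 1, 5, 7, 3]))  # two peaks
```

A result of one corresponds to the data being described as unimodal, and a result of two to bimodal.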
For the stem-and-leaf plot below:
Stem | Leaf |
---|---|
0 | 5 |
1 | 7\ 8 |
2 | 0\ 8 |
3 | 1\ 3\ 3\ 7\ 8\ 9 |
4 | 1\ 3\ 5\ 8\ 8\ 8 |
5 | |
6 | |
7 | |
8 | |
9 | 2 |
Key 1\vert 2 = 12 |
Are there any outliers?
Identify the outlier.
Is there any clustering of data?
Where does the clustering occur?
What is the modal class (or classes)?
The distribution of the data is:
How many peaks are there on the graph?
In a set of data, a cluster occurs when a large number of the scores are grouped together within a very small range.
An outlier is a value that is either noticeably greater or smaller than other observations.
Modality is described by the number of peaks. The peaks don't necessarily need to be the mode of the whole data set but rather a local cluster of data that is more frequent and stands out from the surrounding data.
It is important to note that certain features in a data set can significantly affect one or more of the three measures of central tendency (the mean, median and mode).
Remember the mode is the most frequently occurring score. So, if a data set has a significant number of repeated scores, then the mode could be a good measure of centre.
If the range of scores is reasonably small and there are no outliers, then the mean is an appropriate measure of centre.
Unlike the mean, the median is hardly affected by outliers, because it depends only on the middle of the ordered data. So, the median is a good measure of central tendency if a data set has outliers or a large range.
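To illustrate the difference, the short sketch below compares the mean and median of a small set of ages before and after a single very high value is added. The ages are hypothetical and chosen only for this example.

```python
from statistics import mean, median

# Hypothetical ages, invented for this illustration.
ages = [12, 12, 13, 13, 14]
ages_with_outlier = ages + [24]

print("without outlier:", mean(ages), median(ages))
print("with outlier:   ", mean(ages_with_outlier), median(ages_with_outlier))
```

In this example the median stays at 13, while the mean rises from 12.8 to about 14.7, which is higher than every age except the outlier itself.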
The shape of the data may also determine which measure of central tendency is the most appropriate measure of a data set.
If a data set is symmetrical, then the mean and median will be approximately equal. If the data is unimodal (has only one mode), then the mode will also be approximately equal to them. If the data has more than one mode (e.g. if it is bimodal), then the modes may be different from the mean and median.
When data is positively skewed, the mean is the highest measure of central tendency and the mode is the lowest measure of central tendency. For positively skewed data: \text{mode}<\text{median}<\text{mean}
When data is negatively skewed, the mode is the highest measure of central tendency and the mean is the lowest measure of central tendency. For negatively skewed data: \text{mean}<\text{median}<\text{mode}
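Because the mean is pulled towards the long tail and the mode sits at the peak, the median lies between them. Therefore, in skewed data, the most appropriate measure of central tendency will usually be the median.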
Here is a basic summary of selecting an appropriate measure of central tendency. Note that often it can be helpful to consider more than one measure.
Data set ... | Mean | Median | Mode |
---|---|---|---|
has outliers | | yes | |
has many repeated values | | | yes |
has a relatively small range | yes | | |
is skewed | | yes | |
Of course, sometimes the context of the data being analysed lends itself to particular measures as well.
Which measure of centre would be best for the following data set? 15,\,13,\,16,\,17,\,15,\,15,\,15
Every week over 45 weeks, a kayaking club runs social sessions that are open to the public. At each session, the number of people who attend is recorded. The results are displayed in the table provided.
Number of people attending | Number of weeks |
---|---|
12 | 6 |
13 | 5 |
14 | 6 |
15 | 5 |
16 | 6 |
17 | 5 |
18 | 6 |
19 | 5 |
20 | 6 |
Considering the distribution of the responses, which of the following is true?