Summary statistics
Often the most useful thing you can do with a new dataset is to get a feel for its contents by generating some summary statistics. A statistic is simply a number, calculated from the data, that measures some attribute of it. Summary statistics, as their name suggests, are numbers which summarise the dataset.
You have probably already come across three measures of central tendency: the mean, the median and the mode. To recap: to obtain the mean, add all the values and divide by the number of cases. The mean can give a misleading picture of the central tendency if there are a small number of very large or very small values, so you might prefer to use the median: rank the values in order and select the middle value. In a dataset with an even number of values, take the mean of the two in the middle. To obtain the mode, count how often each value occurs and select the value that occurs most often; a dataset can have more than one mode.
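The three recipes above can be tried out with a minimal Python sketch. It uses the standard library's statistics module; the values are invented purely for illustration and include one deliberately extreme number to show how the mean is pulled away from the median.

```python
from statistics import mean, median, multimode

# Hypothetical values, invented for illustration; 40 is deliberately extreme.
values = [3, 5, 5, 8, 40]

print(mean(values))       # 12.2 (dragged upwards by the single value of 40)
print(median(values))     # 5    (the middle value once the data are ranked)
print(multimode(values))  # [5]  (a list, because there can be more than one mode)

# With an even number of values, the median is the mean of the middle two.
print(median([3, 5, 8, 40]))  # 6.5
```

multimode is used here rather than mode because it returns every value that ties for most frequent.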
These statistics tell you about the ‘location’ of the middle of the data, but they don’t tell you about the dispersion, or how widely spread the values are. Three buildings with heights of 14, 15 and 16 metres have the same mean height (15 metres) as three buildings of 10, 15 and 20 metres, even though the second group is far more spread out. You might want to know the range, i.e. the difference between the minimum and maximum values. You might also want to know the quartiles, i.e. the three points that divide the ranked data into four equal-sized groups. The inter-quartile range, i.e. the difference between the upper and lower quartiles, tells you how widely the middle half of the data is spread. By generating a boxplot you can display this information graphically.
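The building example can be checked with a short Python sketch. The helper name spread is hypothetical, and the quartiles are computed with the "inclusive" method of statistics.quantiles, which is only one of several conventions; other software may give slightly different quartiles for the same data.

```python
from statistics import mean, quantiles

def spread(values):
    """Return (range, inter-quartile range) for a list of numbers."""
    # Assumption: the "inclusive" quartile convention; other conventions exist.
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

group_a = [14, 15, 16]  # building heights in metres
group_b = [10, 15, 20]

print(mean(group_a), mean(group_b))  # 15 15 (identical means)
print(spread(group_a))               # (2, 1.0)
print(spread(group_b))               # (10, 5.0)
```

Both groups share the same mean, but the range and the inter-quartile range are five times larger for the second group.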
Imagine you have asked 20 children aged 13 to 14 how much weekly pocket money they receive, and recorded the answers:
UID  Pocketmoney  Age
  1         2.00   13
  2         5.00   14
  3        20.00   14
  4        10.00   13
  5         8.50   14
  6         5.50   13
  7         6.25   13
  8         7.00   14
  9        10.00   13
 10         9.00   13
 11         8.25   14
 12        12.50   13
 13        11.00   13
 14        11.25   13
 15         9.50   13
 16        25.00   14
 17        50.00   13
 18         5.00   14
 19         8.50   13
 20        14.00   14
Mean: 11.91
Median: 9.25
Range: 48.00
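These three figures can be reproduced with a few lines of Python. This is a sketch using the standard library only; the variable name pocket_money is an assumption, and the values are copied from the Pocketmoney column in UID order.

```python
from statistics import mean, median, multimode

# Pocketmoney column, in UID order 1 to 20
pocket_money = [2.00, 5.00, 20.00, 10.00, 8.50, 5.50, 6.25, 7.00, 10.00, 9.00,
                8.25, 12.50, 11.00, 11.25, 9.50, 25.00, 50.00, 5.00, 8.50, 14.00]

print(f"Mean:   {mean(pocket_money):.2f}")                      # Mean:   11.91
print(f"Median: {median(pocket_money):.2f}")                    # Median: 9.25
print(f"Range:  {max(pocket_money) - min(pocket_money):.2f}")   # Range:  48.00

# Three values tie for most frequent, so this dataset has three modes.
print(sorted(multimode(pocket_money)))                          # [5.0, 8.5, 10.0]
```

The mean is noticeably higher than the median because of the single answer of 50.00, which is the situation in which the median is the more representative measure.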
A boxplot generated in SPSS shows at a glance the distribution of results in the two age categories (Figure 1). The thick centre lines indicate the medians. The upper and lower bounds of the boxes mark the upper and lower quartiles, so the height of each box is the inter-quartile range. The ‘whiskers’ extend over approximately the top and bottom quarters of the data, excluding outliers. The data point marked * indicates an outlier, i.e. an unusually high or low result; SPSS reserves the * symbol for extreme values lying more than three box lengths beyond the edge of the box. The width of the boxes has no significance.
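The numbers behind such a boxplot can be approximated in Python. The sketch below is not SPSS's own procedure: it assumes the box edges are Tukey's hinges (the medians of the lower and upper halves of the ranked data) and flags anything more than 1.5 box lengths beyond the box as an outlier, which is a common convention. The names records and box_summary are hypothetical.

```python
from statistics import median

# (pocket money, age) pairs copied from the data table, in UID order
records = [
    (2.00, 13), (5.00, 14), (20.00, 14), (10.00, 13), (8.50, 14),
    (5.50, 13), (6.25, 13), (7.00, 14), (10.00, 13), (9.00, 13),
    (8.25, 14), (12.50, 13), (11.00, 13), (11.25, 13), (9.50, 13),
    (25.00, 14), (50.00, 13), (5.00, 14), (8.50, 13), (14.00, 14),
]

def box_summary(values, k=1.5):
    """Median, box edges and outliers, assuming Tukey's hinges and a k-box-length rule."""
    s = sorted(values)
    half = (len(s) + 1) // 2          # the middle value joins both halves when n is odd
    lower = median(s[:half])          # lower hinge (bottom edge of the box)
    upper = median(s[-half:])         # upper hinge (top edge of the box)
    box = upper - lower               # box length, i.e. the inter-quartile range
    outliers = [v for v in s if v < lower - k * box or v > upper + k * box]
    return {"median": median(s), "lower": lower, "upper": upper, "outliers": outliers}

# Group the amounts by age, then summarise each group.
by_age = {}
for amount, age in records:
    by_age.setdefault(age, []).append(amount)

summary = {age: box_summary(amounts) for age, amounts in sorted(by_age.items())}
for age, stats in summary.items():
    print(age, stats)
```

Under these assumptions the 13-year-olds have a median of 9.75 with a single flagged value of 50.00, and the 14-year-olds have a median of 8.375 with no flagged values. A different quartile convention can move the box edges slightly and change which points are flagged.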