There are four types of data, or levels of measurement, that show up in databases. Everything that goes into a database or shows up on a survey corresponds to one of these levels, from address fields to income. It is important to recognize which is which, because not every statistical measure is valid for every level, and applying the wrong measure makes it easy to draw the wrong conclusions about the data.
The four types of data are:
- Nominal data consists of labels. Any number attached to a label is an arbitrary code, so the data cannot be meaningfully ranked, and no meaningful math can be performed on it. Examples include survey questions such as “What is your political party?” (“1) Democrat, 2) Republican…”) and “What kind of computer do you own?” (“1) Apple, 2) Dell, 3) HP…”).
Data that is nominal cannot be directly compared. You can’t say that Apple is greater than or less than Dell, or that Red is greater than or less than Blue. Addition is also meaningless: what is Laptop + Desktop equal to?
The most common statistics for nominal data relate to frequency. We can compare how often each item shows up, find the mode, and perform unions and intersections of sets of data.
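As a minimal sketch of these frequency-based statistics, the snippet below counts labels, finds the mode, and takes a union and an intersection. The response list and the two "office" sets are hypothetical data made up for illustration; only Python's standard library is used.

```python
from collections import Counter

# Hypothetical survey responses; the labels are illustrative only.
computers = ["Apple", "Dell", "HP", "Dell", "Apple", "Dell"]

# Frequency of each label.
counts = Counter(computers)

# most_common(1) returns a list holding the single most frequent (label, count) pair.
mode_label, mode_count = counts.most_common(1)[0]

# Set operations on two hypothetical groups of owners' brands.
office_a = {"Apple", "Dell"}
office_b = {"Dell", "HP"}
in_both = office_a & office_b    # intersection
in_either = office_a | office_b  # union

print(counts, mode_label, in_both, in_either)
```

Note that nothing here compares or adds the labels themselves; only their counts are treated as numbers.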
- Ordinal data can be ranked (1st, 2nd, 3rd), but cannot be added or subtracted meaningfully. Examples include questions such as “On a scale of 1-5, with 1 being ‘least satisfied’ and 5 being ‘most satisfied,’ how satisfied are you with your work station?” and “How much do you make each year? 1) < \$30k 2) \$30k to \$50k, etc.”
In these cases we can make direct comparisons, but we still cannot add, subtract, multiply or divide meaningfully.
One important characteristic of ordinal data is that it cannot be meaningfully averaged, since the distance between any two ranks (say, between 1 and 5) is entirely arbitrary. What we can do to determine central tendency is take the median. As an example of how this might work, consider a questionnaire where 55 people have been asked to rate their job satisfaction on a scale of 1 to 5: 10 people report 1, 5 report 2, 10 report 3, 20 report 4, and 10 report 5.
The mean of this sample is about 3.3 (180/55), which doesn’t mean much. The median is 4, which tells us that at least 50% of the people are at 4 or above, and it is a better measure of central tendency. It is very common to see companies take the arithmetic mean of ordinal data as part of their standard reports, which can lead to incorrect conclusions about the data.
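The arithmetic for the job satisfaction example can be checked with a short sketch. The tally comes straight from the example; the variable names are my own, and the mean is computed only to show what a report would print, not because it is a valid measure for ordinal data.

```python
from statistics import mean, median

# Tally from the example: rating -> number of people who chose it.
tally = {1: 10, 2: 5, 3: 10, 4: 20, 5: 10}

# Expand the tally into one entry per respondent (55 in total).
responses = [rating for rating, n in tally.items() for _ in range(n)]

sample_mean = mean(responses)      # 180 / 55, roughly 3.27
sample_median = median(responses)  # the 28th of 55 sorted values

# Share of respondents at or above the median rating.
share_at_or_above = sum(r >= sample_median for r in responses) / len(responses)

print(round(sample_mean, 2), sample_median, round(share_at_or_above, 3))
```

Here 30 of the 55 respondents sit at 4 or above, which is the statement the median supports and the mean does not.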
- Interval data consists of numbers on an evenly spaced scale. With interval data not only can two values be compared, but the difference between those values is meaningful. Adding two values together or taking their ratio is still not meaningful, because the zero point is arbitrary, but the arithmetic mean is.
An example here might be an individual’s year of birth. If we have three people who were born in 1980, 1978, and 1982, we know that there is a four-year difference between the oldest and the youngest and that the average birth year is 1980, but 1980+1978=3958 isn’t particularly meaningful.
We can perform most basic analysis on interval data. We can take the mean and the median, for example, and can also compute weighted averages and standard deviations. However, measures that depend on a true zero, such as the geometric mean or the coefficient of variation, are unavailable; most of these are rarely used in business settings anyway.
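A small sketch using the birth years from the example shows the measures that are available for interval data. The choice of the population standard deviation (`pstdev`) rather than the sample version is an assumption made for illustration.

```python
from statistics import mean, median, pstdev

# Birth years from the example above.
birth_years = [1980, 1978, 1982]

avg_year = mean(birth_years)                # average birth year
mid_year = median(birth_years)              # middle value when sorted
span = max(birth_years) - min(birth_years)  # difference between oldest and youngest

# Population standard deviation; treating these three people as the
# whole population is an assumption, not something the text specifies.
spread = pstdev(birth_years)

print(avg_year, mid_year, span, round(spread, 3))
```

Differences and averages behave sensibly here, while a sum such as 1980 + 1978 has no interpretation.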
- With ratio data the ratio between two numbers is meaningful. One can meaningfully say that “a is twice b.” To make this possible, ratio measures have a non-arbitrary zero value.
An example of a ratio measurement would be income: 0 is the breaking point between owing money and having money, and we can meaningfully say that \$200k is twice \$100k.
All statistical measures can be used here, including the geometric mean and the arithmetic mean. A word of caution though: just because they can be used does not mean that they should be used. Arithmetic means are sensitive to outliers, and certain tests depend on the data being normally distributed. More on this at a later date.
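To illustrate the caution about outliers, here is a rough sketch comparing the arithmetic mean, geometric mean, and median on a handful of incomes. The income figures are hypothetical, and the sketch assumes all values are positive, since `statistics.geometric_mean` (Python 3.8+) rejects negative inputs.

```python
from statistics import geometric_mean, mean, median

# Hypothetical incomes in dollars; the last value acts as an outlier.
incomes = [100_000, 200_000, 50_000, 40_000, 1_000_000]

# Ratios are meaningful on a ratio scale: the second income is twice the first.
ratio = incomes[1] / incomes[0]

arith = mean(incomes)          # pulled upward by the single large income
geo = geometric_mean(incomes)  # less affected by the outlier
mid = median(incomes)          # unaffected by how large the outlier is

print(ratio, arith, round(geo), mid)
```

With this made-up data the arithmetic mean lands well above what four of the five people earn, which is the kind of distortion the paragraph above warns about.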