The essence of describing datasets is to present a mass of data in a more understandable form. We may choose to describe the data by various graphical displays, which show the distribution of data among various intervals of the varying quantity. It is often necessary or desirable to consider the data in groups and determine the frequency for each group. In describing a dataset, we consider its central tendency and its variability, or spread. In some situations, these measures give a false picture of the data, so we also discuss the limits of our ability to picture the data fully.
Central Tendencies
Mean
First, let us discuss how to describe the central location of our data. Various “averages” are used to indicate a central value of a set of data. Some of these are referred to as means, and there are several kinds.
- Arithmetic mean: Of these “averages”, the most common and familiar is the arithmetic mean. This is simply the sum of all elements in a population/sample divided by the size of the population/sample.
- Logarithmic mean: Read more…
- Geometric mean: Read more…
- Harmonic mean: Read more…
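The means listed above can be sketched in a few lines of Python. This is only an illustration: the data values are made up, the helper `logarithmic_mean` is a hypothetical name, and the logarithmic mean is shown in its usual two-argument form for positive, distinct numbers.

```python
import math
import statistics

# Hypothetical sample; all values are positive so every mean below is defined.
data = [2.0, 4.0, 8.0]

# Arithmetic mean: sum of all elements divided by the number of elements.
arithmetic = sum(data) / len(data)

# Geometric mean: N-th root of the product of the N elements.
geometric = statistics.geometric_mean(data)

# Harmonic mean: N divided by the sum of the reciprocals.
harmonic = statistics.harmonic_mean(data)


def logarithmic_mean(a: float, b: float) -> float:
    """Logarithmic mean of two positive, distinct numbers (hypothetical helper)."""
    return (b - a) / (math.log(b) - math.log(a))


print(arithmetic, geometric, harmonic, logarithmic_mean(2.0, 8.0))
```

For positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.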
The major drawback of the mean is that it is strongly affected by outliers, because every element of the sample/population is taken into account when it is computed. Think of Cristiano Ronaldo at Real Madrid, or Michael Jordan on his high school or college team. Such outliers shift the mean because of their exceedingly high or low values. The mean is usually most useful for populations/samples that are symmetric and have no outliers.
Median
The median is another measure of central location. It is simply the middle value of the population once it is sorted in order of increasing magnitude; when the number of items is even, it is the average of the two middle values. One desirable property of the median over the mean is that it is not much affected by outliers. For example, if the largest value in a dataset is 40 and it is replaced by 140, the median is unchanged, whereas the arithmetic mean changes appreciably. But along with this advantage goes a disadvantage: changing the size of any item without changing its position in the order of magnitude often has no effect on the median, so some information is lost.
If a distribution of items is very asymmetrical so that there are many more items larger than the arithmetic mean than smaller (or vice-versa), the median may be a more useful representative quantity than the arithmetic mean.
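The contrast between the two measures can be checked with a small sketch. The five values below are hypothetical, chosen only so that 40 is the largest item and can be replaced by 140.

```python
import statistics

# Hypothetical data, chosen only so that 40 is the largest value.
original = [10, 20, 25, 30, 40]
modified = [10, 20, 25, 30, 140]  # 40 replaced by 140

# The mean moves with the outlier; the median stays at the middle item.
print(statistics.mean(original), statistics.median(original))  # mean 25, median 25
print(statistics.mean(modified), statistics.median(modified))  # mean 45, median 25
```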
Mode
The mode is a third measure of central location. If the frequency varies from one item to another, the mode is the value that appears most frequently. In the case of continuous variables the frequency depends upon how many digits are quoted, so the mode is more usefully considered as the midpoint of the class with the largest frequency.
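Both readings of the mode can be sketched as follows. The data, the class width of 0.5, and the helper name `modal_class_midpoint` are assumptions made for illustration; a different class width can give a different modal class.

```python
import statistics
from collections import Counter

# Discrete data: the mode is the most frequent value.
shoe_sizes = [7, 8, 8, 9, 8, 10, 9]
discrete_mode = statistics.mode(shoe_sizes)

# Continuous data (hypothetical measurements): group the values into classes
# of equal width, then report the midpoint of the most populated class.
measurements = [1.02, 1.48, 1.51, 1.55, 1.62, 1.97, 2.31]
class_width = 0.5


def modal_class_midpoint(values, width):
    # Class k covers the half-open interval [k * width, (k + 1) * width).
    counts = Counter(int(v // width) for v in values)
    k, _ = counts.most_common(1)[0]
    return (k + 0.5) * width


binned_mode = modal_class_midpoint(measurements, class_width)
print(discrete_mode, binned_mode)
```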
Spread of the Data
The measures in this section describe how variable, or spread out, the data are.
Sample range
One simple measure of variability is the sample range, the difference between the smallest item and the largest item in each sample.
Interquartile Range
The interquartile range is the difference between the upper quartile and the lower quartile, that is, $Q_3 - Q_1$. It is used fairly frequently as a measure of variability, particularly in the box plot.
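The sample range and the interquartile range can both be computed in a few lines. This is a sketch on made-up data; note that several conventions for locating quartiles exist, and the one used here is simply the default of Python's `statistics.quantiles`, so other tools may report slightly different quartiles.

```python
import statistics

data = [4, 7, 9, 11, 12, 15, 18, 21, 30]  # hypothetical sample

# Sample range: largest item minus smallest item.
sample_range = max(data) - min(data)

# With n=4, statistics.quantiles returns the three quartile cut points.
# Quartile conventions differ between tools; this is the library default.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: upper quartile minus lower quartile.
iqr = q3 - q1

print(sample_range, q1, q3, iqr)
```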
Mean Deviation
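The mean deviation from the mean, defined as
$$\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu),$$
where the mean is $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, is useless as a measure of spread because it is always zero for any dataset, symmetrical or not: the positive and negative deviations cancel exactly.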
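A short derivation, using only the definition of the mean $\mu$ as the sum of the $x_i$ divided by $N$, shows why the deviations sum to zero for every dataset and not only for symmetrical ones:

```latex
\[
\sum_{i=1}^{N} (x_i - \mu)
  = \sum_{i=1}^{N} x_i - N\mu
  = N\mu - N\mu
  = 0 .
\]
```

Dividing by $N$ leaves the result at zero, so this quantity carries no information about spread.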
Mean Absolute Deviation from the Mean
However, the mean absolute deviation from the mean is defined as
$$\frac{1}{N}\sum_{i=1}^{N} |x_i - \mu|,$$
where the mean is $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.
This measure is frequently used to show the variability of data, although it is usually not the best choice. Its advantage is that it is simpler to calculate than the main alternative, the standard deviation, and, unlike the mean deviation, its terms do not cancel one another.
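A minimal sketch of the calculation, on made-up data, with the plain mean deviation computed alongside for comparison. The function names are illustrative only.

```python
def mean_deviation(values):
    """Mean of the signed deviations from the mean."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values) / len(values)


def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; the mean is 5

print(mean_deviation(data))           # signed deviations cancel: 0.0
print(mean_absolute_deviation(data))  # absolute deviations sum to 12: 1.5
```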
Variance, v / Standard Deviation, σ
The variance is one of the most important descriptions of variability. It is defined as
$$v = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,$$
$$v = \sigma^2.$$
The standard deviation, σ, has the same units as the original data and is representative of the deviations from the mean. Because of the squaring, it gives more weight to larger deviations than to smaller ones. Since the variance is the mean square of the deviations from the population mean, the standard deviation is the root-mean-square deviation from the population mean.
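The definitions can be followed step by step in code. The data below are hypothetical and are treated as a complete population, so the divisor is N.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, treated as a whole population
N = len(data)
mu = sum(data) / N

# Variance: mean of the squared deviations from the population mean.
variance = sum((x - mu) ** 2 for x in data) / N

# Standard deviation: root-mean-square deviation, in the units of the data.
sigma = math.sqrt(variance)

print(variance, sigma)  # 4.0 2.0

# The standard library's population variance agrees with the hand calculation.
print(statistics.pvariance(data))
```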
Root-mean-square quantities are also important in describing the alternating current of electricity. An analogy can be drawn between the standard deviation and the radius of gyration encountered in applied mechanics.
Bessel's Correction
When only a sample is available, the sample mean is used as an estimate of the population mean. Because the deviations are then measured from the sample mean rather than from the true population mean, the resulting estimate of the variance tends to be too small. Bessel's correction sets out to resolve this problem: the estimate of variance obtained using the sample mean in place of the population mean can be made unbiased by multiplying by the factor $N/(N-1)$. The estimate of σ is given the symbol s and is called the sample standard deviation. Sometimes the corrected variance estimate will be high, sometimes it will be low, but in the long run it will show no bias if samples are taken randomly. The result of Bessel's correction is that we have
$$v = s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2,$$
where $v$ is the sample variance and $\bar{x}$ is the sample mean.
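The effect of the correction on a single sample can be sketched as follows. The sample values are made up, and the comparison with `statistics.variance` relies on that function using the N-1 divisor.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample from a larger population
N = len(sample)
x_bar = sum(sample) / N

# Uncorrected estimate: divide the sum of squared deviations by N.
uncorrected = sum((x - x_bar) ** 2 for x in sample) / N

# Bessel's correction: multiply by N / (N - 1), i.e. divide by N - 1 instead.
corrected = uncorrected * N / (N - 1)

# Sample standard deviation s.
s = math.sqrt(corrected)

print(uncorrected, corrected, s)
print(statistics.variance(sample))  # library sample variance, for comparison
```

The correction matters most for small samples; as N grows, the factor N / (N - 1) approaches 1.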
Coefficient of Variation
A dimensionless quantity, the coefficient of variation is the ratio between the standard deviation and the mean for the same set of data, expressed as a percentage. This can be either $\sigma / \mu$ or $s / \bar{x}$, whichever is appropriate, multiplied by 100%.
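A minimal sketch of both variants follows. The function name and its `is_sample` flag are hypothetical, and the guard against a zero mean is an added assumption, since the ratio is undefined in that case.

```python
import statistics


def coefficient_of_variation(values, is_sample=True):
    """Standard deviation divided by the mean, as a percentage."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Use s (divisor N - 1) for a sample, sigma (divisor N) for a population.
    sd = statistics.stdev(values) if is_sample else statistics.pstdev(values)
    return sd / mean * 100.0


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; the mean is 5

cv_population = coefficient_of_variation(data, is_sample=False)
cv_sample = coefficient_of_variation(data, is_sample=True)
print(cv_population, cv_sample)
```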
Limitations
The measures discussed so far are able to describe a dataset in most cases. In certain situations, however, they cannot give us a full picture of our data.
For instance, consider a table of paired values of x and y.
For such a dataset, the measures we have learnt do not give a true description of the data, even though the relationship between the two columns is simply $y = |x|$.
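One way to see the limitation is sketched below. The values are hypothetical and are not the table from the text: they are built so that y = |x|, and the y column is then reordered so that the relationship is destroyed while every per-column summary stays the same.

```python
import statistics

# Hypothetical table, constructed so that y = |x| (not the original table).
x = [-3, -2, -1, 0, 1, 2, 3]
y = [abs(v) for v in x]

# Same y values in a different order: the link with x is gone.
y_reordered = sorted(y)


def summary(values):
    """Centre and spread measures for a single column."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
        max(values) - min(values),
    )


print(summary(y))
print(summary(y_reordered))  # identical summaries, very different pairing with x
```

The summaries describe each column on its own, so they cannot reveal how the two columns are related.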