The formula to compute the standard deviation from a sample of data looks complex – ugly, even. But understanding the notion of variation (e.g., standard deviation) is fundamental to statistical thinking.

At its core, the computation for sample standard deviation is very sensible (like so many other complex-appearing mathematical formulas, e.g., the distance formula). Essentially, the formula computes a typical deviation – on “average”, how far the data are from the mean. Hence the formula finds the difference between each data point and the mean. Squaring (and later taking the square root) is important: without it, the sum of all the deviations (which include positive and negative values) would always be zero – a useless calculation. [Absolute values can be used instead of squaring – i.e., the mean absolute deviation – though squaring is often preferred because, in contrast to absolute value, squaring is a smooth function with nice derivatives.]
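To see why the raw deviations alone are useless, here is a minimal sketch (with made-up data) of the zero-sum problem and the two fixes – absolute values and squaring:

```python
# Sketch with hypothetical data: deviations from the mean always sum to zero,
# so we must square them (or take absolute values) before averaging.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n  # 5.0

deviations = [x - mean for x in data]
print(sum(deviations))  # 0.0 -- the "average" raw deviation tells us nothing

# Fix 1: mean absolute deviation -- average of |x - mean|
mad = sum(abs(d) for d in deviations) / n

# Fix 2: square, average, then square-root (a root-mean-square deviation)
sd = (sum(d * d for d in deviations) / n) ** 0.5
print(mad, sd)
```

Both fixes produce a positive "typical deviation"; the squared version is the one that leads to the standard deviation discussed here.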

Yet, if the goal is an “average” deviation, why would you divide by *n-1* (and not *n*)? When you compute the mean, or average, you sum all the values and divide by *n*, so why does the sample standard deviation divide by *n-1*? This is an important question – one that mathematics and statistics educators need to be ready to answer.

Some might say something to the effect of: well, it has to do with the “degrees of freedom” – there are *n-1* degrees of freedom, so we divide by *n-1*. This is, to some degree, explanatory. But for the most part, this response feels like a smoke screen – it masks the explanation with sophisticated jargon to cover up the fact that, well, it’s complicated. It’s a torero holding out a muleta as though the real deal were just behind it. Others might say that dividing by *n-1* results in a slightly larger value than dividing by *n*, which provides a little bit of “wiggle room” for describing a typical deviation (e.g., data are unpredictable, so we need some buffer built into our statistics). Again, partly true. But then why not *n-2*? Might that be even better? Or why not *n-1.5* (which is, in fact, at times better)? The real truth about variance and standard deviation comes from a fundamental distinction between *descriptive* and *inferential* statistics.

*Descriptive statistics* attempt to describe a dataset. Mean, median, and mode can each be regarded as a typical value – a measure of central tendency. In this sense, the standard deviation acts as a description of how spread out (varied) the data are from the mean. It is a measure of spread – and one whose units are the same as the data’s. For normally distributed data, “fatter” distributions have more variance and thus a larger standard deviation, and “skinnier” distributions have less variance and thus a smaller standard deviation. Standard deviation measures a *typical* deviation from the mean, and as such plays a descriptive role – describing a feature (spread) of the dataset’s distribution. But if this is the case, why don’t we divide by *n*? Well, in fact, we might – if the data comprised an entire population, and not just a sample.
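Python’s standard library happens to encode exactly this population/sample distinction: `statistics.pstdev` divides by *n* (the population, or purely descriptive, version), while `statistics.stdev` divides by *n-1*. A minimal sketch with made-up data:

```python
# pstdev divides by n (population / descriptive);
# stdev divides by n - 1 (sample / inferential).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

pop_sd = statistics.pstdev(data)  # divides by n
samp_sd = statistics.stdev(data)  # divides by n - 1, slightly larger
print(pop_sd, samp_sd)
```

The two values differ only slightly, which is consistent with the near-descriptive use of the sample version discussed below.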

*Inferential statistics* attempt to infer information about an entire population from a smaller sample. They are a “best guess”. And this, in fact, is the most common explanation for why the standard deviation formula divides by *n-1*. The value has less to do with describing the current dataset and more to do with inferring something about the population’s dataset. If you really only want to describe the current dataset, then a true “average” deviation (dividing by *n*) would likely be better. However, dividing by *n-1* instead of *n* gives very similar numbers (often within hundredths of each other) – so much so that some argue for just dividing by *n* in most cases – and so the corrected sample standard deviation (dividing by *n-1*) also serves as a near-descriptive statistic for a dataset. But its main purpose is inferential. Based on the arithmetic of expected values, the square of the sample standard deviation, *s*^2 (and not the square of the true “average” deviation), is an unbiased estimator for the population parameter, variance. (Proof: Unbiased Estimator.) In general, standard deviation is referred to more often than variance – because it is simpler to grasp conceptually (i.e., it has the same units as the data) – but its calculation derives primarily from the fact that computing variance in this way (dividing by *n-1*) gives an unbiased estimate of the population’s variance. (The same is not quite true for standard deviation itself: the corrected sample standard deviation gives a *better* estimate of the population parameter than the uncorrected one (dividing by *n*), though it is not completely unbiased.) Briefly, we note that there are times when other estimators may be preferable; the maximum likelihood estimator for variance, for example, divides by *n* and has a lower mean squared error.
The primary point, however, is that the statistics we use are often selected for their inferential ability to estimate, not just their descriptive power.
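The unbiasedness claim can be checked empirically. The sketch below (an assumed simulation setup, not from the original text) draws many small samples from a population with known variance and averages the two competing estimates: dividing by *n-1* centers on the true variance, while dividing by *n* systematically undershoots by a factor of (*n-1*)/*n*:

```python
# Simulation sketch (assumed setup): population is Normal(0, sd=2), so the
# true variance is 4. Compare the long-run average of the two estimators.
import random

random.seed(0)
true_var = 4.0
n, trials = 5, 200_000

sum_biased = sum_unbiased = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    sum_biased += ss / n          # "true average" squared deviation
    sum_unbiased += ss / (n - 1)  # corrected sample variance s^2

print(sum_biased / trials)    # ≈ 3.2, i.e. true_var * (n-1)/n -- too small
print(sum_unbiased / trials)  # ≈ 4.0 -- unbiased
```

With samples this small (n = 5) the bias of dividing by *n* is glaring; with larger *n* the two estimates converge, which is why the choice matters little descriptively but much inferentially.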

Sample standard deviation is prevalent in statistics for a variety of reasons. Firstly, it is rare that one ever has an entire population’s data; it is much more common to have a sample. Secondly, standard deviation is linked to one of the fundamental theorems in probability and statistics: the Central Limit Theorem (CLT). The CLT indicates that, regardless of the underlying distribution of a population’s data, the distribution of the mean of random samples of size *n* – with *n* sufficiently large (commonly, greater than about 30) and the observations independent and identically distributed with common mean *mu* and variance *sigma*^2 – will be approximately normal, N(*mu*, *sigma*^2/*n*). Given that the square of the sample standard deviation, *s*^2, is an unbiased estimator of the variance *sigma*^2, and as a result of the CLT, the computation *s*/sqrt(*n*) is frequently used to provide confidence intervals for the true mean of a population. It is fairly incredible that from a single sample of, say, 100 people, we can provide a range that, with relatively high confidence (most frequently, 95% is used), contains the true population mean.
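A sketch of this in action, under an assumed setup: sampling from a deliberately skewed (exponential) population, the interval *xbar* ± 1.96·*s*/sqrt(*n*) still covers the true mean close to 95% of the time:

```python
# Sketch (hypothetical setup): repeatedly draw samples of n = 100 from an
# Exponential(1) population (true mean 1, far from normal), build the
# interval xbar +/- 1.96 * s / sqrt(n), and count how often it covers 1.
import math
import random

random.seed(1)
true_mean = 1.0
n, trials = 100, 20_000

covered = 0
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # divide by n-1
    half = 1.96 * s / math.sqrt(n)
    if xbar - half <= true_mean <= xbar + half:
        covered += 1

print(covered / trials)  # close to the nominal 0.95 despite the skewed population
```

That the coverage lands near 95% even for a heavily skewed population is precisely the practical force of the CLT.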

In closing, it is important to recognize that statistics are not always meant to be descriptive; many serve a primarily inferential role. But this distinction is not always intuitive, for students or for teachers (e.g., Casey & Wasserman, 2015), and it must be made clearer. Although the relationships among standard deviation, variance, bias, and estimators are more nuanced, the broader idea that statistics are computed to have inferential meanings, and not just descriptive ones, is critical. Such key understandings must serve to guide our instruction. Otherwise we, as educators, risk providing students with smoke screens as a substitute for real reasoning and understanding.

Reference: Casey, S., & Wasserman, N. (2015). Teachers’ knowledge about informal line of best fit. *Statistics Education Research Journal, 14*(1), 8–35.

So let’s take a look at another problem. (Note: this problem was inspired by conversations with a colleague, Bill Zahner.) Let’s assume that we are going to construct an isosceles triangle. The base length is 10 *in*. The length of the other two (equal) sides will be a randomly chosen real number between 5 and 10 *in* (note: a side length greater than 5 *in* guarantees forming a triangle, and anything under 10 *in* makes it non-equilateral). What is the probability that the resultant triangle is acute? obtuse? right? The figure below provides some insight – where the third vertex falls along the perpendicular bisector of the base determines the classification of the triangle.

So, using geometric probability, we could determine the likelihood of forming an obtuse triangle by computing a ratio of lengths: namely, the length up to the semicircle, which is 5 *in*, divided by the entire length, which is 5√3 *in*. So P(obtuse) = 5/(5√3) ≈ 0.577. Similarly, P(acute) = (5√3 – 5)/(5√3) ≈ 0.423, and P(right) = 0. Dealing with the probability of forming a right triangle being zero, despite being possible, is difficult enough; but dealing with the fact that long-term simulations of the problem put the probability of being obtuse as *less* than the probability of being acute – indicating that these probabilities are, in fact, incorrect – takes additional insight into the underlying assumptions of geometric probabilities.
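A long-term simulation of the sort mentioned above can be sketched as follows: choose the equal side length uniformly in (5, 10), compute the height of the third vertex from the Pythagorean theorem, and classify the triangle by whether that vertex falls inside the semicircle of radius 5:

```python
# Monte Carlo sketch of the isosceles-triangle problem.
# Apex height h = sqrt(side^2 - 5^2); the apex angle is right exactly when
# h = 5 (on the semicircle), obtuse when h < 5, acute when h > 5.
import math
import random

random.seed(2)
trials = 200_000
obtuse = acute = 0
for _ in range(trials):
    side = random.uniform(5, 10)            # equal side length, in inches
    height = math.sqrt(side**2 - 5**2)      # apex height above the base
    if height < 5:                          # inside the semicircle -> obtuse
        obtuse += 1
    else:
        acute += 1

print(obtuse / trials, acute / trials)  # obtuse comes out *less* likely than acute
```

The simulated frequencies land near 0.414 and 0.586 – the opposite ordering from the geometric-ratio answer of 0.577 and 0.423.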

Using geometric probability *assumes* that every point in the space is equally likely – in other words, that the distribution of outcomes is uniform. In this case, the “space” is the line segment (along the perpendicular bisector) for where the third vertex of the triangle could be. Although it is possible for the third vertex to fall anywhere along this line, how these points fall on that line is, in fact, not uniformly distributed. The video below models the situation in motion, indicating that the third vertex being near the base is much less likely than other places – in other words, the possible vertex points are not uniformly distributed. Using geometric measurements to determine the probability is inappropriate.

http://vimeo.com/105967203 (password: mathematicalmusings)

So, how do you determine the real probability? There are a few ways. One uses the uniform distribution of the side lengths between 5 and 10 *in* directly; another is to determine the probability density function for the height (*x*) of the third vertex – which we know is not uniform – namely, *x*/(5√(*x*^2+25)) for 0 < *x* < 5√3. The image below shows the plot of the actual density function (also comparing it to a uniform distribution) – the unlikelihood of the third vertex being near the base (x ≈ 0) is evident. Determining probabilities from a density function amounts to computing the area under the curve, i.e., integrating. As it turns out, the actual probabilities are close to swapped: P(obtuse) ≈ 0.414 and P(acute) ≈ 0.586. (P(right) is still zero.)
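Since the density *x*/(5√(*x*^2+25)) has the elementary antiderivative √(*x*^2+25)/5, the probabilities above can be verified in a few lines (a sketch of the integration, not code from the original post):

```python
# Integrate the apex-height density f(x) = x / (5 * sqrt(x^2 + 25)) on [0, 5*sqrt(3)]
# using its closed-form antiderivative F(x) = sqrt(x^2 + 25) / 5.
import math

def F(x):
    """Antiderivative of the density for the apex height."""
    return math.sqrt(x**2 + 25) / 5

p_obtuse = F(5) - F(0)                # = sqrt(2) - 1 ≈ 0.414 (height below 5)
p_acute = F(5 * math.sqrt(3)) - F(5)  # = 2 - sqrt(2) ≈ 0.586
print(p_obtuse, p_acute)              # the two probabilities sum to 1
```

So the exact values are P(obtuse) = √2 – 1 and P(acute) = 2 – √2, matching the simulation and confirming that the geometric ratio 5/(5√3) was the wrong model.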

So what do we make of this? One thing to understand is that using geometric measurements to determine probabilities has one *major* assumption: that points are uniformly distributed in the space. When this assumption is not met, using a simple ratio of geometric measurements to determine probability is inappropriate.

Looking back at our original two examples, one begins to wonder whether the probabilities we computed would, in fact, be justified…
