The formula to compute the standard deviation from a sample of data looks complex – ugly, even. But understanding the notion of variation (e.g., standard deviation) is fundamental to statistical thinking.

At its core, the computation for sample standard deviation is very sensible (like so many other complex-appearing mathematical formulas, e.g., the distance formula). Essentially, the formula computes a typical deviation – on “average,” how far are the data from the mean? That is the reason for computing the difference from each data point to the mean. Squaring (and later taking the square root) is important because, without it, the sum of all the deviations (which include positive and negative values) would always be zero – a useless calculation. [Absolute values can be used instead of squares – i.e., mean absolute deviation – though squaring is often preferred because, in contrast to absolute value, squaring is a smooth function with nice derivatives.]
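To make this concrete, a quick sketch in Python (with made-up data) shows the cancellation and the fix:

```python
# Hypothetical data, purely for illustration.
data = [4, 8, 6, 5, 3]
mean = sum(data) / len(data)

# The signed deviations from the mean always cancel out...
deviations = [x - mean for x in data]
print(sum(deviations))  # essentially 0, up to floating-point noise

# ...so the formula squares them first (here dividing by n - 1, as for a
# sample), then takes a square root to return to the data's units.
variance = sum(d ** 2 for d in deviations) / (len(data) - 1)
std_dev = variance ** 0.5
print(std_dev)
```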

Yet, if the goal is an “average” deviation, why would you divide by *n-1* (and not *n*)? When you compute the mean, or average, you sum all the values and divide by *n*, so why does the sample standard deviation divide by *n-1*? This is an important question – one which mathematics and statistics educators need to be ready to answer.

Some might say something to the effect of: well, it has to do with the “degrees of freedom” – there are *n-1* degrees of freedom, so we divide by *n-1*. This is, to some degree, explanatory. But for the most part, this response feels like a smoke screen – it masks the explanation with sophisticated jargon to cover up the fact that, well, it’s complicated. It’s a torero holding out a muleta as though the real deal were just behind it. Others might say that dividing by *n-1* results in a slightly larger value than dividing by *n*, which provides a little bit of “wiggle room” for describing a typical deviation (e.g., data are unpredictable, so we need some buffer built into our statistics). Again, partly true. But then why not *n-2*? Might that be even better? Or why not *n-1.5* (which is, in fact, at times better)? The real truth about variance and standard deviation comes from a fundamental distinction between *descriptive* and *inferential* statistics.

*Descriptive statistics* attempt to describe a dataset. Mean, median, and mode can each be regarded as a typical value – a measure of central tendency. In this sense, the notion of standard deviation acts as a description of how spread out (varied) the data are from the mean. It’s a measure of spread – and one in which the units are the same as the data. With data that have a normal distribution, “fatter” distributions have more variance and thus a larger standard deviation, and “skinnier” distributions have less variance and thus a smaller standard deviation. It is a measure of a *typical* deviation from the mean, and as such plays a descriptive role – describing a feature (spread) of the dataset’s distribution. But if this is the case, why don’t we divide by *n*? Well, in fact, we might – if the data comprised an entire population, and not just a sample.

*Inferential statistics* attempt to infer information from a smaller sample to an entire population. They are a “best guess.” And this, in fact, is the most common explanation for why the standard deviation formula divides by *n-1*. The value has less to do with describing the current dataset and more to do with inferring something about the population’s dataset. If you really only want to describe the current dataset, then a true “average” deviation (dividing by *n*) would likely be better. However, dividing by *n-1* instead of *n* gives very similar numbers (many times within hundredths of each other) – so much so that some argue for just dividing by *n* in most cases – and so the corrected sample standard deviation (dividing by *n-1*) also gives a near-descriptive statistic for a dataset. But in fact, its main purpose is inferential.

Based on the arithmetic of expected values, the square of the sample standard deviation, *s*^2 (and not the square of the true “average” deviation), is an unbiased estimator for the population parameter, variance. In general, standard deviation is referred to more often than variance – because it is simpler to grasp conceptually (i.e., same units as the data) – but its calculation derives primarily from the fact that computing variance in this way (dividing by *n-1*) gives an unbiased estimate of the population’s variance. (The same is not quite true for standard deviation itself; the corrected sample standard deviation (dividing by *n-1*) gives a *better* estimate of the population parameter than the uncorrected version (dividing by *n*), though not a completely unbiased one.) Briefly, we note that there are times when other estimators may be preferable; the maximum likelihood estimator for variance, for example, uses division by *n* and has a lower mean squared error.
The primary point, however, is that the statistics we use are often selected for their inferential ability to estimate, not just their descriptive power.
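A small simulation can make the bias visible. In the Python sketch below (the population, sample size, and trial count are all arbitrary choices for illustration), samples of size 5 are drawn repeatedly from a population whose variance is known exactly, and the two candidate formulas are averaged:

```python
import random

random.seed(0)

def variance(sample, ddof):
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - ddof)

# Population: the uniform distribution on {0, ..., 9}, whose variance
# is (10**2 - 1) / 12 = 8.25.
n, trials = 5, 100_000
divide_by_n = divide_by_n_minus_1 = 0.0
for _ in range(trials):
    s = [random.randrange(10) for _ in range(n)]
    divide_by_n += variance(s, ddof=0)
    divide_by_n_minus_1 += variance(s, ddof=1)

# Dividing by n systematically underestimates the population variance;
# dividing by n - 1 lands near the true value of 8.25.
print(divide_by_n / trials, divide_by_n_minus_1 / trials)
```

Dividing by *n* averages out to roughly (n-1)/n of the true variance – about 6.6 here – which is exactly the bias the *n-1* correction removes.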

Sample standard deviation is prevalent in statistics for a variety of reasons. Firstly, it is rare that one ever has an entire population’s data; it is much more common to have a sample. Secondly, standard deviation is linked to one of the fundamental theorems in probability and statistics: the Central Limit Theorem (CLT). The CLT indicates that regardless of the underlying distribution of a population’s data, the distribution of the mean of *n*-sized (random) samples – with *n* sufficiently large, roughly greater than 30, and the observations independent and identically distributed with mean *mu* and variance *sigma*^2 – will be approximately normal, N(*mu*, *sigma*^2/*n*). Given that the square of the sample standard deviation, *s*^2, is an unbiased estimator of *sigma*^2 (variance), and as a result of the CLT, the computation *s*/sqrt(*n*) is frequently used to provide confidence intervals for the true mean of a population. It is fairly incredible that from a single sample of, say, 100 people, we can provide, with relatively high confidence (most frequently, 95% is used), a range that contains the true population mean.
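As a rough check of that claim, the Python sketch below uses an assumed, deliberately skewed population (exponential with mean 1) and arbitrary sample and trial counts; it builds the interval m ± 1.96·*s*/sqrt(*n*) from each sample and counts how often it captures the true mean:

```python
import math
import random

random.seed(1)
true_mean, n, trials = 1.0, 100, 5_000
covered = 0
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    half_width = 1.96 * s / math.sqrt(n)
    covered += (m - half_width <= true_mean <= m + half_width)

# Despite the skewed population, coverage comes out close to the
# nominal 95% (slightly under, since n = 100 is finite).
print(covered / trials)
```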

In closing, it is important to recognize that statistics are not always meant to be descriptive; often they serve a primarily inferential role. But this distinction is not always intuitive, for students or for teachers (e.g., Casey & Wasserman, 2015), and it must be made clearer. Although the relationships between standard deviation, variance, bias, and estimators are nuanced, the broader idea that statistics are computed to carry inferential meanings, and not just descriptive ones, is critical. Such key understandings must guide our instruction. Otherwise we, as educators, risk providing students with smoke screens as a substitute for real reasoning and understanding.

Reference: Casey, S., & Wasserman, N. (2015). Teachers’ knowledge about informal line of best fit. *Statistics Education Research Journal, 14*(1), pp. 8-35.

For school mathematics teachers: Is the fact that 0.9999… = 1 important to know? Is it important to be able to prove it? Does it come up in classrooms? Do students care? If it is important to know, why? What is important to know about it, if anything at all? As a professor and teacher educator interested in teachers’ knowledge – particularly in the ways that more advanced mathematics becomes productive for teachers – I wrestle with these kinds of questions regularly. The proof that 0.9999… = 1 can come in a variety of forms, but it draws on the notions of infinity and limits. It can be proved through arguments from analysis about convergent series (the infinite geometric series with terms 9(1/10)^k), algebraic techniques (e.g., let x = 0.9999…, then 10x = 9.9999…, etc.), or by computational arguments (e.g., 1/3 = 0.3333…, multiply both sides by 3…). But does this constitute important knowledge for teachers? And if so, why?
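The algebraic technique, written out step by step (with the series argument from analysis alongside):

```latex
\begin{align*}
\text{Let } x &= 0.9999\ldots\\
10x &= 9.9999\ldots\\
10x - x &= 9.9999\ldots - 0.9999\ldots = 9\\
9x &= 9 \quad\Longrightarrow\quad x = 1.
\end{align*}

% The analysis argument sums the geometric series directly:
\[
0.9999\ldots \;=\; \sum_{k=1}^{\infty} 9\left(\frac{1}{10}\right)^{k}
\;=\; 9 \cdot \frac{1/10}{1 - 1/10} \;=\; 1.
\]
```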

Simon’s (2006) notion of a Key Developmental Understanding (KDU) has become one way that I have begun to consider such questions. Simon describes a KDU as “a change in a student’s ability to think about and/or perceive particular mathematical relationships” (p. 362). In this case, the “students” being referred to are teachers. Simon emphasizes that KDUs are not *missing* pieces of information, but rather key understandings that foster one’s ability to think about and perceive mathematical ideas and relationships. They represent ontological shifts and transformations in teachers’ available assimilatory structures – mathematical ideas that have been re-understood, re-organized, re-structured, etc. For my own interests specifically: in what ways can mathematics that is not in a local neighborhood of the content a teacher teaches influence their understanding of and/or perceptions about the content they teach?

So what key understanding is gained from knowing that 0.9999… = 1? To me, one of the important understandings is about the structure of the real numbers. One of the reasons that students (and teachers) like decimals – as opposed to fractions – is that each number always produces the same decimal expansion. In other words, a conceptually difficult issue with fractions – that 1/4 = 2/8 = 3/12 = … (an infinite number of equivalent representations) – seems to get resolved with decimals. Type 1/4, 2/8, 3/12, etc. into a calculator and they all produce 0.25. So decimals seem easier or more consistent in some ways. But does the issue really get resolved? In fact, using decimals, which necessitates infinite decimal expansions (e.g., 1/3 = 0.3333…), comes with its own set of conceptually difficult issues. Namely, if we agree to infinite decimal expansions, we have to deal with infinity. Which means we have to grapple with the odd conclusion that 0.9999… is, in fact, equal to 1. And so, in reality, decimals have equivalence classes in the same way that fractions do. And not just the seemingly trivial ones, such as 0.25 = 0.2500; additionally, 0.25 must also equal 0.249999… . Indeed, any terminating decimal will have such an equivalence class; repeating decimals and irrational numbers will not. For me, this is perhaps one of the fundamental understandings that comes from knowing 0.9999… = 1: that the set of real numbers has a structure with both similarities to and differences from the set of fractions. Decimal representations, like fractions, have equivalence classes, albeit different in nature from those for the set of fractions. And this is something to be grappled with. Although decimals have some advantages in terms of understanding the relative size of fractions, they, too, come with conclusions that we must accept if we are to truly understand them.
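The claim that 0.249999… = 0.25 can be checked with exact arithmetic. This Python sketch (the stopping point of seven 9s is an arbitrary choice) computes partial sums and watches the gap to 1/4 shrink:

```python
from fractions import Fraction

total = Fraction(24, 100)            # 0.24
for k in range(3, 10):               # append 9s in places 10^-3 ... 10^-9
    total += Fraction(9, 10 ** k)
    print(total, "gap to 1/4:", Fraction(1, 4) - total)

# After seven 9s the remaining gap is exactly 1/10^9; it shrinks by a
# factor of 10 with each digit and vanishes in the limit -- which is
# precisely what 0.249999... = 0.25 asserts.
```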

Reference: Simon, M. A. (2006). Key developmental understandings in mathematics: A direction for investigating and establishing learning goals. *Mathematical Thinking and Learning, 8*(4), pp. 359-371.

**“What you do to the top, you do to the bottom.”** This mantra is often repeated to help students generate equivalent fractions (something they notoriously have difficulty with). Unfortunately, the adage, while true for multiplication, does not work with addition – which can be a source of confusion.

**“Just add a zero when you multiply by 10.”** Indeed, multiplying by ten in our base-ten system is frequently easier than multiplying by other numbers. However, while it is often true, the result of multiplying by 10 is not always simply appending a 0 to the end of a number – 3.2 x 10, for example, is 32.

**“Multiplying makes bigger.”** In elementary mathematics, with natural numbers, this is often the case (not with 0 or 1, though). However, this notion may make students’ future work with multiplication of fractions more difficult, since the idea does not necessarily hold there. (Example from McCrory et al., 2012.)

**“You can’t subtract a larger number from a smaller one.”** Within the whole numbers, this is true – and it can be a useful guide for young students. But once students encounter the integers, subtracting a larger number from a smaller one is both possible and routine (e.g., 3 - 5 = -2), and the earlier mantra can become an obstacle.

**“Anything to the zero power is 1.”** When students first have to expand their understanding of exponents to broader number sets – in particular, to exponents that don’t make intuitive sense as “repeated multiplication” – there are many ways teachers try to help students learn these ideas. Why we define 5^0 as 1 takes some genuine work. And while any nonzero number to the zero power is 1, both 0^0 and ∞^0 are indeterminate forms, with real implications for developing calculus ideas from limits.
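One way to see why 0^0 is indeterminate is numerically: two limits that both “look like” 0^0 head toward different values. A quick Python sketch:

```python
# x**x tends to 1 as x -> 0 from the right, while 0**x stays at 0 --
# so the form "0 to the 0" has no single forced value.
for x in (0.1, 0.01, 0.001, 0.0001):
    print(x, x ** x, 0 ** x)
```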

**“Perimeter is just the sum of all the sides.”** This idea works really well with polygons. However, for circles it makes no sense. Perimeter is the distance around a two-dimensional object, which may or may not be composed of straight sides. In fact, this may be part of the difficulty transitioning students to understanding how we calculate the circumference of a circle – the relationship is a multiplicative one (a comparison), not an additive one, where one can find the total from summing smaller lengths.

**“There are half as many even numbers as whole numbers.”** While it is true that half of the whole numbers are even and half are odd, comparing the relative sizes of infinite sets is less obvious. In fact, based on bijective mappings, what we find is that the set of even numbers has the same cardinality as the set of whole numbers. It even has the same cardinality as the set of integers, and as the set of rational numbers.
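The bijection in question is simply n ↔ 2n; a short Python sketch over a finite window of the whole numbers illustrates the pairing:

```python
# Every whole number n pairs with exactly one even number 2n, and every
# even number m pairs back with m // 2 -- nothing is left over on either
# side, which is what "same cardinality" means.
window = range(10)                   # a finite window onto the whole numbers
pairs = [(n, 2 * n) for n in window]
print(pairs)
assert all(m // 2 == n for n, m in pairs)
```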

**“If it fails the vertical line test, it’s not a function.”** This is certainly true for graphs of functions on a Cartesian coordinate system. However, move to polar coordinates and functions start having very interesting shapes (lemniscates, cardioids, limaçons, rose curves), very few of which pass the “vertical line test.” Functions have to do with every input having a unique output – on a Cartesian coordinate system, your inputs are points on a number line (not, say, angles), which is what makes the vertical line test useful.
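To illustrate, the rose curve r = cos(2θ) is a perfectly good function of θ, yet its Cartesian trace fails the vertical line test. A Python sketch (sampling whole-degree angles and rounding coordinates, both arbitrary choices):

```python
import math

points = set()
for deg in range(360):
    theta = math.radians(deg)
    r = math.cos(2 * theta)   # r is a function of theta: one output per input
    points.add((round(r * math.cos(theta), 3), round(r * math.sin(theta), 3)))

# Several distinct points on the curve share the x-coordinate 0, so a
# vertical line there crosses the graph more than once.
ys_at_zero = {y for x, y in points if x == 0.0}
print(len(ys_at_zero) > 1)
```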

These are just some examples – probably only the tip of the iceberg. The larger point is that part of the mathematical work of teaching involves knowing the ways that the ideas we talk about as teachers get complicated in further developments, so that we pay proper attention to the details of how we describe and conceptualize them for students – making the overarching idea explicit rather than over-relying on mnemonic devices. If you have other examples to share, post a comment!

Reference: McCrory, Floden, Ferrini-Mundy, Reckase, & Senk (2012). Knowledge of algebra for teaching: A framework of knowledge and practices. *Journal for Research in Mathematics Education, 43*(5), pp. 584-615.
