The formula to compute the standard deviation from a sample of data looks complex – ugly even. But understanding the notion of variation (e.g., standard deviation) is fundamental to statistical thinking.

At its core, the computation for sample standard deviation is very sensible (like so many other complex-appearing mathematical formulas, e.g., the distance formula). Essentially, the formula is computing a typical deviation – on “average,” how far the data are from the mean. That is why the formula computes the difference from each data point to the mean. Squaring and square rooting are important: without them, the sum of all the deviations (which include positive and negative values) would always be zero – a useless calculation. [Absolute values can be used instead of squares – i.e., mean absolute deviation – though squaring is often preferred because, in contrast to absolute value, squaring is a smooth function with nice derivatives.]
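To see concretely why the raw deviations are useless on their own, and how absolute values or squaring repair this, here is a minimal sketch with a small, made-up dataset (the numbers are only for illustration):

```python
import math

# A small illustrative dataset (hypothetical values); mean = 40/5 = 8.
data = [4, 7, 7, 9, 13]
mean = sum(data) / len(data)

# Raw deviations from the mean always sum to zero.
deviations = [x - mean for x in data]
print(sum(deviations))  # 0.0

# Mean absolute deviation: average the absolute deviations.
mad = sum(abs(d) for d in deviations) / len(data)

# Sample standard deviation: square, average with n-1, then square-root.
s = math.sqrt(sum(d**2 for d in deviations) / (len(data) - 1))
print(mad, s)
```

The zero sum is not an accident of this dataset – it follows from the definition of the mean, which is exactly why some repair (absolute values or squaring) is needed before averaging.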

Yet, if the goal is an “average” deviation, why would you divide by *n-1* (and not *n*)? When you compute the mean, or average, you sum all the values and divide by *n*, so why does the sample standard deviation divide by *n-1*? This is an important question – one that mathematics and statistics educators need to be ready to answer.

Some might say something to the effect of: well, it has to do with the “degrees of freedom” – there are *n-1* degrees of freedom, so we divide by *n-1*. This is, to some degree, explanatory. But for the most part, this response feels like a smoke screen – it masks the explanation with sophisticated jargon to cover up the fact that, well, it’s complicated. It’s a torero holding out a muleta as though the real deal were just behind it. Others might say that dividing by *n-1* results in a slightly larger value than dividing by *n*, which provides a little bit of “wiggle room” for describing a typical deviation (e.g., data are unpredictable, so we need some buffer built into our statistics). Again, partly true. But then why not *n-2*? Might that be even better? Or why not *n-1.5* (which is, in fact, at times better)? The real truth about variance and standard deviation comes from a fundamental distinction between *descriptive* and *inferential* statistics.

*Descriptive statistics* attempt to describe a dataset. Mean, median, and mode can each be regarded as a typical value – a measure of central tendency. In this sense, the notion of standard deviation acts as a description of how spread out (varied) the data are from the mean. It’s a measure of spread – and one in which the units are the same as the data. With data that have a normal distribution, “fatter” distributions have more variance and thus a larger standard deviation, and “skinnier” distributions have less variance and thus a smaller standard deviation. It is a measure of a *typical* deviation from the mean, and as such plays a descriptive role – describing a feature (spread) of the dataset’s distribution. But if this is the case, why don’t we divide by *n*? Well, in fact, we might – if the data comprised an entire population, and not just a sample.

*Inferential statistics* attempt to infer information from a smaller sample to an entire population. They are a “best guess.” And this, in fact, is the most common explanation for why the standard deviation formula divides by *n-1*: the value has less to do with describing the current dataset and more to do with inferring something about the population. If you really only want to describe the current dataset, then a true “average” deviation (dividing by *n*) would likely be better. However, dividing by *n-1* instead of *n* gives very similar numbers (often within hundredths of each other) – so much so that some argue for just dividing by *n* in most cases – and so the corrected sample standard deviation (dividing by *n-1*) also serves as a near-descriptive statistic for a dataset. But in fact, its main purpose is inferential. Based on the arithmetic of expected values, the square of the sample standard deviation, *s*^2 (and not the square of the true “average” deviation), is an unbiased estimator for the population parameter, variance. (Proof: Unbiased Estimator.) In general, standard deviation is referred to more often than variance – because it is simpler to grasp conceptually (i.e., same units as the data) – but its calculation derives primarily from the fact that computing variance in this way (dividing by *n-1*) gives an unbiased estimate of the population’s variance. (The same is not quite true for standard deviation itself: the corrected sample standard deviation (dividing by *n-1*) gives a *better* estimate of the population parameter than the uncorrected version (dividing by *n*), though not a completely unbiased one.) Briefly, we note that there are times when other estimators may be preferable; the maximum likelihood estimator for variance, for example, divides by *n* and has a lower mean squared error.
The primary point, however, is that the statistics we use are often selected for their inferential ability to estimate, not just their descriptive power.
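The unbiasedness claim above is easy to probe empirically. The following sketch (a simulation, not a proof, with an assumed standard-normal population whose true variance is 1) compares the long-run average of sample variances computed with the *n-1* divisor against those computed with *n*:

```python
import random
import statistics

random.seed(0)
# Population: standard normal, so the true variance is 1.
n, trials = 5, 20000
var_n1 = []  # dividing by n-1 (statistics.variance)
var_n = []   # dividing by n   (statistics.pvariance)
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    var_n1.append(statistics.variance(sample))
    var_n.append(statistics.pvariance(sample))

print(statistics.mean(var_n1))  # ≈ 1.0: unbiased
print(statistics.mean(var_n))   # ≈ 0.8, i.e., (n-1)/n of the truth: biased low
```

With the *n-1* divisor the average lands near the true variance; with the *n* divisor it systematically undershoots by a factor of (n-1)/n, which is exactly the bias the correction removes.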

Sample standard deviation is prevalent in statistics for a variety of reasons. Firstly, it is rare that one ever has an entire population’s data; it is much more common to have a sample. And secondly, standard deviation is linked to one of the fundamental theorems in probability and statistics: the Central Limit Theorem (CLT). The CLT indicates that, regardless of the underlying distribution of a population’s data, the distribution of the mean of *n*-sized (random) samples – with *n* sufficiently large (approximately greater than 30), and the observations independent and identically distributed with common mean *mu* and variance *sigma*^2 – will be approximately normal, N(*mu*, *sigma*^2/*n*). Given that the square of the sample standard deviation, *s*^2, is an unbiased estimator of *sigma*^2 (variance), and as a result of the CLT, the computation *s*/sqrt(*n*) is frequently used to provide confidence intervals for the true mean of a population. It is fairly incredible that from a single sample of, say, 100 people, we can provide with relatively high confidence (most frequently, 95% is used) a range that contains the true population mean.
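The confidence-interval idea above can be sketched in a few lines. Here the population is assumed exponential (a deliberately skewed choice, so the CLT is doing real work), and the interval x̄ ± 1.96·*s*/sqrt(*n*) should capture the true mean in roughly 95% of samples:

```python
import math
import random

random.seed(0)
# Assumed population: exponential with rate 1, so the true mean is 1.0.
true_mean, n, trials = 1.0, 100, 2000
covered = 0
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    half = 1.96 * s / math.sqrt(n)   # half-width of the 95% interval
    if xbar - half <= true_mean <= xbar + half:
        covered += 1

print(covered / trials)  # close to 0.95
```

Even though no single sample "knows" the true mean, about 95% of the intervals built this way contain it – the practical payoff of the CLT plus a good estimate of *sigma*.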

In closing, it is important to recognize that statistics are not always meant to be descriptive; some serve a primarily inferential role. But this difference is not always intuitive, for students or for teachers (e.g., Casey & Wasserman, 2015). This distinction must be made clearer. Although the relationships among standard deviation, variance, bias, and estimators are more nuanced, the broader idea that statistics are computed to have inferential meanings, and not just descriptive ones, is critical. And such key understandings must serve to guide our instruction. Otherwise we, as educators, risk providing students with smoke screens as a substitute for real reasoning and understanding.

Reference: Casey, S., & Wasserman, N. (2015). Teachers’ knowledge about informal line of best fit. *Statistics Education Research Journal, 14*(1), 8–35.

So let’s take a look at another problem. (Note: this problem was inspired by conversations with a colleague, Bill Zahner.) Let’s assume that we are going to construct an isosceles triangle. The base length is 10 *in*. The lengths of the other two sides will be a randomly chosen real number between 5 and 10 *in* (note, the side length being greater than 5 *in* guarantees forming a triangle, and anything under 10 *in* keeps it non-equilateral). What is the probability that the resultant triangle is acute? Obtuse? Right? The figure below provides some insight – where the apex falls along the perpendicular bisector of the base determines the classification of the triangle.

So, using geometric probability, we could determine the likelihood of forming an obtuse triangle by computing the ratio of lengths: namely, the length up to the semicircle, which is 5 *in*, divided by the entire length, which is 5√3. So P(obtuse) = 5/(5√3) ≈ 0.577. Similarly, P(acute) = (5√3 – 5)/(5√3) ≈ 0.423, and P(right) = 0. Dealing with the probability of forming a right triangle being zero, despite being possible, is difficult enough; but dealing with the fact that long-term simulations of the problem put the probability of being obtuse *below* the probability of being acute – indicating that these probabilities are, in fact, incorrect – takes additional insight into the underlying assumptions of geometric probabilities.
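The long-term simulations mentioned above can be sketched briefly. The only assumption is the one stated in the problem – the leg length is drawn uniformly between 5 and 10 – and the apex is obtuse exactly when its height puts it inside the semicircle of radius 5:

```python
import math
import random

random.seed(0)
trials = 100000
obtuse = 0
for _ in range(trials):
    leg = random.uniform(5, 10)    # the two equal sides; the base is 10
    x = math.sqrt(leg**2 - 25)     # apex height above the base midpoint
    if x < 5:                      # apex inside the semicircle of radius 5
        obtuse += 1

print(obtuse / trials)  # ≈ 0.414, not the geometric-probability 0.577
```

The simulated frequency sits near 0.414 – noticeably below the 0.577 the length-ratio argument predicts, which is the puzzle the next paragraphs resolve.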

Using geometric probability *assumes* that every point in the space is equally likely – in other words, that the distribution of outcomes is uniform. In this case, the “space” is the line segment (along the perpendicular bisector) for where the third vertex of the triangle could be. Although it is possible for the third vertex to fall anywhere along this line, how these points fall on that line is, in fact, not uniformly distributed. The video below models the situation in motion, indicating that the third vertex being near the base is much less likely than other places – in other words, the possible vertex points are not uniformly distributed. Using geometric measurements to determine the probability is inappropriate.

http://vimeo.com/105967203 (password: mathematicalmusings)

So, how do you determine the real probability? There are a few ways, one of which uses the fact that the side lengths themselves are uniformly distributed between 5 and 10 *in*; another is to determine the probability density function for the height (*x*) of the third vertex – which we know is not uniform – namely: *x*/(5√(*x*^2+25)) for 0 < *x* < 5√3. The image below shows the plot of the actual density function (also comparing it to a uniform distribution) – the unlikelihood of the third vertex being near the base (x ≈ 0) is evident. Determining probabilities from a density function amounts to computing the area under the curve, i.e., integrating. As it turns out, the actual probabilities are nearly swapped: P(obtuse) = √2 – 1 ≈ 0.414 and P(acute) = 2 – √2 ≈ 0.586. (P(right) is still zero.)
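The integration can also be checked numerically. A minimal sketch, applying a simple midpoint rule to the density given above:

```python
import math

# Density of the apex height x when the leg is uniform on (5, 10):
# f(x) = x / (5 * sqrt(x^2 + 25)), for 0 < x < 5*sqrt(3).
def f(x):
    return x / (5 * math.sqrt(x**2 + 25))

def integrate(a, b, steps=100000):
    # Midpoint rule: sum f at interval midpoints, times the step width.
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

p_obtuse = integrate(0, 5)                # apex height below 5
p_acute = integrate(5, 5 * math.sqrt(3))  # apex height above 5
print(p_obtuse, p_acute)  # ≈ 0.414 and ≈ 0.586, i.e., sqrt(2)-1 and 2-sqrt(2)
```

The two areas agree with the exact values √2 – 1 and 2 – √2, and they sum to 1 – consistent with P(right) = 0.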

So what do we make of this? One thing to understand is that using geometric measurements to determine probabilities has one *major* assumption: that points are uniformly distributed in the space. When this assumption is not met, using a simple ratio of geometric measurements to determine probability is inappropriate.

Looking back at our original two examples, one begins to wonder whether the probabilities we computed would, in fact, be justified…


Counting problems, while often simple to state, can span the spectrum of difficulty – from ridiculously easy to insanely complex. Often, however, one of the tools of the trade in counting problems is counting an “analogous” problem. This process, however, requires functions, and in particular, bijections: if there is a bijective function between two sets, then the two sets have the same cardinality (or size). A few thoughts about how understanding more abstract sets and classifications for functions can be important are discussed below.

1. The handshake problem. In a group of 10 people, if everyone shakes hands with everyone else, how many total handshakes are given?

This is a familiar problem. Having done this with high school students, one of the common ways of approaching it is to physically model the handshakes. Done in a systematic way, this often results in the series 9+8+7+…+3+2+1 = 45. This is a nice way to solve the problem, and it can also serve to introduce arithmetic series. However, in the context of combinatorics, one very powerful way is to model this problem differently – in particular, to create a bijective function mapping every “physical” handshake to an ordered pair. If we let the first 10 letters of the alphabet represent the 10 people in the room, then every handshake involves two people, and we can list the handshake between A and F as (A,F). (In particular, since we only want “one” handshake, we only count ordered pairs in alphabetical order, so that (A,F) is counted, but not (F,A).) We can verify that this is a bijective function by clarifying that every handshake maps to an ordered alphabetical pair, and every ordered alphabetical pair is mapped to from some handshake. Indeed, this is the case. The power in doing so is that we no longer have to think about the physical activity of shaking hands. By counting a different problem – which began by mapping the objects we wanted to count to a set of more countable objects: how many ordered alphabetical pairs are there with the first ten letters of the alphabet? – we can draw a conclusion about the total number of handshakes. Since there are 10C2 such ordered pairs, there must be 10C2 (or 10·9/2 = 45) handshakes.
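The counting above is easy to confirm with a quick sketch; Python's itertools generates exactly the alphabetical ordered pairs described:

```python
from itertools import combinations

people = "ABCDEFGHIJ"  # ten people, labeled as in the text
# Each handshake corresponds to exactly one alphabetical ordered pair.
handshakes = list(combinations(people, 2))
print(len(handshakes))    # 45
print(sum(range(1, 10)))  # 9+8+...+1 = 45, the same count
```

Both routes – the bijection to pairs and the arithmetic series – land on 45, as the argument predicts.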

2. How do we know that 7C2 is the same as 7C5 (without relying on the formulas)? Or, more generally, why is nCk = nC(n-k)?

Based on the formula for a combination, it is easy to conclude that these two quantities are the same size. However, it may be less obvious why there are the same number of ways to form a “pair” as a “group of 5” from a room of seven people. There are analogies that can help explain, but the analogies often make use of bijective functions. The set of “pairs” of people looks like: {(A,B), (A,C), (A,D), …, (F,G)}. The set of “groups of 5” looks like: {(A,B,C,D,E), (A,B,C,D,F), …, (C,D,E,F,G)}. How do we know there are the same number of elements in each set? The easiest way is to find a bijective function between these two sets of objects. Indeed, while there are many, perhaps the easiest to justify is to map (A,B) -> (C,D,E,F,G), (A,C) -> (B,D,E,F,G), etc., effectively mapping each “pair” to the “remaining people left.” Every pair formed has exactly 5 people not chosen, which means every “pair” maps to some element in the “group of 5” set; similarly, every “group of 5” is mapped to, because every “group of 5” has exactly 2 people not chosen, and that “pair” is precisely what maps to it. Thus, this mapping can be used to verify that the sets of size 7C2 and 7C5 have the same cardinality. (The analogy being that for every group of 2 selected, there are 5 not selected – thus they must be the same size sets.) The more general argument concluding nCk = nC(n-k) follows naturally.
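The complement mapping described above can be checked exhaustively for the seven-person case:

```python
from itertools import combinations

people = set("ABCDEFG")  # seven people
pairs = list(combinations(sorted(people), 2))
groups_of_5 = set(combinations(sorted(people), 5))

# Map each pair to the 5 people left out; check that the images land
# in, and completely cover, the set of groups of 5 with no collisions.
images = {tuple(sorted(people - set(p))) for p in pairs}
print(len(pairs), len(groups_of_5), images == groups_of_5)  # 21 21 True
```

Because the 21 images are distinct and fill out all 21 groups of 5, the complement map is a bijection – the computational mirror of the argument in the text.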

3. How many different subsets can be formed from a set of n elements?

One way of answering this problem is to say that each element can either be “in” or “out” of the subset, thereby creating 2^n subsets. Another argument, an inductive one, is to show that if there are 2^n subsets of a set with n elements, there will be 2^(n+1) subsets of a set with n+1 elements. Again, this involves a bijective mapping. Say we have all the subsets of the “n” elements listed – what does adding the element “n+1” do? Well, all of the subsets of the “n” elements are also subsets of this new set. Additionally, there are subsets containing the element “n+1” that we need to count. Considering this set, there is a bijective function between the subsets containing the element “n+1” and the (2^n) subsets of the “n” elements. Each subset containing “n+1” – for example, {1, 3, 5, 7, n+1} – can be mapped to the subset without the “n+1” element, in this case, to {1, 3, 5, 7}. In this way, {n+1} gets mapped to the empty set, and, since every other subset containing “n+1” consists of some of the “n” elements together with “n+1” itself, there is a bijection. The conclusion, of course, is that the cardinalities of these two sets (subsets of the “n” elements, and subsets including the element “n+1”) are the same: namely, they are both 2^n. Thus, there are 2^n + 2^n = 2^(n+1) subsets with “n+1” elements, which completes the inductive step of the proof.
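The inductive step above can be checked concretely for a small case (n = 4 here, an arbitrary choice for illustration):

```python
from itertools import combinations

def all_subsets(elems):
    # Every subset of elems, one size at a time.
    return [frozenset(c) for r in range(len(elems) + 1)
            for c in combinations(elems, r)]

n_set = {1, 2, 3, 4}   # the "n" elements, n = 4
big_set = n_set | {5}  # add the element "n+1"

subsets_n = all_subsets(n_set)
subsets_n1 = all_subsets(big_set)
with_new = [s for s in subsets_n1 if 5 in s]

# Removing the new element is a bijection onto the subsets of n_set.
stripped = {s - {5} for s in with_new}
print(len(subsets_n), len(with_new), stripped == set(subsets_n))
# 16 16 True, so 2^(n+1) = 16 + 16 = 32 = len(subsets_n1)
```

Stripping the new element from each of the 16 “with n+1” subsets yields exactly the 16 subsets of the original set, so the doubling 2^n + 2^n = 2^(n+1) falls out directly.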

4. How many numbers between 0 and 10,000 have digits that sum to 9?

This is a relatively difficult problem, and the easiest way to solve it is to approach it as a multi-choose (e.g., stars and bars) problem. But even this can be hard to follow and conceptualize. In fact, what is happening in this process is a bijection. The solution set actually consists of the numbers: {0009, 0090, …, 3105, …}. But how many are there? The stars and bars method in this case is actually “mapping” each of these numbers to a 9-letter object. In particular, if we let Th = thousands, H = hundreds, T = tens, and O = ones, then each of the solution elements can be created from a 9-letter string using these four letters (written in a fixed order by place value): ThThHHTTTOO maps to 2,232. Because each of these strings has 9 letters, the sum of the digits will equal 9 in every case; and since every number between 0 and 10,000 has a certain number (between 0 and 9) of Ths, Hs, Ts, and Os, there is a bijection between these two sets. Therefore, we can instead count the number of 9-letter strings from four “letters” (where letters can repeat and order does not matter – i.e., we only consider “one” of these strings, the one in the fixed place-value order), which amounts to 12C3 ways. The key to answering the original question lies in transforming it – modeling it, mapping it – to a collection of objects that is easier to count.
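Both counts above – the direct enumeration and the stars-and-bars value 12C3 – can be verified with a short sketch:

```python
from math import comb

# Direct count: numbers 0 through 9999 whose digits sum to 9.
direct = sum(1 for k in range(10000)
             if sum(int(d) for d in f"{k:04d}") == 9)

# Stars-and-bars count: 9 letters drawn from {Th, H, T, O} with
# repetition, order ignored -- i.e., multichoose(4, 9) = C(12, 3).
stars_and_bars = comb(12, 3)
print(direct, stars_and_bars)  # 220 220
```

Both routes give 220, confirming that the mapping to 9-letter strings counts exactly the right objects.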

5. Random variables in statistics

Lastly, another common application of functions comes from statistics. In fact, random variables *are* functions. In particular, they are functions that map the outcome set to the set of real numbers. For example, the outcome set for flipping a coin three times consists of the following elements – effectively the “data” from the experiment: {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. Yet often what is of interest is a random variable, something like, say, X = the number of heads. Accordingly, this function, X, maps each of the elements in the set to a number: 0, 1, 2, or 3 (i.e., HHH -> 3, whereas HTT -> 1). Indeed, understanding this relationship is critical to properly conceptualizing aspects of probability. In the original outcome set, each of the eight outcomes is equally likely, whereas because of the functional mapping process the random variable outcomes (0, 1, 2, 3) are not equally likely. Notably, this mapping is not a bijection, since 1, for example, is mapped to from three different elements (HTT, THT, TTH) – which is in fact the reason that getting 1 head is three times as likely as getting 0 heads.
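The mapping described above can be written out directly – the random variable really is a function (here, a lookup table) from outcomes to numbers:

```python
from itertools import product

# The eight equally likely outcomes of three coin flips.
outcomes = ["".join(p) for p in product("HT", repeat=3)]

# The random variable X is literally a function: outcome -> number of heads.
X = {o: o.count("H") for o in outcomes}

# Tally how many outcomes map to each value of X.
counts = {v: sum(1 for o in outcomes if X[o] == v) for v in range(4)}
print(counts)  # {0: 1, 1: 3, 2: 3, 3: 1} -- not equally likely
```

The non-injectivity is visible in the tallies: three outcomes collapse onto X = 1, which is exactly why 1 head is three times as likely as 0 heads.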

These examples are by no means the only ones. However, they do perhaps provide some real applications and uses for understanding sets and functions more generally (particularly bijective functions), and they provide a relatively natural context – counting problems – for becoming increasingly familiar with these concepts. Much of the time when we “solve a related” counting problem, we end up applying a bijective function – whether we are aware of it or not – as the means to help verify that we have counted correctly.

In order to communicate why these individual properties are important collectively, one possible activity that I have used is to elaborate on solving simple, single-step equations. In class, it is common to use a “single step” to solve a simple equation, for example, x+5=12; however, there are actually four assumptions being made about how the operation of addition works on the set of real numbers. These assumptions, collectively, are important for algebraic reasoning.

*x* + 5 = 12

(*x* + 5) + -5 = 12 + -5

*x* + (5 + -5) = 12 + -5 *(Associativity (of addition on R))*

*x* + 0 = 12 + -5 *(Inverse elements (of addition on R))*

*x* = 12 + -5 *(Identity element (of addition on R))*

*x* = 7 *(Closure (of addition on R))*

Without these assumptions – associativity, inverse elements, identity element, and closure – the algebraic solving process may not generalize. While in this case we make use of -5, in the more general case, it would have to be true that every element has an inverse element. Similarly, while the identity is 0 under addition, without an identity element the solving process would loop infinitely: the identity element is the key to transforming an unknown sum (x+5) into a known sum (x+0). Also, while in this case the sum of 12 and -5 is a real number, in the more general case, it would have to be true that the sum of any two elements produces another element of the set. (We note that commutativity is not required for a group, but is required for other important algebraic structures, such as a field.)
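To emphasize that these four properties are all a solver needs, here is a sketch of a generic solver that works from nothing but an operation on a finite set, discovering the identity and inverses for itself. The example group is an assumption chosen for illustration (addition mod 5), not something from the discussion above:

```python
# Solving x op a = b in any finite group, using exactly the four
# properties: closure, associativity, identity, and inverses.
# Example group (an illustrative assumption): Z5 under addition mod 5.
elements = list(range(5))

def op(a, b):
    return (a + b) % 5

# Identity: the element e with op(e, g) == g for every g.
identity = next(e for e in elements
                if all(op(e, g) == g for g in elements))

def inverse(g):
    # Inverse element: the h with op(g, h) == identity.
    return next(h for h in elements if op(g, h) == identity)

def solve(a, b):
    # x op a = b  =>  x = b op a^{-1}, by associativity and closure.
    return op(b, inverse(a))

x = solve(3, 1)          # solve x + 3 = 1 (mod 5)
print(x, op(x, 3))       # x = 3, and 3 + 3 = 6 = 1 (mod 5)
```

Nothing in `solve` mentions numbers or addition specifically – swap in any other group's elements and operation, and the same four-step reasoning goes through.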

Indeed, perhaps the more powerful illustration of these four properties working collectively as a foundation for algebra is to solve an equation on an unfamiliar set and operation. For example, what is the solution to the equation *X* ° *RX* = *R2*? (Based on the operation table below, e.g., *R1* ° *RY* = *RZ*.)

While searching the table for the solution is a valid approach to find the value for *X*, this more abstract case can be used to present the collective impact of these four axioms of a group. First, note that this table is analogous to the composition of the symmetries of a triangle, which indeed forms a group. (Another option is to verify each of the four properties directly from the table above.) The operation is closed on this set (as the composition of any two elements forms an element in the set); the identity element is *R0*; each element has a unique inverse element that produces *R0*; and composition is associative. Therefore, to find *X* algebraically, it is necessary only to apply each of these properties in turn, and compute the result of one composition (*R2* ° *RX*), to solve for *X*.

*X* ° *RX* = *R2*

(*X* ° *RX*) ° *RX* = *R2* ° *RX*

*X* ° (*RX* ° *RX*) = *R2* ° *RX* *(Associativity (of composition on the set of triangle symmetries))*

*X* ° *R0* = *R2* ° *RX* *(Inverse elements (of composition on the set of triangle symmetries))*

*X* = *R2* ° *RX* *(Identity element (of composition on the set of triangle symmetries))*

*X* = *RZ* *(Closure (of composition on the set of triangle symmetries))*

Indeed, as evident from the example above of solving a simple equation, it is not just the individual arithmetic properties – associativity, identity element, inverse elements, closure – that are meaningful in mathematics, but rather their collective importance as they become a necessary structure for algebra and algebraic reasoning. It is this basic structure of a group that turns the process of “guess and check” (the natural inclination for searching the table in the abstract example, as well as students’ tendency when first introduced to solving equations) into systematic reasoning, based on the collection of arithmetic properties.
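The abstract computation can also be verified by modeling the triangle symmetries as permutations of the vertices. The labels below are an assumption – the operation table is not reproduced here, so which reflection carries which name is a guess – but the algebraic steps check out regardless of labeling:

```python
# Triangle symmetries as permutations of vertices (0, 1, 2).
# Vertex labeling, and hence the names RX, RY, RZ, is assumed.
R0, R1, R2 = (0, 1, 2), (1, 2, 0), (2, 0, 1)  # rotations
RX, RY, RZ = (0, 2, 1), (2, 1, 0), (1, 0, 2)  # reflections

def compose(p, q):
    # (p composed with q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(3))

# Solve X ° RX = R2 algebraically: X = R2 ° RX^{-1} = R2 ° RX,
# since every reflection is its own inverse.
X = compose(R2, RX)
print(X, compose(X, RX) == R2)  # X is a reflection, and X ° RX = R2
```

The check `compose(X, RX) == R2` holds no matter how the reflections are named, which is the point: the four group properties, not the table lookup, produce the answer.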
