Overview

Statistics in General

The origin of mathematics, and of statistics, is measuring and counting things. A later development - some time after those origins - is the exploration of relationships between measurements (shoe size and age are obviously related to one another in some way, as are the dose of a medication and its healing effect).

But measurements fluctuate. Suppose you want to measure how tall you are. You back up to the wall and try to wedge the bottom of the measuring tape against the wall with your heel. Then you put your hand on your head, try to mark the spot on the wall, and measure from the floor to the spot. Obviously, there are other ways, but whatever way you choose, if you do it two or three times, you are likely to get two or three different values. They will be close to one another, but different. So, what is the "real" value? How do you decide? Once you have settled on one measurement, how good is it? How much does it fluctuate?
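As a first taste of how such questions are handled, here is a minimal sketch in Python. The three height measurements are invented, and summarising them by their mean and standard deviation is just one common convention, not the only possible answer.

    # A minimal sketch with invented data: three repeated height measurements.
    # The mean is one reasonable "settled" value; the sample standard deviation
    # gives a rough idea of how much the measurements fluctuate.
    import statistics

    measurements = [172.8, 173.1, 172.6]  # made-up repeated measurements, in cm

    mean_height = statistics.mean(measurements)
    fluctuation = statistics.stdev(measurements)

    print(f"settled value (mean): {mean_height:.2f} cm")
    print(f"typical fluctuation (standard deviation): {fluctuation:.2f} cm")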

Suppose I invented some method of guessing your age. If I told you I could estimate your age to within 50 years, you would not be impressed; my method would not be much better than a random guess. If I said within 5 days, you would be impressed: my method would clearly be better than just guessing. It would be more credible.

Statistics tries to separate the "true" value from the random fluctuations, and to evaluate the "goodness", or accuracy, of the measurement. If I am using statistical testing (much more about this on other pages of this website), and the test fails, the conclusion is that my observations are not distinguishable from random fluctuations (to within a reasonable degree of credibility).
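To make "statistical testing" slightly less abstract, here is a sketch in Python using scipy's one-sample t-test. The measurements and the hypothesised value of 170 cm are invented, and the t-test is just one common choice of test, not the only one.

    # A sketch of a statistical test, with invented data: are these measurements
    # distinguishable from a hypothesised "true" height of 170 cm?
    from scipy import stats

    measurements = [172.8, 173.1, 172.6, 172.9, 173.0]  # made-up data, in cm

    t_stat, p_value = stats.ttest_1samp(measurements, popmean=170.0)

    # A large p-value says the data are not distinguishable from random
    # fluctuation around 170; a small one suggests a genuine difference.
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")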

There you have it in a nutshell. It is not more sophisticated than that, although the formulas become impressive. The best way to proceed is to re-read this section and meditate on it. It will pay off.

Variables and Methods

We classify all the measurements into two types: "continuous" and "discrete". There are, perhaps, more theoretical/philosophical questions about whether there are other types, but you can live a long and happy life without participating in the discussion.

Continuous Variables

Continuous variables are those like height, temperature, speed, etc. - numbers that can be expressed as decimals (even infinitely long ones, like pi, one-third, and such). Further categories of continuous variables are possible, like ones that are only positive, or ones that only lie in a particular region. Take blood pressure, say the "systolic" one (the first number - the "120", when you give yours as "120 over 80"). For a living person, this number has to be positive and above a certain value.

Discrete Variables

There are also several types of discrete variables. In general, they are measurements that can only take on a certain (small) number of values. "Binary" measurements have only two values, such as "gender" (male, female). "Ordered categorical variables" are measurements whose values can be given an order: for example, we can give a medication to a group of patients and then record the result as "worsening", "no change" or "improvement".

Some categorical variables have no objective ordering: brands of cars, for example. As the number of categories gets large, the distinction between continuous and discrete becomes more blurred, but this does not present problems in practice.
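As a concrete illustration of these types, here is a sketch in Python. The use of pandas and all the column names and values are assumptions made purely for the example.

    # A sketch of the variable types discussed above; pandas is used purely
    # for illustration, and all names and values are invented.
    import pandas as pd

    df = pd.DataFrame({
        # continuous: any decimal value in some range
        "systolic_bp": [118.0, 132.5, 121.0],
        # binary: exactly two possible values
        "gender": ["male", "female", "female"],
        # ordered categorical: the categories have a natural order
        "outcome": pd.Categorical(
            ["improvement", "no change", "worsening"],
            categories=["worsening", "no change", "improvement"],
            ordered=True,
        ),
        # unordered categorical: no objective way to order the categories
        "car_brand": pd.Categorical(["Ford", "Toyota", "Fiat"]),
    })

    print(df.dtypes)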

Other characteristics of variables (both continuous and discrete)

Bounded variables -- above and below, above only, below only: percentages of tax are bounded below by 0 and above by 100%;

Unbounded variables -- above, below and both: salaries are bounded below by 0, but are unbounded above;

Complete observations (no missing data)

Structurally missing data: a good example of this phenomenon is a study of survival. Let's say we treat cancer patients and see how long they survive. The measure of interest is the time of survival, or time until death. If we are lucky, there will be many survivors at the end of the study, who obviously will not have a time of death. Their time of survival is missing, but we know that it is after the date of our last information (a small sketch of such data follows just below).

Data missing at random
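Here is the sketch promised above. One common way of recording structurally missing survival times is a follow-up time plus an indicator of whether the death was actually observed; the numbers below are invented.

    # A sketch of structurally missing ("censored") survival data, with
    # invented numbers: patients still alive at the end of the study have no
    # time of death, only the last time at which they were known to be alive.
    records = [
        {"patient": 1, "followup_months": 14.0, "died": True},   # death observed
        {"patient": 2, "followup_months": 36.0, "died": False},  # still alive: censored
        {"patient": 3, "followup_months": 22.5, "died": False},  # still alive: censored
    ]

    for r in records:
        status = "death observed at" if r["died"] else "alive at last contact after"
        print(f"patient {r['patient']}: {status} {r['followup_months']} months")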

Sampling, Independence and Repeated Measures

Suppose you wanted to measure, or estimate, the percentage of Brits who want a new referendum on Brexit. We could try to ask all registered voters, but the effort would be too long and costly to be practical. Thus, we would look for a sample of voters, ask them, and report the percentage of our sample. This pragmatic approach brings up many questions: How do we select the people in our sample? How many people do I have to ask? Obviously, if I ask three people, my estimate will be 0, 1/3, 2/3 or 1, depending on the responses of the people in the sample. If I want an accuracy of 0.03, I can only get that close to one of those 4 possibilities, so I would most likely want to ask more people in order to cover the whole gamut of possible estimates between 0% and 100% (or fractions from 0 to 1).
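Here is a sketch of the usual back-of-the-envelope calculation for how many people to ask in order to estimate a proportion to within a given margin of error. The 95% confidence level (z = 1.96) and the worst-case proportion of 0.5 are conventional assumptions, not anything dictated by this particular example.

    # A sketch of the standard sample-size calculation for estimating a
    # proportion to within a margin of error E, at roughly 95% confidence.
    # p = 0.5 is the worst case (the estimate fluctuates the most there).
    import math

    def sample_size(margin_of_error: float, p: float = 0.5, z: float = 1.96) -> int:
        """Smallest n with z * sqrt(p * (1 - p) / n) <= margin_of_error."""
        return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

    print(sample_size(0.03))  # about 1068 respondents for +/- 3 percentage points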

Once I know how many people I want in my sample, how do I draw the sample from my target population so as to have a representative sample (i.e. the fraction of "yes" responders in the sample is close to that of the whole voting population)?
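The simplest answer is simple random sampling, where every voter has the same chance of being selected. Here is a sketch in Python; the registry of voter ID numbers is made up, and the sample size comes from the calculation above.

    # A sketch of simple random sampling: every voter in the (hypothetical)
    # registry has the same chance of ending up in the sample.
    import random

    random.seed(42)                           # only so the sketch is reproducible
    registry = range(1, 50_000_001)           # hypothetical voter ID numbers
    sample = random.sample(registry, k=1068)  # each voter equally likely to be drawn

    print(sample[:5])  # a few of the selected IDs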

Finally, I would most likely want my observations to be independent of one another. Clearly, if I wanted to know about changes in attitudes, I would have to ask the same people repeatedly in order to see how their opinions have changed with time. We will first look at independent observations, since these methods are the most elementary and historically the first. Then we will consider correlated data (including repeated measures).


Please come back to this site soon. And thank you for your comments and suggestions:

Contact me at: dtudor@germinalknowledge.com

© Germinal Knowledge, All rights reserved.