In this section, we'll use the survival experiment to illustrate that both qualitative and quantitative understanding are important, and to show how to appropriately use statistics at several levels. We'll analyze data taken from an earlier experiment.
Exposure Time in Seconds | Count Plate 1 | Count Plate 2 | Count Plate 3 | Concentration Factor |
---|---|---|---|---|
0 | 129 | 127 | 119 | 1 |
5 | 101 | 140 | 109 | 1 |
10 | 96 | 82 | 62 | 1 |
15 | 39 | 32 | 29 | 1 |
15 | 298 | 357 | 322 | 10 |
20 | 149 | 122 | 128 | 10 |
25 | 52 | 38 | 52 | 10 |
30 | 22 | 27 | 24 | 10 |
These data show how important the dilution strategy is. If we had worked with the same dilution throughout, then after 30 seconds we'd probably only have seen 0 or 1 colonies/plate and not see clearly that longer exposures have a greater effect. Alternatively, if we had used the last dilution for the 5 second exposure, then we'd have had to count more than 1,000 colonies!
On the other hand, using these differing dilutions can be confusing. The 300+ survivors at a 15 second dose should be compared to the 1,250+ survivors we would have gotten had we used the same dilution factor for the 0 second dose. Converting all the data to the same dilution factor simplifies the analysis. Here is the table we get if we convert all the data to a dilution factor of 10:
Exposure Time in Seconds | Count Plate 1 | Count Plate 2 | Count Plate 3 |
---|---|---|---|
0 | 1290 | 1270 | 1190 |
5 | 1010 | 1400 | 1090 |
10 | 960 | 820 | 620 |
15 | 390 | 320 | 290 |
15 | 298 | 357 | 322 |
20 | 149 | 122 | 128 |
25 | 52 | 38 | 52 |
30 | 22 | 27 | 24 |
We are justified in "massaging the data" in this way because otherwise we might
mislead people who haven't done the experiment, and who don't know or care
about the details of the dilution factors.
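The conversion itself is just multiplication by the ratio of dilution factors. Here is a minimal sketch (in Python rather than the spreadsheets of the day; the variable names are our own), with the raw counts transcribed from the first table:

```python
# Raw data: (exposure in seconds, [plate 1, plate 2, plate 3], concentration factor)
raw = [
    (0,  [129, 127, 119], 1),
    (5,  [101, 140, 109], 1),
    (10, [96, 82, 62],    1),
    (15, [39, 32, 29],    1),
    (15, [298, 357, 322], 10),
    (20, [149, 122, 128], 10),
    (25, [52, 38, 52],    10),
    (30, [22, 27, 24],    10),
]

TARGET = 10  # express every row at the dilution-10 concentration factor

# Scale each count by TARGET / factor so all rows are directly comparable
converted = [(t, [c * TARGET // f for c in counts]) for (t, counts, f) in raw]
for t, counts in converted:
    print(t, counts)
```

Rows already at factor 10 pass through unchanged; factor-1 rows are scaled up tenfold, reproducing the converted table above.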
To figure out what is going on, and to explain the results to other people,
plot the data on a graph. As an example, we have plotted the results for plate #2 in
Figure 1. The difference between 25 and 30 second exposures on this graph is
hard to see because the graph covers a big range: 1400 to 27. All the exposures
are easier to see if you use a semi-log graph. On a semi-log graph, the vertical
scale has the same distance between 1 and 10 as between 10 and 100, and between
100 and 1000. These same data are plotted, much more clearly, on a semi-log
graph in Figure 2. We can see that all the data are different, but cannot yet draw
any conclusion about how survival depends on dose. In fact, this graph suggests
that a little UV actually helps colonies grow. If we plot ALL the data from the
table on this graph however, we learn more, as we see in Figure 3.
Figure 1: Linear plot of the plate 2 data.
Figure 2: Semi-log plot of the plate 2 data.
Figure 3: Semi-log plot of all data.
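A semi-log plot is nothing more than plotting the base-10 logarithm of the counts against exposure time. A short Python sketch checks the equal-spacing property and shows the transformed plate #2 values (a plotting library such as matplotlib would draw Figure 2 directly with its `semilogy` function):

```python
import math

# On a semi-log scale, equal ratios get equal vertical distances:
# the decades 1-10, 10-100, and 100-1000 are all the same height.
for lo, hi in [(1, 10), (10, 100), (100, 1000)]:
    print(lo, "to", hi, "spans", math.log10(hi) - math.log10(lo), "decade")

# Plate 2 counts at the common dilution, and the log10 values Figure 2 plots
plate2 = [1270, 1400, 820, 320, 357, 122, 38, 27]
logs = [round(math.log10(c), 2) for c in plate2]
print(logs)
```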
Figure 3 is revealing. In the first place, survival more clearly depends on
dose. The low dose behavior is still uncertain, but the idea that a little UV helps
survival looks less likely. We need more data to study this question. For larger
doses, the log of the surviving colonies clearly decreases roughly linearly as the
dose increases. Note the smoothing effect of lots of data: each individual data
point represents some random fluctuation just like the dilutions do, and each point
also contains some experimental errors, but the errors and fluctuations push one
point one way, another point in the opposite direction, and so the collection of
points becomes much more useful than the individual ones.
First, we have identified an effect to be studied.
Second, we've described the result qualitatively.
Third, we've presented all our data clearly on a useful graph.
This plot can effectively demonstrate the results from an entire class. It is
actually a form of "statistics": we can roughly estimate the number of survivors at
each dose, note how much variability there is in this number and, using that
marvelous analytical engine, our brain, even fill in an approximate line through
the data. The next step is to make these rough insights quantitative, but the three
steps above are useful in themselves if you have limited time.
A Second Cut
Presenting all the data at each dose gives us too much information because
the experimental uncertainties in each point reported may obscure what is really
going on. Also, you don't want to report excess information to people who just
want to know how dangerous UV is to yeast cells. A useful tactic is to determine
a single number at each dose that tells us how many survived. Most classes doing
these experiments will decide to "use the average" at each dose for that single
number. If all students use the same procedure then this is a good strategy.
Returning to our "raw data," we produced this table of the average number of
colonies/plate versus dose:
Exposure Time in Seconds | Average Colonies per Plate | Concentration Factor |
---|---|---|
0 | 125 | 1 |
5 | 117 | 1 |
10 | 80 | 1 |
15 | 33 | 1 |
15 | 326 | 10 |
20 | 133 | 10 |
25 | 47 | 10 |
30 | 24 | 10 |
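Computing these averages takes one line per dose. A minimal Python sketch (the keys are (exposure, concentration factor) pairs of our choosing):

```python
from statistics import mean

# Raw plate counts keyed by (exposure seconds, concentration factor)
counts = {
    (0, 1):   [129, 127, 119],
    (5, 1):   [101, 140, 109],
    (10, 1):  [96, 82, 62],
    (15, 1):  [39, 32, 29],
    (15, 10): [298, 357, 322],
    (20, 10): [149, 122, 128],
    (25, 10): [52, 38, 52],
    (30, 10): [22, 27, 24],
}

# Average colonies/plate at each dose, rounded to whole colonies
averages = {dose: round(mean(plates)) for dose, plates in counts.items()}
print(averages)
```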
But what should we do if five runs at an exposure get 22, 25, 26, 31 and 196
cells? Should we just average these five and say the average is 60? Doesn't this
look a little misleading? Discussion will probably suggest that the odd result is
likely to be a wrong dilution, or some other procedural problem, and ought to be
left out. This may be a proper time to "massage the data" because we should not
report data that we have reason to believe are wrong. However, we should first re-check that exposure, keep the odd result in mind, and consider the possibility that
it reflects an interesting unanticipated effect.
To understand this better consider the results of the 5 second exposure in
our data. The points are 101, 109, and 140. The 140 is more than any of the 0
second plates, and 140 is really quite far from the 117 average. If we omitted this
number, the drop from 0 to 5 seconds would appear more like the other changes.
Should we drop it? Before deciding, we'll calculate the other quantity that we
report when using only averages instead of all the data: the "standard deviation."
The standard deviation measures how variable the data are. For example,
the standard deviation of the closely bunched 0 second data is just over 4, while
that for the 5 second data is 17. You can figure out the standard deviation using
the 5 second data as follows.
First, find the average: (101 + 109 + 140)/3 = 116.7 cells.
Next, find each point's deviation from that average: 101 - 116.7 = -15.7 cells, 109 - 116.7 = -7.7 cells, and 140 - 116.7 = 23.3 cells.
You square these deviations from the average to get a positive number that measures the deviation. The average of these squared numbers is
((-15.7)^2 + (-7.7)^2 + (23.3)^2)/3 = 283 cells^2
and is called the "variance" of the data. To get rid of the peculiar "cells squared," we take the square root: sqrt(283) = 17 cells, the standard deviation.
For practice, figure out the standard deviation of the 0 second data (the
answer is 4.3). Most spreadsheets have statistical functions, and you should
experiment with yours to learn how to ask your computer for the average and
standard deviation of a bunch of numbers. (Be forewarned that some programs
offer two types of standard deviation: the "population" form, which divides by the
number of points n as we did above, and the "sample" form, which divides by
n - 1. Only the population form matches our calculation here. For
reference, we used Lotus 1-2-3.)
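Python's standard library makes this distinction explicit: `statistics.pstdev` divides by n (the form used above), while `statistics.stdev` divides by n - 1. A quick check against our numbers:

```python
from statistics import mean, pstdev, stdev

five_sec = [101, 109, 140]
zero_sec = [129, 127, 119]

print(round(mean(five_sec), 1))    # average of the 5 second data: 116.7
print(round(pstdev(five_sec)))     # population SD (divide by n): 17
print(round(stdev(five_sec)))      # sample SD (divide by n - 1): noticeably larger
print(round(pstdev(zero_sec), 1))  # 0 second data: 4.3
```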
Now we can construct a table of both the average number of surviving cells
and the experimental variability in this number for each dose.
Exposure Time in Seconds | Average Colonies per Plate | Standard Deviation | Concentration Factor |
---|---|---|---|
0 | 125 | +/- 4.3 | 1 |
5 | 117 | +/- 17 | 1 |
10 | 80 | +/- 14 | 1 |
15 | 33 | +/- 4.2 | 1 |
15 | 326 | +/- 24 | 10 |
20 | 133 | +/- 12 | 10 |
25 | 47 | +/- 7 | 10 |
30 | 24 | +/- 2 | 10 |
Now, should we retain or throw away the 140 cell plate in the 5 second
exposure? The standard deviation of 101, 109, and 140 is 17, so 140 is just about
1 1/2 standard deviations away from the average. This is not very far; in fact, in
typical experiments about 1/3 of the points will be more than one standard
deviation away from the average. On the other hand, if the data for 30 seconds
had been 22, 25, 26, 31, and 196, then the average would be 60 and the standard
deviation about 68, but nearly all of that spread comes from the single point 196.
Compared with the other four points, which average 26 with a standard deviation
of only 3.2, the 196 lies more than 50 standard deviations out. Such a result is
extremely unlikely to be a mere fluctuation, and so we would drop that point and
report only on the other four: an average of 26 and a standard deviation of
3.2. (See Figure 4 for a plot of the data in this table and the standard deviations
used as error bars.)
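One way to automate this judgment is to compare each point with the mean and standard deviation of the remaining points. The function below is our own illustration, not a standard recipe; a cutoff of 4 standard deviations is one conservative choice:

```python
from statistics import mean, pstdev

def flag_outliers(data, threshold=4.0):
    """Flag points lying more than `threshold` standard deviations from the
    mean of the *other* points. Only meaningful with several points; with
    just three plates the leave-one-out spread is unreliable."""
    flagged = []
    for i, x in enumerate(data):
        rest = data[:i] + data[i + 1:]
        sd = pstdev(rest)
        if sd > 0 and abs(x - mean(rest)) / sd > threshold:
            flagged.append(x)
    return flagged

runs = [22, 25, 26, 31, 196]
bad = flag_outliers(runs)
kept = [x for x in runs if x not in bad]
print(bad)                                 # [196]
print(mean(kept), round(pstdev(kept), 1))  # average and SD of the kept points
```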
The two sets of data at 15 seconds furnish a final lesson about errors, the
usefulness of more data, and the justifiable massaging of data. The standard
deviation of the dilution 1 data is about 0.13 of the mean; therefore, we only know
the answer to 13%. The dilution 10 data are as is typical, more accurate: we know
the mean to about 7%. The larger the mean, the smaller the relative error (=
error/mean). We should probably use the more accurate data, but consider this
question before discarding the smaller numbers: How accurately do you suppose
the plates with 300+ colonies were counted? Mistakes easily occur when too
many colonies grow on a plate; we may miss colonies or have two colonies
counted as one because they are growing right on top of each other. If an average
of 500 colonies are growing on some plates, we would expect a lot of
experimental error because the number of colonies would probably be
systematically under-counted. This kind of error is called "systematic error." It is
an error that creeps into our data because of problems with our procedure. How
would you choose between data with a mean of 500 +/- 25 or of 50 +/- 8? We
would probably accept the 50 even though the relative error is 8/50 = 0.16,
because the data with the 500 mean is likely to contain a big systematic error. On
the other hand, a further dilution that gives data with a mean of 5 +/- 2 gives even
poorer data. Although miscounting is unlikely with so few colonies, the relative error
due to statistical fluctuations is 2/5 = 0.40. This sort of error, called "statistical error,"
is a big problem when you are dealing with small numbers. We could therefore
discard the plates giving 5 +/- 2 because of the large statistical error and just report
the 50 +/- 8 data. We should be conscious of statistical and systematic errors
when selecting data, and consider which of our procedures are likely to give the
most reliable data.
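The comparison comes down to the relative error, standard deviation divided by mean. A short sketch of the three hypothetical dilutions discussed above (the labels are ours):

```python
# (mean, standard deviation) for the three hypothetical dilutions
candidates = {
    "500 colonies/plate (systematic undercount likely)": (500, 25),
    "50 colonies/plate": (50, 8),
    "5 colonies/plate (large statistical fluctuations)": (5, 2),
}
for label, (m, sd) in candidates.items():
    print(f"{label}: relative error = {sd / m:.2f}")
```

The middle case has neither the crowding problem of the dense plates nor the small-number fluctuations of the dilute ones, which is why we would report it.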
Discard the widespread prejudice that all data are sacred, but don't yield to the temptation to discard results that are just a little bit higher or lower than you were expecting. You would then report only the data that reinforce your expectations. Do not discard data because they disagree with a pre-existing theory; instead, subject all results to the same rigorous scrutiny and accept nothing at face value. If you think that a procedure was faulty, then it is only honest to discard the data gathered using that procedure. No magic rule governs such decisions; we just have to think hard and be honest.
Figure 4: Semi-log plot of survival with a straight line
fit.
A Third Cut
We should also figure an appropriate standard deviation even when our data
appear too "good." In this experiment, the variability in the data at any dose is
caused by two factors: variability in technique and procedure, and the unavoidable
fluctuations we noted in the serial dilution experiment. (See the notes on statistics
and the serial dilution experiment.) If we recall that a fluctuation is roughly the
square root of the expected number of cells, and look at the table above, we realize
that the 0 second data are too closely bunched; we must have been lucky. When
we report these results or try to analyze them, we should report at least the
expected variability. In the next table, therefore, we report standard deviations at
least as big as the square root of the average number of colonies on the plate.
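In code, this "floor" on the reported standard deviation is a one-line rule (a helper of our own, sketching the policy just described):

```python
import math

def reported_sd(average, measured_sd):
    """Counting (Poisson) fluctuations alone give a spread of roughly
    sqrt(average), so never report a standard deviation smaller than that."""
    return max(measured_sd, math.sqrt(average))

print(round(reported_sd(125, 4.3), 1))  # 0 second data: 4.3 is raised to about 11.2
print(reported_sd(117, 17))             # 5 second data: 17 already exceeds sqrt(117)
```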