Arnold S. Chamove, B.A., M.A., Ph.D., M.Phil., F.I.Biol., Department of Psychology, Palmerston North, New Zealand (A.S.Chamove@Massey.ac.NZ)

How much is it worth to be able to cut the number of animals needed in an experiment in half? Is it worth your reading on?

One way of reducing the number of animals needed in an experiment is to use more sophisticated statistics. The method is not more difficult; in fact, it is less difficult once you know how. I'll describe a technique that enables you to cut the number of animals used by about 30-40 percent or more and to do your statistical analyses faster and more easily.

"What is the cost of all this generosity?" I hear you ask. To have all these advantages you must have some estimate of three things: your control/baseline mean, control/baseline variance, and the effect size. You will most likely have all of these already, but you will need to use them in a slightly different way than you have in the past.

Try a test on yourself to see if you have been or would be incorrect in your estimates of the numbers of subjects needed: Most people are aware that the intelligence quotient (IQ) is devised so that it has a mean of 100 and a standard deviation of 15. That means that if you take a random sample of people, the mean of their IQs will be close to 100. To give some feeling for the IQ, the mean of students at university is higher, about 120; genius is 130. Having a standard deviation of 15 means that about 68 percent of people fall within 15 points of the mean in either direction; in other words, about 68 percent of people have an IQ between 85 and 115. About 95 percent of people have an IQ that falls within two standard deviations (30 points) of the mean, that is, between 70 and 130. That is one of the valuable characteristics of the standard deviation: in a normally distributed population, about 95 percent of the sample lies within two standard deviations of the mean.
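The 68 and 95 percent figures can be checked directly against the normal distribution. The following is a minimal sketch in Python, assuming IQ is exactly normally distributed with a mean of 100 and a standard deviation of 15; the helper name `fraction_within` is invented for this illustration.

```python
from statistics import NormalDist

# Assumption for illustration: IQ is exactly normal with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

def fraction_within(dist, k):
    """Fraction of the population lying within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

within_one_sd = fraction_within(iq, 1)  # IQ between 85 and 115
within_two_sd = fraction_within(iq, 2)  # IQ between 70 and 130
print(f"{within_one_sd:.1%} within one SD, {within_two_sd:.1%} within two SDs")
```

Under that assumption the proportions come out at roughly 68.3 and 95.4 percent, which is why the round figures of 68 and 95 percent are normally quoted.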

With that in mind, if I said I had found something that would raise someone's IQ 30 points (or 20 points, or even 10 points), how many subjects do you feel you would have to test in order to come to a statistical decision at the 0.05 level for each of these three problems? To make it more realistic, a drug company has offered to pay you a lot of money for this substance that you claim will raise IQ, but you will have to pay them a lot before they start. Therefore you don't want to say it increases IQ if it doesn't as that will cost you a lot, nor say it doesn't if it does as you will miss out on a lot of money. Let's say the substance also causes side effects. So, you don't want to overestimate the number of subjects needed to be sure of detecting a difference, or underestimate to be cautious. If you have underestimated, you might miss something important. If you have overestimated, you might be exposing some of your subjects to unnecessary suffering. How many subjects do you think you would have to give the drug to see if the substance works? Give your estimate:

(a) for an increase of 10 points =____,

(b) for an increase of 20 points =____,

(c) for an increase of 30 points =____.

The true values are given in the last line of this paragraph. I predict you will have overestimated, especially in the condition where the effect was the most extreme, the 30-IQ-point condition. The answer for 10 points is 20 subjects, for 20 IQ points it is 5, and 30 IQ points need only two subjects to detect the difference.

When most of us are planning an experiment, there is more involved than just guessing. Normally, we decide on the number of subjects based on (a) how many subjects are available, (b) the cost to us, or to them, of the test procedure, and finally (c) that feeling about how many we will need to make the sort of decision that we hope to make. If there are few subjects available, if testing is costly or distressing to subjects, or if we feel the effect we want to show is a big one, then we use fewer subjects. If there are lots of potential subjects, if testing is quick or perhaps even beneficial to the subjects, or if the effect that we are trying to detect is a tiny one, we select more subjects.

A common question that is posed by students or experimentally naive colleagues is, "How many subjects do I need?" The answer still most commonly given is, "As many as I can afford to test."

A few of the existing statistical packages will compute the number of subjects needed with varying degrees of ease. Here, I have used STATISTICA by StatSoft (5) and I found it easy to use. Power analysis is hidden away in the Process Analysis section of STATISTICA. You may find it of interest to know that power analysis (and sequential sampling, described below) is commonly used by "quality control engineers...to determine how many items from a batch to inspect in order to ensure that the items in that batch are of acceptable quality." (5, p. 3,571).

To use the technique, you need to know three things: First, you need to know the comparison mean, the mean of the control group. In our example above the mean was an IQ of 100, but it could be the average number dying under some treatment, the percentage normally infected, the amount of pain under some procedure, the average frequency or duration of aggression without the recommended intervention, etc.

Second, you need some measure of variability, generally the standard deviation. We used the value of 15 in our IQ example as it is widely known. If you have a mean for your comparison condition, you will usually already have a measure of variability.

Your third requirement is an estimate of effect size. We are not ordinarily asked for an estimate of effect size, but we often have one in mind. To use the IQ example again, if our substance only increased IQ by one point, and even if the effect were statistically significant, we would say that the substance was not valuable, that the effect size was so small as to be uninteresting, that no one would buy it. We know that an extremely small effect size is worthless.

There are a number of things that can help in determining effect size. The most common is other research. I carry out research in the area of environmental enrichment, formerly with monkeys and more recently with farm animals (1). One difficulty in that area is deciding on an appropriate effect size, and I know of no research where anyone has reported estimating effect size.

Change is easy; improvement is easy too. Almost anything you do to a monkey alone in a zoo enclosure, to a rat living in a bare laboratory cage, or to a pig farmed in a small crate will produce a change in their behavior. But is that change large enough to be important, large enough to be worth the expense to make that change? Is the improvement to the welfare/behavior of chickens large enough to warrant the cost of increasing the cage size?

Another aid in determining whether an effect size is sufficiently large to be important is *percent of variance accounted for*. This is measured by r^{2} in correlation and by omega-squared (ω^{2}) in analysis of variance.
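As a rough illustration of the two measures, the sketch below computes r^{2} for a pair of variables and ω^{2} for a one-way analysis of variance using the usual textbook formulas. The function names and the small data sets are hypothetical and are not taken from any study discussed here.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation: share of variance in y associated with x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def omega_squared(groups):
    """Omega-squared for a one-way ANOVA, given a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_within = ss_within / (n_total - k)
    ss_total = ss_between + ss_within
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

# Made-up scores, for illustration only.
control = [1, 2, 3]
enriched = [4, 5, 6]
print(round(omega_squared([control, enriched]), 3))
print(round(r_squared([1, 2, 3], [1, 3, 2]), 2))
```

Both values run from roughly 0 to 1, so a result of, say, 0.05 indicates that the manipulation accounts for about 5 percent of the variability in the scores, however statistically significant it may be.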

Recently, I have been assessing the effects of visual shelter on levels of stress and aggression in farm animals. Now, if planting a row of trees will reduce aggression in deer, as it will (8), will you the farmer plant the trees? Before speculating on an answer, you might be excused for asking (a) the cost of the trees and (b) the degree to which the aggression will be reduced--that is, the cost/benefit analysis. The benefit question is the same question the researcher must ask him/herself to get an estimate of effect size. How big a reduction in aggression must I find before I conclude there are benefits in providing visual shelter?

The literature suggests that in monkeys, in bulls, and in rats, providing a visual barrier behind which some of the animals can hide will reduce aggression by about 50 percent. We might set our effect size near that value. Using that value, we can then calculate the number of subjects we need to test to see if visual shelter reduces aggression by at least half in deer. (In case you're interested, it does.)

The program STATISTICA asks for the mean under the null hypothesis (IQ=100), the standard deviation (IQ=15), the effect size (IQ=130 in our example above). It also asks for the alpha level (conventionally p=0.05), the beta level (conventionally=0.1), and whether the predictions are one- or two-tailed. After you enter these numbers, STATISTICA presents you with the number of subjects you will need to test--sounds easy.
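The article does not say how STATISTICA arrives at its figure, but a standard textbook approximation gives a feel for the calculation. The sketch below assumes a one-sample, one-tailed z-test with a known standard deviation, for which the required number of subjects is ((z_alpha + z_beta) x SD / difference) squared. The function name `subjects_needed` is hypothetical, and the default alpha and beta are the conventional values mentioned above.

```python
from statistics import NormalDist

def subjects_needed(sd, difference, alpha=0.05, beta=0.10):
    """Approximate sample size for a one-sample, one-tailed z-test.

    Assumes the standard deviation is known; returns an unrounded value.
    """
    z = NormalDist()                 # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha)   # critical value for the alpha level
    z_beta = z.inv_cdf(1 - beta)     # critical value for power = 1 - beta
    return ((z_alpha + z_beta) * sd / difference) ** 2

# IQ example: SD of 15, hoped-for gains of 10, 20, and 30 points.
for gain in (10, 20, 30):
    print(gain, round(subjects_needed(15, gain), 1))
```

Under these assumptions the approximation gives about 19.3, 4.8, and 2.1 subjects for gains of 10, 20, and 30 IQ points, in line with the 20, 5, and 2 quoted for the IQ exercise. A two-tailed test or an unknown standard deviation would raise these numbers.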

Then you carry out your experiment, do your test for statistical significance (for example, t-test or ANOVA), and draw your conclusions. But there is an even easier method and one that is even more powerful.

In the fixed sampling experiments we are all familiar with, there are two possible decisions we can make: either reject the null hypothesis and conclude that the groups are different, or fail to reject the null hypothesis and, by default, conclude that the groups are not different. In the sequential experiment there are three possible decisions. The first two are the same as above, but the third is a different one: either keep sampling, or stop sampling and conclude that a decision about the null and alternative hypotheses cannot be made. In other words, more samples are needed before a decision can be made as to whether the two groups differ or not.

This third alternative, let's call it the devil's alternative, is likely only when the variability is large and the difference between the means is very small. This condition is the only penalty of using sequential sampling, other than having to have an estimate of effect size.

To actually employ sequential sampling, you need only have the same equipment as for the power analysis--mean, variability, and effect size. You can enter these values into STATISTICA and, instead of pressing the *fixed sample* button, you press the *sequential sampling* button.

To illustrate, we will use some real data and superimpose that data on the graph generated by STATISTICA, just as one would do in reality. The only difference is that I have already tested all the subjects, whereas in a sequential sampling experiment, one would test only one or two subjects at a time. We will return to a monkey enrichment experiment I did, illustrated in a widely displayed video (6). To test whether a small forage box would improve conditions for individually housed marmoset monkeys, I decided that the monkeys would have to be more active, in fact at least twice as active as monkeys living in a bare cage without the ability to forage. The control or baseline level for a group of monkeys living singly in bare cages was 12.9 percent of the day spent active (with a standard deviation of 9.7). Our effect size of doubling means that with the forage box, they must spend at least 26 percent of the day active for us to conclude that the effect of the forage box is important.

To restate the techniques that I could have used to decide how many subjects to use, I could just guess how many to use. How many monkeys would you say I would need to use in order to see if the forage box increases total daily activity? Answers please, now ____. I would have estimated a minimum of 15 but only 13 were available to test at baseline; this further decreased to 9 by the time of the retest. I could have done a power analysis, so I have done one now. In the power analysis, STATISTICA helps us calculate that we would need nine monkeys. Or I could have used sequential sampling, and I have done this in detail below.

**Figure 1. Cumulative deviations from the control mean in activity by nine individually housed Cotton-top tamarins.**

The following is the procedure for sequential sampling. To see if the manipulation is significant at the prescribed effect size, you simply graph your data. To do this you give STATISTICA the information we have already ascertained, namely mean, standard deviation, and effect size, and it will produce the figure reproduced here (fig. 1). Superimposed on the graph is the data I actually obtained. STATISTICA will plot that data too if you have it. The sequential sampling plan produces a graph on which there is one parallel corridor (a two-tailed test has two corridors) leading gradually away from the baseline of "no difference." You plot the score obtained from each subject in turn, or more precisely the cumulative deviation of the subjects' scores from baseline. If that plot remains within the corridor, you can conclude neither that your forage box is effective nor that it is useless; that is, you should keep sampling. If the plot drops below the corridor, you can conclude that behavior with the forage box is no different from the control condition; if the plot goes above the corridor, you conclude the forage box is effective in improving behavior, that is, you reject the null hypothesis.

In the example we have used, I could have tested just one subject and plotted her data on the graph. You can see that if that subject's improvement score had been over 30, that is, if the subject had increased her score from the baseline mean of 13 to above 43, her score would have fallen above and outside the corridor, and I would have been able to stop testing and conclude that the forage box had at least doubled activity and that the effect was significant at p=0.05. In fact her score was only 33, an improvement of 20 points above baseline, and that score of 20 is plotted on the graph. The second monkey had a score of 28, 15 above baseline, and so 15 is added to the plot. The cumulative score of the two monkeys, 35, now extends just above the corridor, and testing can be stopped.

To arrive at a decision with the same degree of certainty as I did with just 2 subjects using sequential sampling, I would have had to test 9 subjects had I done a power analysis, and I would have used 15 had I gone by my own intuition, based on over 30 years of research with monkeys. You can see that a power analysis reduces the number of subjects used by 40 percent (from 15 to 9), but the sequential sampling technique reduces the number needed even further. In this example, sequential sampling reduced the number of subjects by more than 75 percent relative to the power analysis (from 9 to 2) and even more relative to my educated guess.

Why such a huge reduction? If the forage box had improved behavior only by the bare minimum allowed, 13, it still would have taken only 4 subjects before going outside the corridor. But because the box was so effective, almost trebling the amount of activity, the scores rapidly exceeded that corridor of "no significant difference." That is one of the unexpected benefits of sequential sampling not found in any form of fixed sampling. In the case that the effect size is even greater than postulated, even fewer subjects are needed.
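The corridor can be approximated with Wald's sequential probability ratio test (7) for a normal mean with known standard deviation. The sketch below is a reconstruction under that assumption, not STATISTICA's documented procedure, so its boundary values need not match the figure exactly. The function names are hypothetical; the inputs are the standard deviation of 9.7, the minimum improvement of 12.9 points, alpha=0.05, and beta=0.1.

```python
from math import log

def corridor(n, sd, effect, alpha=0.05, beta=0.10):
    """Lower and upper limits for the cumulative deviation after n subjects."""
    scale = sd ** 2 / effect
    midline = n * effect / 2          # the corridor rises by effect/2 per subject
    lower = scale * log(beta / (1 - alpha)) + midline
    upper = scale * log((1 - beta) / alpha) + midline
    return lower, upper

def sequential_test(deviations, sd, effect, alpha=0.05, beta=0.10):
    """Add subjects one at a time until the cumulative deviation leaves the corridor."""
    total = 0.0
    for n, deviation in enumerate(deviations, start=1):
        total += deviation
        lower, upper = corridor(n, sd, effect, alpha, beta)
        if total >= upper:
            return n, "effective (reject null hypothesis)"
        if total <= lower:
            return n, "no different from control"
    return len(deviations), "keep sampling"

# The two monkeys described above improved by 20 and 15 points over baseline.
print(sequential_test([20, 15], sd=9.7, effect=12.9))

# Hypothetical subjects who each improve by only the bare minimum of 13 points.
print(sequential_test([13] * 10, sd=9.7, effect=12.9))
```

Under these assumptions the cumulative score of 35 crosses the upper limit (about 34) at the second monkey, and a steady improvement of 13 points per subject crosses it at the fourth, which agrees with the counts given in the text.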

Imagine that in the study described above, the experimental manipulation was a painful one. We would want to know if it worked but would also wish to keep the number of subjects used to a minimum in case it was not effective. Or we might want to minimize subject use because we wanted to know if the compound was toxic. In this example, I should use only two subjects, and with sequential sampling I could. To reiterate, in almost all cases, sequential sampling procedures are preferable to fixed sampling procedures because sequential sampling is more powerful, in that fewer subjects are required in order to arrive at a decision with the same degree of certainty.

Why don't ethics committees insist on it? Probably because they have never heard of it. Until recently, the computations to calculate power and sequential sampling have been tedious, and the technique has not been described in textbooks. STATISTICA has changed all that.

Do you want to read more about sequential sampling? The math-phobic can read about these techniques in a chapter by Edwards (2), and the behavioral scientist in a friendly text by Leavitt (3). The stats-sophisticate can consult Pyzdek (4) or go back to the man himself, Wald (7). Finally, the statistics package STATISTICA (5) will take the practitioner quickly and painlessly through the mechanics of actually computing the numbers. I was unable to find procedures for sequential sampling in other commonly used statistical packages.

1. Chamove, A.S. (1994). Enrichment: Past and future. *Australian and New Zealand Council for the Care of Animals in Research and Teaching News* 7: 4-5.
2. Edwards, H.P. (1986). Sequential experimentation, or count your chickens as they hatch. In *The Fascination of Statistics*, R.J. Brook, G.C. Arnold, T.H. Hassard, R.M. Pringle (eds.). Marcel Dekker: New York, pp. 193-202.
3. Leavitt, F. (1991). *Research Methods for Behavioral Scientists*. Wm. C. Brown: Dubuque, IA, pp. 250-259.
4. Pyzdek, T. (1989). *What Everyone Should Know about Quality Control*. Marcel Dekker: New York.
5. StatSoft, Inc. (1994). *STATISTICA for Windows*. StatSoft Inc., Tulsa, OK, pp. 3,567-3,612.
6. UFAW (1990). *Environmental Enrichment: Advancing Animal Care* [video]. (Available from Universities Federation for Animal Welfare, 8 Hamilton Close, South Mimms, Potters Bar, England, EN6 3QD.)
7. Wald, A. (1947). *Sequential Analysis*. Wiley: New York.
8. Whittington, C.J. and Chamove, A.S. Effects of visual cover on farmed red deer behaviour. *Applied Animal Behaviour Science* (in press).

This article appeared in the

**Go to:**

Contents, *Animal Welfare Information Center
Newsletter*

The Animal Welfare Information Center

U.S. Department of Agriculture

Agricultural Research Service

National Agricultural Library

10301 Baltimore Ave.

Beltsville, MD 20705-2351

Phone: (301) 504-6212

FAX: (301) 504-5181

Contact us: http://awic.nal.usda.gov/contact-us