F. Arteaga, M. Pérez-Bermejo

Catholic University of Valencia San Vicente Mártir (SPAIN)
In this work we address the problem of determining the minimum sample size required for an experimental study. To illustrate it, we focus on one of the main problems of statistical inference: comparing the means of a continuous random variable measured in two independent populations.

Our approach, however, differs from the usual one. Typically, the researcher begins by assuming that both populations are normal, fixes the difference in means considered significant in practical terms (clinical significance), specifies assumed values for the variances of both populations (which may be equal or different), and sets a desired statistical power, that is, the probability of detecting the prespecified difference if it really exists. This traditional approach has some advantages, such as the availability of a known formula, but it also has drawbacks: its ease of use comes from the rigid assumptions under which the formula is derived (normality, independence and equality of variances). Moreover, the formula obscures the concept of statistical power, which remains merely a value specified by the user. Statistical power is a difficult concept to grasp, especially when it is introduced as an input parameter that determines the minimum required sample size. In practice, many researchers set it at 80% or 90% without really knowing what that value implies.
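The abstract does not state which formula is meant. As an illustration only, a commonly used normal-approximation version for a two-sided test with equal group sizes is n = (z_{1-alpha/2} + z_{power})^2 (sd1^2 + sd2^2) / delta^2 per group. The sketch below assumes that version; the function name and its parameters are hypothetical and not taken from the paper.

```python
import math
from statistics import NormalDist


def classical_n_per_group(delta, sd1, sd2, alpha=0.05, power=0.80):
    """Per-group sample size from the normal-approximation formula.

    Assumptions (not from the paper): two-sided test, equal group sizes,
    known standard deviations sd1 and sd2, difference in means delta.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # quantile for the significance level
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    n = (z_alpha + z_power) ** 2 * (sd1 ** 2 + sd2 ** 2) / delta ** 2
    return math.ceil(n)  # round up to a whole number of observations
```

For example, with delta = 1 and both standard deviations equal to 1, the function returns 16 observations per group at 80% power and 22 at 90% power. Note how power enters only as a number plugged into the formula, which is the drawback discussed above.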

In addition, experience tells us that researchers usually already have a sample size in mind, conditioned by the availability of samples or by their budget. The researcher then crosses their fingers, hoping that the usual formulas will give the expected result. (We will also see that, even when the sample size given by the formulas is much larger than what is actually available, all is not necessarily lost: the required size can often be reduced using techniques that lower the variance of the variable under study.)

In this work we propose an alternative approach in which the researcher specifies the sample size and estimates the associated statistical power. This approach might seem to call for specialized software, since estimating statistical power requires a Monte Carlo simulation. However, we offer a simple solution based on an Excel spreadsheet, for which we provide all the details.

We propose a constructive method with the following advantages:
1. Facilitates understanding of the concept of statistical power, since it is built from its definition.
2. The result also includes a confidence interval for the statistical power and a random sample that allows studying its distribution.
3. The method is adaptable to changes in conditions: lack of normality, equal or different variances, ...

In our method, the user specifies the expected variances for both populations and the number of observations available for each. The method generates a Monte Carlo sample of the t statistic (with or without assuming equal variances) and of the associated p-values. The proportion of p-values below the chosen significance level (generally 5%) estimates the statistical power and allows us to construct a confidence interval for this parameter.
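The procedure just described can be sketched outside Excel as well. The following Python sketch is not the authors' spreadsheet; it uses only the standard library and makes several assumptions that are ours: data are drawn from normal distributions, the difference in means to be detected (`mean_diff`) is supplied by the user, the p-value is obtained by numerically integrating the Student t density, and the confidence interval is a simple normal-approximation (Wald) interval for a proportion. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def t_two_sided_p(t, df, steps=200):
    """Two-sided p-value of a Student t statistic with df degrees of freedom.

    Integrates the t density over [0, |t|] with Simpson's rule (steps is even);
    accurate enough for comparison against the usual significance levels.
    """
    t = abs(t)
    if t == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return min(1.0, max(0.0, 1.0 - 2.0 * central))


def simulate_power(n1, n2, mean_diff, sd1, sd2, alpha=0.05,
                   n_sim=2000, equal_var=False, seed=12345):
    """Estimate power as the share of simulated p-values below alpha."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_sim):
        # Assumption: normal populations; swap the generator to study non-normality.
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(mean_diff, sd2) for _ in range(n2)]
        v1, v2 = variance(x), variance(y)
        if equal_var:
            # Pooled-variance t test
            df = n1 + n2 - 2
            pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
            se = math.sqrt(pooled * (1 / n1 + 1 / n2))
        else:
            # Welch t test with Welch-Satterthwaite degrees of freedom
            a, b = v1 / n1, v2 / n2
            se = math.sqrt(a + b)
            df = (a + b) ** 2 / (a * a / (n1 - 1) + b * b / (n2 - 1))
        t = (fmean(y) - fmean(x)) / se
        p_values.append(t_two_sided_p(t, df))
    power = sum(p < alpha for p in p_values) / n_sim
    # Wald 95% interval for the estimated power (our choice of interval)
    half = NormalDist().inv_cdf(0.975) * math.sqrt(power * (1 - power) / n_sim)
    return power, (max(0.0, power - half), min(1.0, power + half)), p_values
```

A call such as `simulate_power(20, 20, mean_diff=1.0, sd1=1.0, sd2=1.0)` returns the estimated power, its interval, and the simulated p-values, whose distribution can then be inspected. Setting `mean_diff=0` should give a rejection rate close to the significance level, which is a useful sanity check.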

Although we illustrate the method with the problem of comparing the means of two independent populations, we explain how to extend it to the comparison of proportions and other related problems.
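The abstract does not detail the extension to proportions. One plausible way to adapt the same idea, given here only as a hedged sketch, is to simulate binomial counts in each group and apply a pooled two-proportion z test. The choice of test, the function name and the parameters are our assumptions, not the authors' procedure.

```python
import math
import random
from statistics import NormalDist


def simulate_power_proportions(n1, n2, p1, p2, alpha=0.05,
                               n_sim=5000, seed=2024):
    """Monte Carlo power estimate for comparing two proportions.

    Assumption: a two-sided pooled z test is used in every simulated study.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_sim):
        # Simulated number of successes in each group
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        if se == 0.0:
            continue  # all successes or all failures: counted as no rejection
        z = (x1 / n1 - x2 / n2) / se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        rejections += p_value < alpha
    return rejections / n_sim
```

As with the comparison of means, the estimated power is simply the fraction of simulated studies in which the p-value falls below the significance level.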