# Should p-values after model selection be multiple testing corrected?

I was recently comparing several candidate models (each a different time profile) for each gene in time-series RNA-seq data. Since the models were not simply nested, I used the Akaike Information Criterion (AIC) as the simplest option for selecting the “best” model (the Bayesian Information Criterion would have worked as well). In the analysis of genomic data, the next step is typically to threshold the corrected p-values (i.e., after correcting for multiple testing) to identify genes with statistically significant fits to the model(s).

Of course, AIC does not provide an overall measure of quality of fit for a gene, such as a p-value; it only gives a relative measure of the quality of fit of the competing models for a single gene. Since I was comparing simple linear models, I could obtain a p-value for each model fit using the standard F-test. The question then arises: since model selection involves multiple fits (i.e., tests), should the p-value from the best model for an individual gene already be multiple-testing corrected?

We are now going to use my standard trick to test whether a multiple-testing correction is needed. It goes as follows: if the approach is applied to random data generated under the null hypothesis $H_0$, then the final p-values produced by the approach must still be uniformly distributed. If the distribution is either conservative or anti-conservative, then the approach is not statistically sound.
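The trick can be illustrated on an idealized case. The sketch below is only an illustration under stated assumptions: it draws p-values directly from a uniform distribution instead of fitting any model, and it treats the three tests per gene as independent, which the actual model fits are not. The helper name `calibration_table` is made up for this example.

```python
import numpy as np

def calibration_table(p, alphas=(0.01, 0.05, 0.1, 0.5)):
    """Fraction of p-values at or below each threshold.

    For a valid procedure under H0 the fraction should be close to alpha.
    A larger fraction is anti-conservative, a smaller one conservative.
    """
    p = np.asarray(p, dtype=float)
    return {a: float((p <= a).mean()) for a in alphas}

rng = np.random.default_rng(1)
n_genes = 100_000

# One valid test per gene: p-values are uniform under H0.
p_single = rng.uniform(size=n_genes)

# Assumption: three independent tests per gene, keeping the smallest p-value.
p_min3 = rng.uniform(size=(n_genes, 3)).min(axis=1)

print("single test:", calibration_table(p_single))
print("min of three:", calibration_table(p_min3))
```

For independent tests, the probability that the smallest of three p-values falls below $\alpha$ is $1 - (1 - \alpha)^3$, so the second table should sit clearly above the nominal thresholds.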

We generate a random dataset of $10^4$ genes, each measured at 10 different time points (one time unit apart). We then fit three different time profiles to each gene: a linear trend, a quadratic profile, and a sinusoidal profile with a period of ten time units.
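A minimal sketch of such a simulation might look as follows. It is not necessarily the code behind this analysis, and it rests on several assumptions: the null data are independent standard-normal noise, the sinusoid has a free phase (modelled with a sine and a cosine term), and the AIC is the Gaussian least-squares form. The function names `design_matrices` and `fit_models` are hypothetical.

```python
import numpy as np
from scipy import stats

def design_matrices(t, period=10.0):
    """Design matrices (intercept included) for the three candidate profiles."""
    ones = np.ones_like(t)
    w = 2 * np.pi * t / period
    return {
        "linear": np.column_stack([ones, t]),
        "quadratic": np.column_stack([ones, t, t**2]),
        # Assumption: free phase, so both sine and cosine terms are fitted.
        "sinusoidal": np.column_stack([ones, np.sin(w), np.cos(w)]),
    }

def fit_models(y, designs):
    """Fit every model to every gene (row of y) by ordinary least squares.

    Returns two arrays of shape (n_genes, n_models): AIC values and
    overall F-test p-values (model versus intercept only).
    """
    n = y.shape[1]
    tss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    aic, pval = [], []
    for X in designs.values():
        k = X.shape[1]  # number of coefficients, intercept included
        beta, *_ = np.linalg.lstsq(X, y.T, rcond=None)
        rss = ((y.T - X @ beta) ** 2).sum(axis=0)
        # Gaussian AIC up to an additive constant shared by all models.
        aic.append(n * np.log(rss / n) + 2 * k)
        f_stat = ((tss - rss) / (k - 1)) / (rss / (n - k))
        pval.append(stats.f.sf(f_stat, k - 1, n - k))
    return np.array(aic).T, np.array(pval).T

rng = np.random.default_rng(0)
t = np.arange(10, dtype=float)             # 10 time points, one unit apart
y = rng.standard_normal((10_000, t.size))  # pure noise, so H0 holds for every gene

aic, pval = fit_models(y, design_matrices(t))
best = aic.argmin(axis=1)                        # AIC-selected model per gene
p_selected = pval[np.arange(len(best)), best]    # its uncorrected p-value

print("fraction of selected p-values below 0.05:", (p_selected < 0.05).mean())
```

Under a valid procedure the printed fraction would be close to 0.05; a histogram of `p_selected` shows the shape of the whole distribution.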

The resulting p-value distribution is non-uniform and anti-conservative. In other words, small p-values are over-represented, so thresholding them will produce more false positives than expected under $H_0$.

So, we next inspect the p-values after applying a Benjamini-Hochberg multiple-testing correction within each gene, across the models being compared. The p-value distribution is now almost uniform, as it should be under $H_0$, although it is slightly conservative (p-values biased away from small values). This suggests that the actual FDR obtained with this model selection and correction approach will be smaller than the nominal level, which is still acceptable.
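The per-gene correction can be sketched as below. This is an illustrative implementation of the standard Benjamini-Hochberg adjustment applied row by row (one row per gene, one column per model). The helper name `bh_adjust_rows`, the toy p-values, and the `selected` indices are all made up for the example.

```python
import numpy as np

def bh_adjust_rows(p):
    """Benjamini-Hochberg adjusted p-values, computed independently per row."""
    p = np.asarray(p, dtype=float)
    m = p.shape[1]                                  # number of models compared
    order = np.argsort(p, axis=1)
    sorted_p = np.take_along_axis(p, order, axis=1)
    scaled = sorted_p * m / np.arange(1, m + 1)     # p_(i) * m / i
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[:, ::-1], axis=1)[:, ::-1]
    adjusted = np.empty_like(p)
    np.put_along_axis(adjusted, order, np.clip(scaled, 0.0, 1.0), axis=1)
    return adjusted

# Toy input: two genes, three models each (hypothetical values).
pvals = np.array([[0.01, 0.04, 0.30],
                  [0.20, 0.50, 0.90]])
selected = np.array([0, 1])   # hypothetical index of the AIC-selected model

adjusted = bh_adjust_rows(pvals)
p_corrected = adjusted[np.arange(len(selected)), selected]
print(p_corrected)
```

With only three models per gene the adjustment is cheap, so it can be applied to the full matrix of per-model p-values before picking out the column of the selected model.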

If a p-value-based measure of quality of fit is desired after model selection, the p-value has to be corrected for multiple testing according to the number of models being compared.