IJE Advance Access originally published online on October 30, 2007
International Journal of Epidemiology 2007 36(6):1363-1369; doi:10.1093/ije/dym215
Published by Oxford University Press on behalf of the International Epidemiological Association © The Author 2007; all rights reserved.

Investigating gene–environment interaction in complex diseases: increasing power by selective sampling for environmental exposure

M P M Boks1,*, M Schipper2, C D Schubart1, I E Sommer1, R S Kahn1 and R A Ophoff3,4

1The Rudolf Magnus Institute of Neuroscience, Department of Psychiatry University Medical Centre Utrecht, Utrecht, The Netherlands.
2Centre for Biostatistics, University of Utrecht, Utrecht, The Netherlands.
3The Rudolf Magnus Institute of Neuroscience, Department of Medical Genetics, University Medical Centre Utrecht, Utrecht, The Netherlands.
4The Center for Neurobehavioral Genetics, University of California, Los Angeles, CA, USA.

*Corresponding author. Rudolf Magnus Institute of Neuroscience, Department of Psychiatry B01.206, University Medical Centre Utrecht, PO box 85500, 3508 GA Utrecht, The Netherlands. E-mail: mboks{at}umcutrecht.nl


    Abstract
Background The often limited influence of disease-associated alleles on the vulnerability to complex diseases has led to increased interest in environmental interaction with genotype. However, gene–environment interactions (GEIs) are not easily studied, since large numbers of subjects are required to detect GEI.

Methods and results This study provides a potentially useful method to increase the power of such studies through selective sampling for environmental exposure. We show that selecting the top and bottom 10% regarding environmental exposure can lead to a 70% reduction in the required number of subjects for genotyping.

Conclusion This study demonstrates the potential usefulness of selective sampling in the study of the interplay between genes and environment. The reduction of required subjects can be particularly advantageous in studies where genotyping is extensive, such as in whole genome screens or in studies where phenotyping is expensive.


Keywords Genotype, gene–environment interaction, environmental exposure, sample size, quantitative trait

Accepted 26 September 2007

The often limited influence of disease-associated alleles on the vulnerability to complex diseases has led to increased interest in environmental influences on complex diseases such as diabetes [1] and schizophrenia [2] and in their interaction with genotype. Increasing evidence suggests that many disease-associated genes interact with environmental exposure. In schizophrenia, for instance, there is compelling evidence that the COMT(158) Val/Val genotype (MIM +116790) interacts with exposure to cannabis in increasing the risk for psychosis [2]. Investigating such gene by environment interaction (GEI) can contribute to the identification of environmental as well as genetic risk factors for complex human diseases.

There are, however, several methodological and conceptual issues that complicate research into GEI, the low statistical power of traditional GEI studies being the major obstacle. Studies that simulated the power to detect GEI demonstrated that the sample sizes required to detect GEI are much larger than those necessary to detect genetic or environmental factors in isolation [3]. Large samples lead to high genotyping costs, particularly when large amounts of genetic data are collected, for instance in studies where whole genome screens are performed. However, with the rapidly falling costs of genotyping and the increasing complexity of the phenotypes measured, there are also studies where phenotyping costs outweigh the genotyping costs. Studies where the phenotype is determined by Magnetic Resonance Imaging (MRI), such as the study of brain volumes in schizophrenia patients [4], are good examples. To avoid high costs for such studies, alternative strategies that limit the number of required subjects can be of great use.

To improve the power for detecting GEI in case-control studies, the use of selective sampling methods has been proposed. Previous studies have pointed out that selecting subjects for genotyping who are extremely discordant regarding phenotype can substantially improve the power to detect association and linkage [5–7]. We propose that selecting subjects with extremely high and low environmental exposure can markedly improve the power to detect GEI in a similar way. In this study, we investigated the power of such a design by deriving the equations for power and subsequently performing power simulations.


    Methods
Consider a standard linear regression model in which the interaction term is defined by the presence of a difference between the two regression coefficients β1 and β2 [3]. We define β3 as the difference between β1 and β2, and the genotype Gi as 1 when at risk and 0 when not at risk. The regression model can then be expressed as:


Formula

Yij is the continuously distributed outcome; Xij is a continuous covariate reflecting the environmental exposure and Gi the genotype (whether or not at risk). The εij is a stochastic error term, assumed to be normally distributed with variance σ².
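To make the definition of the interaction concrete, the following is a minimal sketch that simulates two genotype groups with different exposure slopes and estimates the interaction as the difference between the two fitted slopes. The function names, the sign convention (at-risk slope minus not-at-risk slope), the slope values 0.2 and 0.7 and the standard-normal exposure are all illustrative assumptions, not taken from the paper.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (one covariate plus intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_group(rng, n, slope, intercept=0.0, sigma=1.0):
    """Draw n subjects with standard-normal exposure and a linear outcome."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]
    return x, y


def estimate_interaction(n_not_at_risk, n_at_risk,
                         slope_not_at_risk, slope_at_risk, seed=1):
    """Estimate the interaction as the difference between the group slopes."""
    rng = random.Random(seed)
    x0, y0 = simulate_group(rng, n_not_at_risk, slope_not_at_risk)
    x1, y1 = simulate_group(rng, n_at_risk, slope_at_risk)
    # Assumed convention: at-risk slope minus not-at-risk slope.
    return ols_slope(x1, y1) - ols_slope(x0, y0)


# Hypothetical slopes: the true difference is 0.7 - 0.2 = 0.5.
estimate = estimate_interaction(20000, 20000,
                                slope_not_at_risk=0.2, slope_at_risk=0.7)
print(round(estimate, 2))
```

With large groups the estimate should lie close to the true slope difference; with small groups it scatters widely, which is the power problem the article addresses.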

If we then define m as the ratio between the two sample sizes (m = N0/N1), and if the distribution of the environmental exposure in the at-risk and not-at-risk groups is normal and equal, then the power in groups with unequal sample size is given by the following equation (for the derivation, see the Appendix):


Formula

T2n–4 signifies a cumulative t distribution with 2n – 4 degrees of freedom.
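As a rough illustration of how power depends on the group sizes and on the spread of the exposure, the sketch below uses a normal approximation rather than the t distribution of the article's equation, so it is not the authors' formula. It assumes the usual least-squares result that each group's slope has variance σ² / (N × var(X)); the function name `approx_power` and the parameter `var_x` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(beta3, n_at_risk, m, sigma=1.0, var_x=1.0, alpha=0.05):
    """Approximate two-sided power for testing H0: beta3 = 0.

    beta3     : true difference between the two regression slopes
    n_at_risk : number of at-risk subjects (N1)
    m         : ratio N0 / N1 of not-at-risk to at-risk subjects
    sigma     : residual standard deviation
    var_x     : variance of the exposure in the analysed sample (assumed
                equal in both groups)
    """
    n_not_at_risk = m * n_at_risk
    # Standard error of the slope difference under the stated assumptions.
    se = sigma * sqrt((1.0 / n_at_risk + 1.0 / n_not_at_risk) / var_x)
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(beta3) / se
    # Probability of rejecting in either tail.
    return z.cdf(shift - critical) + z.cdf(-shift - critical)


# Larger exposure variance (as after extreme sampling) raises power.
print(approx_power(0.25, 100, 1.0, var_x=1.0))
print(approx_power(0.25, 100, 1.0, var_x=3.0))
```

For small samples the t-based equation gives somewhat lower power than this approximation, so the sketch is best read as showing the direction of the effects rather than exact values.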

To investigate whether the power to detect a GEI improves as a result of a selective sampling strategy, we used simulation with R [8]. Different allele frequencies were used to determine the proportions of the risk and non-risk groups based on the genetic model under the assumption of Hardy–Weinberg equilibrium. When p is the frequency of the rare allele, the frequency of the genotypes in the not-at-risk group is equal to (1 – p)² for a dominant and 1 – p² for a recessive model. In the at-risk group, genotype frequencies are p(2 – p) in a dominant and p² in a recessive model. The proportion of risk/non-risk is therefore p(2 – p)/(1 – p)² in a dominant model and p²/(1 – p²) in a recessive model. We simulated 800 populations with sizes ranging from 10 to 8000 and a normally distributed outcome variable with mean 0 and variance 1. We calculated the power under different sample strategies and model parameters by means of the analytically derived equation. We modelled the differences in the effect of the environmental exposure on the trait between genetically predisposed and non-predisposed subjects. These interaction effects are reflected by the differences in regression coefficients. The graphs reflect situations where the size of the interaction results in an absolute difference in coefficients of 0.1–0.8, equivalent to 1% and 64% of explained variance, respectively. For other situations, an R script to calculate the power is available on request. These differences are applicable to the whole range of environmental effects and are also independent of the genetic effect size. Although the effect of many environmental exposures is fairly strong, the interaction with genotype is likely to be conferred by multiple genes in most if not all complex disorders. The effects of a single genetic variant will therefore be much smaller.
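The group proportions implied by Hardy–Weinberg equilibrium can be sketched as follows. The function names are hypothetical; the only facts used are that carriers of at least one rare allele are at risk under a dominant model, that homozygotes for the rare allele are at risk under a recessive model, and that the at-risk and not-at-risk proportions sum to one.

```python
def risk_group_frequencies(p, model):
    """Return (at_risk, not_at_risk) genotype proportions under Hardy-Weinberg.

    p     : frequency of the rare (risk) allele, strictly between 0 and 1
    model : "dominant" or "recessive"
    """
    if not 0.0 < p < 1.0:
        raise ValueError("allele frequency must lie strictly between 0 and 1")
    if model == "dominant":
        # At least one risk allele: 2p(1 - p) + p^2 = p(2 - p).
        at_risk = p * (2.0 - p)
    elif model == "recessive":
        # Two risk alleles: p^2.
        at_risk = p ** 2
    else:
        raise ValueError("model must be 'dominant' or 'recessive'")
    return at_risk, 1.0 - at_risk


def risk_to_non_risk_ratio(p, model):
    """Ratio of at-risk to not-at-risk subjects for the given model."""
    at_risk, not_at_risk = risk_group_frequencies(p, model)
    return at_risk / not_at_risk


for model in ("dominant", "recessive"):
    print(model, risk_group_frequencies(0.2, model))
```

For an allele frequency of 0.2 the at-risk group makes up about 36% of subjects under a dominant model but only about 4% under a recessive model, which is why the recessive model needs far larger samples.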
For the purpose of calculating the power for genetic studies, situations where a single genetic variant explains about 5% of the variance of the interaction term (similar to the average explained variance of most complex traits) are more realistic. This would correspond to a difference in regression coefficients of 0.23. The effect size of genotype on the trait is relevant, since a small genetic effect reduces the probability that the effect of the environment will be substantially different for subjects with other genotypes. However, this issue is subject to debate. Some authors argue that genes that do not show an association with a trait are unlikely to be involved in GEI [9], whereas others point to the possibility of finding interactions in exposed samples for genes that have not previously been identified as associated with the trait [10]. A similar argument can be made for the effect of environmental exposure. If there is no known effect of the environmental exposure on the trait, the probability that there is an interaction with genotype is reduced. However, there may be environmental exposures that can only be identified in subjects with genetic vulnerability.
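The correspondence between coefficient differences and explained variance quoted above can be checked with a short sketch. It assumes that both the exposure and the outcome have unit variance, so that the explained fraction is simply the squared coefficient; the function names are hypothetical.

```python
from math import sqrt


def explained_variance(coefficient_difference):
    """Fraction of variance explained, assuming unit-variance exposure and outcome."""
    return coefficient_difference ** 2


def coefficient_for_explained_variance(fraction):
    """Coefficient difference that explains the given fraction of variance."""
    return sqrt(fraction)


print(explained_variance(0.1))   # lower end of the plotted range
print(explained_variance(0.8))   # upper end of the plotted range
print(coefficient_for_explained_variance(0.05))
```

Under this assumption, 5% of explained variance corresponds to a coefficient difference of about 0.22, close to the value of 0.23 used in the text.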


    Results
The results of this power simulation are plotted in Figures 1 and 2. They present the power as a function of sample size for different allele frequencies and a fixed value of β3 of 0.1, in a recessive and dominant model, respectively.


Figure 1 Power as a function of sample size in a recessive model for different allele frequencies and sample strategies

 

Figure 2 Power as a function of sample size in a dominant model for different allele frequencies and sample strategies

 
Figures 3 and 4 show the required sample size for a power of 0.8, with an α of 0.05, as a function of sample percentages, for different allele frequencies and different sizes of the interaction effect (β3) in a recessive and dominant model, respectively.


Figure 3 Required sample size for different strengths of the interaction (β3) as a function of sample strategy in a recessive model

 

Figure 4 Required sample size for different strengths of the interaction (β3) and allele frequencies as a function of sample strategy in a dominant model

 
The figures show that the power depends on the genetic model, the allele frequencies and the size of the interaction effect (β3). Figures 3 and 4 show that the absolute reduction of the required number of subjects depends on the genetic model, the allele frequency and the strength of the interaction. In contrast, the relative reduction in the required sample size (as compared with a non-selected sample) depends only on the selection strategy. Table 1 shows the reduction in the number of subjects as a percentage of the originally required number of subjects for different sample strategies, which illustrates the beneficial effect of this strategy on required sample sizes. When sampling from the top and bottom 10% of subjects, a reduction in the number of required subjects of 69.3% can be obtained.
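One way to see why the relative reduction depends only on the selection strategy is sketched below. It rests on two assumptions that are not spelled out in the text: the exposure is standard normal, and the required sample size is inversely proportional to the exposure variance in the selected sample. The function names are hypothetical, and this is not the authors' R script.

```python
from statistics import NormalDist


def exposure_variance_in_tails(tail_fraction):
    """Variance of a standard-normal exposure restricted to both extreme tails.

    tail_fraction is the share of the population taken from each tail,
    for example 0.10 for the top and bottom 10%.
    """
    if not 0.0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must lie in (0, 0.5]")
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(1.0 - tail_fraction)
    # For a standard normal, E[X^2 | |X| > c] = 1 + c * phi(c) / q, and the
    # mean of the two-sided selection stays 0 by symmetry.
    return 1.0 + cutoff * std_normal.pdf(cutoff) / tail_fraction


def relative_reduction(tail_fraction):
    """Relative reduction in required subjects versus an unselected sample."""
    return 1.0 - 1.0 / exposure_variance_in_tails(tail_fraction)


for fraction in (0.10, 0.20, 0.30, 0.40, 0.50):
    print(fraction, round(100.0 * relative_reduction(fraction), 1))
```

Under these assumptions the sketch gives a reduction of roughly 69% for the top and bottom 10% and roughly 20% for the top and bottom 40%, in line with the 69.3% and 19.7% reported in the article, and no reduction when the whole population is included.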


Table 1 Reduction in the required sample size to obtain the same power as a percentage of the originally required number of subjects for different selection strategies

 

    Discussion
We investigated whether sampling subjects from the extremes of environmental exposures can increase the power to detect gene–environment interactions.

The simulations demonstrate that selective sampling can indeed significantly increase the power. The figures and formulas provided earlier can be used to estimate the required sample size under different conditions.

To clarify the potential benefit of this strategy, an example follows.

We aim to study the interaction between cannabis exposure and the risk for psychosis. From over 4000 subjects aged between 18 and 25, we will select 200 subjects with no cannabis exposure and 200 heavy users of cannabis whose cannabis exposure is in the top 10%. With an allele frequency of the disease-associated COMT allele of around 0.2 [11] and an expected interaction effect of around 0.25 [2], the required number of subjects in a recessive model is about 165 using the selective sampling approach. We thus aim to replicate the interaction that an earlier study found in an unselected sample of 805 subjects [2]. This demonstrates that the numbers included in our study will also be sufficient to detect smaller interactions, or those conferred by genes with lower (less favourable) allele frequencies or under a different genetic model.
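The selection step itself can be sketched in a few lines. The helper below is hypothetical and simply returns the indices of the subjects in the lowest and highest fraction of a list of exposure scores; ties (for example, many subjects with zero exposure) are broken by the original order, which a real study would need to handle more deliberately.

```python
def select_extremes(exposures, fraction):
    """Return (low_ids, high_ids): subject indices in the bottom and top `fraction`.

    exposures : list of exposure scores, one per subject
    fraction  : share of the cohort taken from each extreme, e.g. 0.10
    """
    if not 0.0 < fraction <= 0.5:
        raise ValueError("fraction must lie in (0, 0.5]")
    k = int(len(exposures) * fraction)
    # Indices ordered from lowest to highest exposure (stable for ties).
    order = sorted(range(len(exposures)), key=lambda i: exposures[i])
    return order[:k], order[len(order) - k:]


# Ten hypothetical subjects; take the bottom and top 20%.
scores = [5, 1, 9, 3, 7, 0, 8, 2, 6, 4]
low_ids, high_ids = select_extremes(scores, 0.2)
print(low_ids, high_ids)
```

Only the selected subjects would then be genotyped or phenotyped, which is where the cost saving arises.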

Our proposed sampling strategy does not preclude the possibility of studying the effect of genotype or environmental exposure on the trait. The power of this approach with respect to the genetic effect will be similar to that of the extreme sampling approach [5]. The benefit of selection for environmental exposure is limited to increasing the power to detect interaction. In a large birth cohort, with measures of both outcome and the environmental factor, our strategy clearly has major advantages when selective sampling for both exposure and phenotype is possible. It will increase the possibility of detecting GEI, whilst preserving power to detect main effects. However, if the sample is too small to sample the extremes of exposure effectively while maintaining sample size, there is no other option than to broaden the inclusion. Even then, it is still worthwhile to sample from the extremes of environmental exposure. Table 1 shows that sampling the top and bottom 40% of the population for exposure (including 80% of the population) still yields a 19.7% reduction in required subjects. Since the number of subjects required to detect GEI is generally higher than the number needed to detect gene main effects, this reduction generally does not preclude a successful analysis of genetic or environmental effects. Indeed, there are sophisticated statistical tools available that allow simultaneous analysis of genetic, environmental and interaction effects in case-control studies [12]. However, in studies where measuring the environmental exposure is the most costly step, such as K-XRF for measuring bone lead or plasma organochlorine measurements as a measure of pesticide exposure, our method has major disadvantages. The environmental exposure of large numbers of subjects would need to be measured for screening without those subjects being included in the study. In these situations, the case-only design provides a better alternative to both our approach and a straightforward case-control design.
The case-only design can improve power by up to 40% compared with traditional case-control designs [13] and is not dependent on extensive screening. However, in situations where both designs can be applied, our sample strategy can have superior power compared with the case-only design. Additional advantages over the case-only design are the possibility of simultaneously investigating genetic and environmental effects and easier implementation in cohort studies or case-control designs where the controls are already available.

For successful application of this approach, the trait as well as the environmental exposure under study should be measurable on a continuous (or at least ordinal) scale. Fortunately, this is possible in most complex diseases. In diabetes, for example, insulin requirements can serve as the quantitative phenotype and dietary intake (in calories) as the environmental exposure. Nevertheless, previous studies have demonstrated the need to measure both the environmental exposure and the complex trait as precisely as possible, in order to prevent a significant loss in power [14,15]. Another potential limitation of our approach is the risk that selective sampling leads to population stratification. As with most selective sampling designs, our strategy carries a risk of introducing confounding. To counteract confounding with respect to genetic vulnerability, care should be taken to also sample in the less obvious groups, generally those subjects with low environmental exposure but a high symptom score. Failing to do so will lead to a selection bias that excludes subjects with high genetic vulnerability, who would otherwise have been included in the group with low exposure but high symptom score. This risk can be limited by ensuring that there are no differences regarding phenotypes between the two exposure groups. In our example, psychosis scores should not differ between the user and non-user groups. This requirement comes in addition to the general measures that limit the risk of population stratification due to differences in population ancestry, which require controls matched for age, gender and ancestry.

There are also more general limitations in gene–environment interaction studies that are not related to our design specifically. A major point is that these studies are based on the assumption that there is no gene–environment correlation. Although this assumption is generally made, there is increasing debate about whether it holds [16]. Twin studies suggest, for instance, that stressful life events are to some extent heritable [17]. It should also be noted that this design is only capable of investigating multiplicative interactions and therefore additive interactions will remain obscured. A final limitation is the assumption that a statistical interaction also reflects a biological interaction. There is substantial debate about whether these assumptions are correct [9]. Overall, our study demonstrates the potential usefulness of selective sampling in the study of the interplay between genes and environment, which is an essential step towards understanding the biological background of many, if not all, complex diseases.


    Appendix
The regression model can be expressed as:


Formula

Yij is the continuously distributed outcome; Xij is a continuous covariate reflecting the environmental exposure and Gi the group (whether or not at risk). The εij is a stochastic error term, assumed to be normally distributed with variance σ².

The power to measure interaction is therefore determined by the power to refute the hypothesis

H0: β3 = 0 vs H1: β3 ≠ 0.

We defined m as the ratio between the two sample sizes (m = N0/N1). If we assume an equal distribution of the environmental exposure in the risk and non-risk groups:


Formula

In that case, the estimate of β3 is normally distributed with mean β3 and variance:


Formula
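For orientation, the variance that standard least-squares theory would give for the difference between two independently estimated slopes is sketched below. This is an assumption-based reconstruction and not necessarily the exact expression of the original equation: σ_X² (the exposure variance, taken as equal in both groups) is a symbol introduced here, and large-sample behaviour is assumed.

```latex
% Assumed: independent groups, equal exposure variance \sigma_X^2,
% residual variance \sigma^2, and m = N_0 / N_1.
\operatorname{Var}(\hat{\beta}_3)
  \approx \frac{\sigma^2}{N_1 \sigma_X^2} + \frac{\sigma^2}{N_0 \sigma_X^2}
  = \frac{\sigma^2 (1 + m)}{m \, N_1 \, \sigma_X^2}
```

Under this form, any increase in the exposure variance σ_X², such as that produced by sampling from the extremes, lowers the variance of the estimate and therefore the number of subjects required.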

Using this distribution, we can now derive the formula for the number of subjects required to test the null hypothesis with a set α of 0.05 and a power of 0.8 for groups with equal sample size:


Formula

And for groups with unequal sample size:


Formula

The power for groups with equal sample size is then defined by:


Formula

in which T2n–4 signifies a cumulative t distribution with 2n – 4 degrees of freedom. In groups with unequal sample size the power is defined by:


Formula


    Acknowledgement
We are grateful to Bobby Koeleman and Carolien van Baalen for advice.

Conflict of interest: None declared.


    References
1 Permutt MA, Wasson J, Cox N. Genetic epidemiology of diabetes. J Clin Invest (2005) 115:1431–39.

2 Caspi A, Moffitt TE, Cannon M, McClay J, Murray R, Harrington H, et al. Moderation of the effect of adolescent-onset cannabis use on adult psychosis by a functional polymorphism in the catechol-O-methyltransferase gene: longitudinal evidence of a gene X environment interaction. Biol Psychiatry (2005) 57:1117–27.

3 Luan JA, Wong MY, Day NE, Wareham NJ. Sample size determination for studies of gene-environment interaction. Int J Epidemiol (2001) 30:1035–40.

4 Hulshoff Pol HE, Brans RG, van Haren NE, et al. Gray and white matter volume abnormalities in monozygotic and same-gender dizygotic twins discordant for schizophrenia. Biol Psychiatry (2004) 55:126–30.

5 Abecasis GR, Cookson WO, Cardon LR. The power to detect linkage disequilibrium with quantitative traits in selected samples. Am J Hum Genet (2001) 68:1463–74.

6 Camp NJ, Bansal A. The effect of selective sampling on mapping quantitative trait loci. Genet Epidemiol (1997) 14:767–72.

7 Risch NJ, Zhang H. Mapping quantitative trait loci with extreme discordant sib pairs: sampling considerations. Am J Hum Genet (1996) 58:836–43.

8 R: A Language and Environment for Statistical Computing. (2006) Vienna, Austria: R Development Core Team, R Foundation for Statistical Computing.

9 Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet (2001) 358:1356–60.

10 Moffitt TE, Caspi A, Rutter M. Strategy for investigating interactions between measured genes and measured environments. Arch Gen Psychiatry (2005) 62:473–81.

11 Shifman S, Bronstein M, Sternfeld M, et al. A highly significant association between a COMT haplotype and schizophrenia. Am J Hum Genet (2002) 71:1296–302.

12 Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ. Exploiting gene-environment interaction to detect genetic associations. Hum Hered (2007) 63:111–19.

13 Yang Q, Khoury MJ, Flanders WD. Sample size requirements in case-only designs to detect gene-environment interaction. Am J Epidemiol (1997) 146:713–20.

14 Garcia-Closas M, Rothman N, Lubin J. Misclassification in case-control studies of gene-environment interactions: assessment of bias and sample size. Cancer Epidemiol Biomarkers Prev (1999) 8:1043–50.

15 Wong MY, Day NE, Luan JA, Wareham NJ. Estimation of magnitude in gene-environment interactions in the presence of measurement error. Stat Med (2004) 23:987–98.

16 Jaffee SR, Price TS. Gene-environment correlations: a review of the evidence and implications for prevention of mental illness. Mol Psychiatry (2007) 12:432–42.

17 Bolinskey PK, Neale MC, Jacobson KC, Prescott CA, Kendler KS. Sources of individual differences in stressful life event exposure in male and female twins. Twin Res (2004) 7:33–38.



