Ethics
Our research complies with all relevant ethical regulations. All studies included in this research were approved by the relevant board or committee. UKBB has approval from the North West Multicentre Research Ethics Committee (REC reference 13/NW/0157) as a Research Tissue Bank (RTB) approval, and informed consent was provided by each participant. This approval means that researchers do not require separate ethical clearance and can operate under RTB approval. This RTB approval was granted initially in 2011 and is renewed every 5 years; hence, UKBB successfully renewed approval in 2016 and 2021. All work in UKBB reported in this manuscript was performed under UKBB application numbers 20361 and 52293. The collection of participant information adhered to the AoU Research Program Operational Protocol (https://allofus.nih.gov/article/all-us-research-program-protocol). The AoU Institutional Review Board (IRB) (https://allofus.nih.gov/about/who-we-are/institutional-review-board-irb-of-all-of-us) is charged with reviewing the protocol, informed consent and other participant-facing materials for the AoU Research Program. The IRB follows the regulations and guidance of the Office for Human Research Protections for all studies, ensuring that the rights and welfare of research participants are overseen and protected uniformly. For ALSPAC, ethical approval for the study was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees.
UKBB WGS data processing
The WGS of UKBB participants is described in detail in ref. 7. In brief, 490,640 UKBB participants were sequenced to an average depth of 32.5× using Illumina NovaSeq 6000 platform. Variants were jointly called using Graphtyper52, which resulted in 1,037,556,156 and 101,188,713 high-quality (AAscore <0.5 and <5 duplicate inconsistencies) single-nucleotide polymorphisms (SNPs) and indels, respectively.
We further processed the jointly called genotype data in Hail v.0.253, where multi-allelic sites were first split and normalized. Variants were then filtered based on low allelic balance (ABHet <0.175, ABHom <0.9), low quality-by-depth (QD) normalized score (QD < 6), low phred-scaled quality score (QUAL < 10) and high missingness (call rate <90%). For the analysis in the European-ancestry cohort (see below), we further removed variants that failed test for Hardy–Weinberg equilibrium (P < 1 × 10−100) within this cohort.
Variants were annotated using Ensembl variant effect predictor (VEP)54 v.108.2 with the LOFTEE plugin55. Combined annotation-dependent depletion (CADD) annotations were based on precomputed CADD56 v.1.7 annotations for all SNPs and gnomAD v.4 indels. REVEL15 annotations were obtained from the 3 May 2021 release of precomputed REVEL scores for all SNPs. We prioritized the individual consequence for each variant based on severity, which was defined by VEP. The PTV category is the combination of stop-gained, frameshift, splice acceptor and splice donor variants. The missense and synonymous variants were adopted directly from VEP. Only the variants on autosomes and chromosome X, which were within ENSEMBL protein-coding transcripts, were included in our downstream analysis.
Demographics of the study population are presented in Supplementary Table 16.
European ancestry definition in UKBB WGS
We defined a European-ancestry cohort as people who most resembled the NFE (non-Finnish European) population as labeled in the gnomAD v.3.1 dataset55. This NFE group was one of nine ancestry groups labeled in gnomAD, which was based on Human Genome Diversity Project and 1000 Genomes Project samples. Variant loadings for 76,399 high-quality informative variants from gnomAD were used to project the first 16 principal components onto all UKBB WGS samples. A random forest classifier trained on the nine ancestry labels in gnomAD was then used to calculate probabilities that reflect the similarity between the UKBB participant and each of the gnomAD ancestry labels.
Phenotype preparation in UKBB
Binary outcomes were prepared using a combination of hospital episode statistics (UKBB showcase IDs: 41202, 41204, 41200, 41210) primary care records (UKBB showcase IDs: 42040), death certificates (UKBB showcase IDs: 40001, 40002) and self-reported medical conditions (UKBB showcase ID: 20002). Qualifying codes pertaining to each condition are listed in Supplementary Table 10. Any participant with a qualifying code was considered a case, those without a qualifying code were considered controls. For T2D, qualifying terms included codes specifying diagnoses of noninsulin-dependent diabetes, T2D, and insulin-treated T2D. Participants who self-reported a history of T2D were also classified as cases. For CKD, diagnostic codes included those specifying chronic renal failure, chronic renal impairment, CKD, end-stage renal failure, hypertensive renal disease with renal failure, or codes indicative of preparation or receipt of renal replacement therapy. Participants who self-reported renal/kidney failure, dialysis or procedures to prepare for peritoneal or hemodialysis were specified as cases. For T2D and CKD phenotype definitions, all participants who did not meet the qualifying terms were classified as controls (see Supplementary Table 10 for a full list of qualifying codes). Thinness was defined as having the lowest 5% of BMI. Metabolic dysfunction-associated steatotic liver disease required the presence of steatosis and a qualifying metabolic risk factor, namely obesity, T2D or other metabolic dysregulation57. Steatosis was defined using the fatty liver index, a composite measurement of triglycerides, glutamyl-transferase, waist circumference and BMI that ranges from 0 to 100 (ref. 58). Specific fatty liver index cut-offs according to participant sex, BMI and waist circumference as described in ref. 59 were applied to define presence of steatosis.
Genome-wide gene-burden testing in the UKBB
BOLT-LMM60 v.2.4.1 was used as our primary analytical software to conduct gene-burden tests.
To run BOLT-LMM, we first derived a set of genotypes consisting of common (MAF >0.01) linkage disequilibrium (LD)-pruned (LD r2 < 0.1) variants in people with WGS data to build the null model. Pruning was conducted using PLINK261 on a random subset of 50,000 individuals (options in effect: –maf 0.01 –thin-indiv-count 50,000 –indep-pairwise 1,000 kb 0.1).
We adopted the same strategies used in our previous analyses using WES data9,11. We generate the dummy genotype files in which each gene-mask combination was represented by a single variant, which were required as the genotype input for BOLT-LMM. We then coded people with a qualifying variant within a gene as heterozygous, regardless of the total number of variants they carried in that gene. We then created the dummy genotypes for the MAF <0.1% high-confidence PTVs as defined by LOFTEE, missense variants with REVEL >0.5 and missense variants with REVEL >0.7. After getting all required inputs, BOLT-LMM was used to analyze BMI and T2D using default parameters except for the inclusion of the ‘lmmInfOnly’ flag. The covariates included in our analysis are age, age2, sex, age × sex, the first 20 principal components as calculated from all WGS samples and the WGS-released batch (Vanguard project, Sanger: 49,932, Sanger: 193,075, deCode: 247,504). Different from our previous studies, we included all samples without restricting their ancestries to maximize the sample size. Only people who withdrew consent or had missing phenotypes and covariates were excluded; filtering resulted in 481,137 and 489,941 samples remaining for BMI and T2D, respectively.
To identify single variants driving a given association within a single gene, we performed a leave-one-out analysis for all identified genes using a generalized linear model (GLM) in R v.4.0.2 by dropping the variants contained in the gene-mask combination one at a time.
To test whether our significant burden test results are independent of common-variant GWAS associations, we generated polygenic risk scores for each trait and included these as covariates in our linear mixed model. Independent genome-wide significant (P < 5 × 10−8) variants from existing single-variant GWAS summary statistics for each trait were first identified using GCTA-COJO62. Polygenic risk scores in each UKBB participant were then calculated as the weighted sum of the person’s genotypes across the significant single variants, where weights were derived from the variants’ beta coefficients in the corresponding GWAS. This score was then included as an additional covariate in the burden analysis as implemented in BOLT-LMM described above. As BOLT-LMM use a linear mixed model, we estimated and reported the OR using the generalized linear model in R v.4.0.2 for all T2D-associated genes.
In an additional analysis designed to exclude that our new, replicated rare variant associations were the result of confounding by LD with common variants we interrogated marker level results from WGS-analyses of BMI and T2D. Regional common variants that could conceivably be driving the rare variant associations (MAF >0.001, P < 6.15 × 10−7, ± 500 kb from index gene) were extracted and clumped (r2 < 0.001) to identify approximately independent variants, which were then included as covariates in a generalized linear model with the cognate gene-burden mask as the predictor variable of interest. As in our discovery analysis, age, age2, sex, age × sex and the first 20 principal components as calculated from all WGS samples, and the WGS-released batch were included as covariates.
Replication in AoU study
Participants analyzed in this study were selected from the AoU Research Program cohort36. The collection of participant information adhered to the AoU Research Program Operational Protocol (https://allofus.nih.gov/article/all-us-research-program-protocol). Detailed methodologies regarding genotyping, ancestry classification, quality control measures and the methodology for excluding related participants are thoroughly documented in the AoU Research Program Genomic Research Data Quality Report (https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report).
We conducted our analysis on short-read WGS data (v.7.1) subsetted to the protein-coding exome, focusing on two phenotypes: BMI and T2D. The analysis encompassed 219,015 unrelated people, including 112,526 of European ancestry, 46,414 of African/African American ancestry, 34,865 of American Admixed/Latino and 25,210 various other ancestries (see Supplementary Table 3 for detailed sample size information). Ancestry assignment was conducted centrally by AoU. Briefly, a random forest classifier was trained on data from the Human Genetic Diversity Project and 1000 Genomes Project. This classifier was then applied to the AoU data. Further information is available from the AoU (https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report-ARCHIVED-C2022Q4R9-CDR-v7).
BMI data were derived from the ‘body mass index (BMI) [Ratio]’ metric (Concept Id 3038553) within the ‘Labs and Measurements’ domain. BMI values <10 or >100 were excluded and the earliest remaining value recorded and corresponding age was used. The ‘Type 2 diabetes mellitus’ identifier (Concept Id 201826, https://databrowser.researchallofus.org/ehr/conditions/201826) in the ‘Conditions’ domain facilitated the identification of T2D cases, and the age corresponding to the earliest diagnosis of T2D was used. The participants’ ages were calculated by subtracting the birth year from the timestamp of the earliest record. Among these people, 32,462 were identified as T2D cases, and 186,553 served as controls. Only people aged over 18 years were included in the analyses. Only a small proportion of episodes that indicated a diagnosis of T2D had a contemporaneous BMI measurement. As such, to adjust T2D for BMI, we used two approaches: the median BMI value recorded was included as a covariate in the model or the BMI record closest to T2D was used. Population demographics by ancestry are described in Supplementary Table 17.
Gene-based burden tests were applied to variants with MAF <0.001 that met prespecified bioinformatic criteria and were in selected genes (for example, those significant in UKBB discovery). Note that, due to different population composition, variant MAF will differ between AoU and UKBB. Burden tests were conducted using STAAR (variant-set test for association using annotation information)63 implemented in STAARpipeline64 (R package v.0.9.7), with covariates adjustments for age, age2, sex, age × sex, and the first 16 principal components. The criteria for gene-burden masks followed the methodology of the main UKB analyses.
Power calculations
To estimate statistical power for replication in the AoU study, we first corrected effect estimates in the discovery analysis for winners’ curse using the bootstrap method65 implemented in the winner’s curse package in R (https://amandaforde.github.io/winnerscurse/). For T2D, the resulting effect estimates (betas from a linear mixed model) were transformed to odds ratios66 (https://shiny.cnsgenomics.com/LMOR/). Power calculations using the relevant winners’ curse corrected effect estimates were then conducted in genpwr (https://cran.r-project.org/web/packages/genpwr/vignettes/vignette.html).
UKBB WES processing
To quantify the gain from WGS versus WES in UKBB, we compared variant counts between our WGS data with those from the 450,000 original quality functional equivalence release of the UKBB WES data (454,756 participants total). We processed multisample pVCFs using Hail53 v.0.2, where multi-allelic sites were first split and normalized. Sites were then excluded if they failed the following quality metrics: for SNPs, ABHet <0.175, QD <2, QUAL <30, SOR >30, FS >60, MQ <40, MQRankSum <−12.5 and ReadPosRankSum <−8; for indels: ABHet <0.175, QD <2, QUAL <30, FS >200 and ReadPosRankSum <−20, resulting in 23,273,514 variants available for analysis. People with high heterozygosity rates, discordant WES genotypes compared to array and discordant reported versus genetic sex were removed, resulting in 453,931 participants. Variants were annotated using the identical VEP pipeline, LOFTEE, CADD and REVEL annotations as described for WGS.
PheWAS of identified BMI-associated and T2D-associated genes in UKBB
We ran association tests between each identified genes carriers and a list of representative phenotypes (full list can be found in Supplementary Tables 10 and 11) available in the UKBB using R v.4.0.2 including the same covariates we used in our genome-wide gene-burden tests. We also extracted the phenotypic associations with P < 0.05 for all genes we identified in our analysis from AstraZeneca PheWAS Portal67 (version: UKBB 470 K WES v.5; Supplementary Tables 18 and 19).
BMI and T2D GWAS lookup
Identified genes were queried for proximal BMI and T2D GWAS signals, using data from the largest published GWAS meta-analyses. For BMI, we used data from the GIANT consortium68, which includes data on up to 806,834 individuals. For T2D, we used data from the DIAGRAM consortium69, which included up to 428,452 T2D cases and 2,107,149 controls.
For each of these GWAS, we performed signal selection and prioritized causal GWAS genes using the ‘GWAS to Genes’ pipeline as described elsewhere18. The genes identified previously were annotated if their start or end sites were within 500 kb up- or downstream of GWAS signals in the two meta-analyses, using the National Center for Biotechnology Information RefSeq gene map for GRCh37, and overlayed with further supporting functional dataset information. For further details about the specific application of this method, see ref. 18.
Assessment for severe insulin resistance in carriers of IRS2 PTVs in a UK birth cohort
ALSPAC is a prospective birth cohort from the southwest of England that recruited >75% of all pregnancies delivered in the Greater Bristol area between 1990 and 1992 (refs. 70,71,72,73). The study has currently enrolled 14,833 unique women (G0 mothers), 3,807 G0 partners and 14,901 children. Full details of the cohort and study design are available at http://www.alspac.bris.ac.uk. Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/our-data/). Exome sequencing data from 8,605 children and 3,389 of their parents was interrogated for carriers of any high-confidence protein truncating variants in IRS2 as defined by LOFTEE with MAF <1%. Two such carriers were identified (Supplementary Table 14), and insulin and glucose measurements extracted from available data and insulin levels were compared to population specific reference ranges (2.5th and 97.5th centile) and commonly used clinical cut-offs for severe insulin resistance (<150 pmol l−1). Insulin was measured using either an ELISA (Mercodia) or an ECLIA (Roche)74. Details of exome sequencing, quality control, variant calling and annotation have been described in https://wellcomeopenresearch.org/articles/9-390/v1.
Lookup of effects of IRS2 PTVs on CKD-related traits in the AoU cohort
To provide supporting evidence of an effect of loss-of-function variants in IRS2 on CKD, we leveraged the results of a recent PheWAS conducted in the AoU cohort, publicly accessible using the ‘All by All web browser’ (https://allbyall.researchallofus.org/). The relevant gene page is available directly at https://allbyall.researchallofus.org/app?state=%7B%22regionId%22%3Anull%2C%22geneId%22%3A%22ENSG00000185950%22%2C%22resultIndex%22%3A%22gene-phewas%22%2C%22resultLayout%22%3A%22full%22%2C%22analysisId%22%3A%223027114%22%2C%22variantId%22%3Anull%2C%22burdenSet%22%3A%22pLoF%22%2C%22ancestryGroup%22%3A%22meta%22%2C%22phewasOpts%22%3Atrue%2C%22selectedContig%22%3A%22all%22%2C%22hideGeneOpts%22%3Afalse%7D.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Source link