
globalbiobankmeta/META_ANALYSIS

 
 


Global Biobank Meta-analysis Initiative meta-analysis workflow

This repository contains the workflow used for GWAS meta-analysis in the Global Biobank Meta-analysis Initiative. The scripts are adapted from those used in the COVID-19 Host Genetics Initiative GWAS meta-analysis. Computation runs as WDL workflows on Google Compute Engine. The workflows first clean and munge the input files into a common format and then run the meta-analysis.

Marker-level post-GWAS QC

Marker-level QC uses the gnomAD v3 data as a reference. In steps 1.1 and 1.2, we use only the GWAS summary statistics file with the largest sample size from each biobank by ancestry (e.g. asthma in the flagship project) to extract the lists of variants that fail QC against gnomAD. In step 1.3, these lists are used to perform post-GWAS marker-level QC on all GWAS summary statistics files from individual biobanks.

1.1 (This step is done for the asthma GWAS summary stat files, which contain the most complete list of genetic markers from each biobank per ancestry.) The WDL workflow in wdl/munge_sumstats_beforeQC.wdl and wdl/munge_sumstats_beforeQC.json is used to filter submitted SAIGE summary stat files and convert them to a unified format. INFO and AF filtering is applied to each summary stat file. Stats in build 37 are automatically lifted over to build 38. Alleles are harmonized (matching ref/alt alleles and effect direction) using gnomAD 3.0 genomes as the reference, and the fold change of each AF relative to the gnomAD AF for the population is added to the stats. Chromosomes are renamed so that e.g. "chr1" and "01" become "1" and "X" becomes "23". Base pair positions in scientific notation are converted to decimal notation. For each study, this step outputs a bgzipped, tab-delimited summary stat file and its tabix index, as well as Manhattan, QQ and AF-gnomAD_AF plots.
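The chromosome renaming and position conversion described above can be sketched as follows. This is a minimal illustration, not the code used by the workflow: the function names normalize_chrom and normalize_pos are hypothetical, and labels other than the ones named in the text (for example "Y" or "MT") are left unchanged as an assumption.

```python
from decimal import Decimal


def normalize_chrom(chrom: str) -> str:
    """Map labels such as 'chr1', '01' and 'X' to '1', '1' and '23'."""
    c = chrom.strip()
    if c.lower().startswith("chr"):
        c = c[3:]           # drop the 'chr' prefix
    c = c.lstrip("0") or c  # drop zero padding: '01' -> '1'
    # Only X -> 23 is stated in the README; other labels pass through (assumption).
    return "23" if c.upper() == "X" else c


def normalize_pos(pos: str) -> int:
    """Convert a position written as '1.5e+06' to the integer 1500000."""
    # Decimal parses scientific notation exactly, avoiding float rounding.
    return int(Decimal(pos.strip()))
```

Parsing with Decimal rather than float keeps large positions exact, which matters when positions are later used as join keys against the reference.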

1.1.1 Before running the above workflow, scripts in scripts/format can be used to format non-SAIGE summary stats to SAIGE format.

1.2 (This step is done for the asthma GWAS summary stat files, which contain the most complete list of genetic markers from each biobank per ancestry.) We compare the allele frequencies of genetic variants in individual biobanks/cohorts to those in gnomAD using the WDL workflow in wdl/munge_sumstats_beforeQC_obtain_QClist.wdl and wdl/munge_sumstats_beforeQC_obtain_QClist.json, which is based on the scripts and analysis programs in https://github.com/globalbiobankmeta/PLOTS/tree/master/plot_scripts. The workflow outputs two lists of variants: one of variants whose allele frequencies differ from gnomAD, and one of variants with strand ambiguity.
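As a rough illustration of how two such lists could be derived, the sketch below flags strand-ambiguous variants (A/T and C/G pairs) and variants whose AF differs from the gnomAD AF by more than a fold-change cutoff. The function names, the record layout and the max_fold value of 2.0 are all assumptions made for this example; the actual criteria are defined in the PLOTS repository scripts.

```python
# Allele pairs whose complement is the same pair, so strand cannot be inferred.
AMBIGUOUS_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def is_strand_ambiguous(ref: str, alt: str) -> bool:
    return (ref.upper(), alt.upper()) in AMBIGUOUS_PAIRS


def af_differs(af: float, gnomad_af: float, max_fold: float = 2.0) -> bool:
    """True if the two frequencies differ by more than max_fold in either direction.

    max_fold is a hypothetical cutoff, not the value used by the workflow.
    """
    if af <= 0 or gnomad_af <= 0:
        return True  # assumption: treat a zero frequency as a mismatch
    fold = af / gnomad_af
    return fold > max_fold or fold < 1.0 / max_fold


def build_qc_lists(variants):
    """Return (ids with differing AF, ids with strand ambiguity)."""
    af_list, ambiguous_list = [], []
    for v in variants:
        if af_differs(v["af"], v["gnomad_af"]):
            af_list.append(v["id"])
        if is_strand_ambiguous(v["ref"], v["alt"]):
            ambiguous_list.append(v["id"])
    return af_list, ambiguous_list
```

A variant can appear in both lists, since the two checks are independent.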

1.3 (This step is done for all GWAS summary stat files per biobank per ancestry, using the two lists of variants output from 1.2.) The WDL workflow in wdl/munge_sumstats.wdl and wdl/munge_sumstats.json is used to filter submitted SAIGE summary stat files and convert them to a unified format. INFO and AF filtering is applied to each summary stat file. Stats in build 37 are automatically lifted over to build 38. Alleles are harmonized (matching ref/alt alleles and effect direction) using gnomAD 3.0 genomes as the reference, and the fold change of each AF relative to the gnomAD AF for the population is added to the stats. Chromosomes are renamed so that e.g. "chr1" and "01" become "1" and "X" becomes "23". Base pair positions in scientific notation are converted to decimal notation. For each study, this step outputs a bgzipped, tab-delimited summary stat file and its tabix index, as well as Manhattan, QQ and AF-gnomAD_AF plots.
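The allele harmonization and AF fold-change annotation can be pictured with the simplified sketch below. It handles only the two basic cases (alleles already match the reference, or ref and alt are swapped) and does not attempt strand flips. The function names are hypothetical, and defining fold change as AF divided by gnomAD AF is an assumption about what the workflow computes.

```python
def harmonize(ref, alt, beta, af, gnomad_ref, gnomad_alt):
    """Align a variant to the reference alleles.

    Returns (ref, alt, beta, af) in reference orientation, or None when the
    alleles cannot be matched by a simple ref/alt swap.
    """
    if (ref, alt) == (gnomad_ref, gnomad_alt):
        return ref, alt, beta, af
    if (ref, alt) == (gnomad_alt, gnomad_ref):
        # Swapped alleles: flip the effect direction and the allele frequency.
        return gnomad_ref, gnomad_alt, -beta, 1.0 - af
    return None


def af_fold_change(af: float, gnomad_af: float) -> float:
    """Fold change of the study AF relative to the gnomAD AF (assumed definition)."""
    return af / gnomad_af if gnomad_af > 0 else float("inf")
```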

Meta-analysis

The WDL workflow in wdl/meta.wdl, wdl/meta.sub.wdl and wdl/meta.json is used to run the meta-analysis with inverse-variance weighted betas. The analysis program is in scripts/meta_analysis.py. The output is a bgzipped, tab-delimited summary stat file containing the summary stats of each individual study and, for each variant, the meta-analysis stats across all studies that have that variant. Leave-one-out meta-analysis can also be performed, producing stats from all but one study for each study in turn. Manhattan and QQ plots are also created.
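For orientation, the standard fixed-effect inverse-variance weighted estimate for a single variant can be written as below. This is a generic textbook sketch, not the code in scripts/meta_analysis.py; the function names ivw_meta and leave_one_out are hypothetical, and the two-sided p-value assumes a normal approximation.

```python
import math


def ivw_meta(betas, ses):
    """Inverse-variance weighted meta-analysis of per-study betas and standard errors."""
    weights = [1.0 / se ** 2 for se in ses]           # w_i = 1 / se_i^2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided normal p-value
    return beta, se, p


def leave_one_out(betas, ses):
    """Meta-analysis repeated with each study left out in turn."""
    return [
        ivw_meta(betas[:i] + betas[i + 1:], ses[:i] + ses[i + 1:])
        for i in range(len(betas))
    ]
```

The inputs are plain lists holding one entry per study that has the variant; leave_one_out needs at least two studies so that each reduced set is non-empty.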

Genome-wide significant loci are defined using https://github.com/globalbiobankmeta/Loci_Definition

Pipelines for other analyses

PC projection: https://github.com/globalbiobankmeta/pca_projection

PRS: https://github.com/globalbiobankmeta/PRS/blob/main/run_prscsx_pipe.md

Trans-ancestry proteome Mendelian randomization: https://github.com/globalbiobankmeta/multi-ancestry-pwmr

About

Tools for doing x way meta-analysis
