Using TrajGWAS for large-scale datasets: how to improve performance? #48

Open
parekhpravesh opened this issue Sep 5, 2024 · 3 comments

Comments

@parekhpravesh

I would like to run TrajGWAS on some large-scale longitudinal phenotypes. Specifically, I have 100,000 observations, 48 covariates (plus an intercept), and 100 phenotypes. I would also like effect size estimates, which means running a Wald test.

As an example, I ran TrajGWAS for one phenotype. I start Julia with: julia --threads 64 and then do:

trajgwas(@formula(y ~ 1 + X1 + X2 + ... + X48),   # mean model
         @formula(y ~ 1),                         # random effects
         @formula(y ~ 1 + X1 + X2 + ... + X48),   # within-subject variance model
         :id,
         path_to_csv_file,
         path_to_plink_file,
         pvalfile = p_output_name,
         nullfile = null_output_name,
         covrowinds = covrowmask,
         geneticrowinds = geneticrowmask,
         parallel = true,
         test = :wald)

I am running this as a Slurm job with --cpus-per-task=64 and --mem-per-cpu=7G, using Julia 1.10.0.

However, after about 22 hours, only about 700 SNPs have been written to the output file. This is quite slow. Are there any suggestions for making it more efficient? Perhaps I am not specifying the parallelisation correctly?
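As a first sanity check on the parallelisation question, it may help to confirm how many threads the Julia session actually sees inside the Slurm job. The sketch below uses only the Julia standard library. It reports what is available to the session and says nothing about whether TrajGWAS itself uses those threads.

```julia
using LinearAlgebra  # standard library, needed for the BLAS thread query

# Julia threads in this session (set by `julia --threads N` or JULIA_NUM_THREADS).
julia_threads = Threads.nthreads()

# Threads used by the BLAS library for dense linear algebra.
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)
println("CPU threads visible to Julia: ", Sys.CPU_THREADS)
```

If the first number is 1 despite `--threads 64`, the flag is not reaching the Julia process that runs the analysis.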

@parekhpravesh
Author

Hello, just following up on this - I tried the same settings and after 3+ days of computation, only about 3000 SNPs were written to the output file. Do you have any tips/suggestions on how the performance can be improved?

@kose-y
Member

kose-y commented Sep 19, 2024

Oh, sorry for the late response. The Wald test, giving the effect sizes, is much slower than the score test, which does not give the effect sizes. Our suggestion is first to screen the SNPs with the score test and take a subset of SNPs with low p-values, then compute the effect sizes using the Wald test only for the selected SNPs.
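The screening step suggested above could be sketched roughly as follows. This is an illustration under assumptions, not the maintainers' code: the helper `select_snp_rows` is hypothetical, the column name `taupval` and the tab-delimited layout should be checked against the header of your own score-test output, and the returned row numbers match SNP indices in the PLINK file only if every SNP was tested and written in order.

```julia
using DelimitedFiles  # standard library

# Hypothetical helper: return the 1-based row numbers of SNPs whose p-value
# in `column` falls below `threshold`.
# ASSUMPTION: the file is tab-delimited with a header row containing `column`.
function select_snp_rows(pvalfile::AbstractString;
                         column::AbstractString = "taupval",
                         threshold::Real = 5e-8)
    data, header = readdlm(pvalfile, '\t', header = true)
    col = findfirst(==(column), vec(header))
    col === nothing && error("column $column not found in $pvalfile")
    # Skip entries that did not parse as numbers; NaN never passes the comparison.
    return findall(p -> p isa Real && p < threshold, data[:, col])
end
```

The resulting indices could then be passed to a second run with `test = :wald`, for example through the `snpinds` keyword of `trajgwas` (please verify the keyword name in the TrajGWAS documentation for your installed version).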

@parekhpravesh
Author

Thank you. I ran the score test and the analysis finished in ~35 hours. Could you confirm whether I am specifying the parallelisation option correctly? Or is everything implemented as single-threaded computation, so that the option doesn't really matter?
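Independently of how threading behaves inside the package, one possible workaround is to split the SNPs into contiguous ranges and run each range as its own job (for example, a Slurm array). The sketch below only computes the ranges. The helper `snp_chunks` is hypothetical and not part of TrajGWAS, and passing a range to the analysis assumes a SNP-index keyword such as `snpinds`.

```julia
# Hypothetical helper: split SNP indices 1:nsnps into `njobs` contiguous,
# nearly equal ranges that together cover every SNP exactly once.
function snp_chunks(nsnps::Integer, njobs::Integer)
    njobs = min(njobs, nsnps)  # never create more chunks than SNPs
    bounds = round.(Int, range(0, nsnps; length = njobs + 1))
    return [(bounds[i] + 1):bounds[i + 1] for i in 1:njobs]
end

# Each array task picks its own range; SLURM_ARRAY_TASK_ID is set by Slurm
# for array jobs, and defaults to "1" here so the sketch runs anywhere.
task_id = parse(Int, get(ENV, "SLURM_ARRAY_TASK_ID", "1"))
```

Each task would then select `snp_chunks(total_snps, n_tasks)[task_id]` and write to its own output file, so that concurrent jobs do not overwrite each other's results.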
