General questions

What can we do with these genetic score models? These genetic score models can be used to predict levels of biomolecular traits in genotyped cohorts. The predicted levels can then be tested for association with complex phenotypes, which offers a useful tool for investigating the molecular underpinnings of these phenotypes. The predicted levels also allow integrative analyses with other biomolecular traits available in the cohort.
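
As a rough illustration of what "predicting levels" means, a linear genetic score is a weighted sum of a sample's allele dosages. The sketch below assumes a model that maps variant IDs to effect weights; that layout, the variant IDs, and the function name are hypothetical and chosen only for illustration, so check the downloaded model files for their actual columns.

```python
# Minimal sketch: a linear genetic score as a weighted sum of allele dosages.
# The model layout (variant ID -> effect weight) is an assumption made for
# illustration, not the documented format of the model files.

def genetic_score(weights, dosages, intercept=0.0):
    """Return intercept + sum(weight * dosage) over the variants in the model.

    weights: dict mapping variant ID to effect weight (hypothetical layout).
    dosages: dict mapping variant ID to effect-allele dosage (0 to 2) for one sample.
    Variants absent from the sample's genotypes are skipped.
    """
    score = intercept
    for variant_id, weight in weights.items():
        if variant_id in dosages:
            score += weight * dosages[variant_id]
    return score


# Hypothetical model and genotypes for a single sample.
model = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.1}
sample = {"rs1": 2, "rs2": 1}  # rs3 is not genotyped in this sample

print(genetic_score(model, sample))
```

In practice the effect allele of each variant has to match the allele counted in the genotype data before the weights are applied.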

How can I download model files of these genetic scores? You can find a download link (named "Download model files") on the Score page of each platform. Following the link takes you to a cloud drive page that hosts the model files of all the traits considered for that platform. The "Download" button at the top-right corner of the page lets you download all the model files in bulk. Alternatively, you can select the model file of a single trait you are interested in and download it in the same way. Please note that model files of gene expression traits can only be downloaded in bulk, as we have compressed them into a single file.


Genetic score development

What method was used for genetic score development and why? Bayesian ridge (BR), a machine learning method that works on individual-level genotype data, was used to construct genetic scores of biomolecular traits in the Atlas. BR was selected based on the results of one of our previous studies, which benchmarked a variety of representative genetic scoring methods on numerous continuous molecular traits and showed that BR was the top-performing method in terms of both efficacy and efficiency.

How were the genetic variants (i.e. SNPs) selected before being fed to the genetic scoring method? To ensure the generalizability of the genetic score models when applied to other cohorts, a variant filtering step was first performed for all the traits considered, which applied a MAF threshold of 0.5% and excluded all multi-allelic variants as well as ambiguous variants (i.e. A/T, G/C). An LD thinning step was then carried out at an r2 threshold of 0.8 on all the remaining variants, to remove a certain level of LD dependency among variants and reduce the computational burden of the genetic scoring method. Finally, for each trait, the remaining variants were filtered at the genome-wide significance threshold of 5e-8, based on GWAS summary statistics computed on the INTERVAL training samples.
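
The per-variant filters described above can be sketched as follows. This is only an illustration: the record layout (a dict with "ref", "alt", "maf" and "pvalue" keys) and the function names are hypothetical, the sketch assumes variants with MAF below the threshold are the ones dropped, and the LD thinning step is not shown because it needs genotype data and is normally done with a dedicated tool.

```python
# Sketch of the per-variant filters. The record layout is hypothetical and
# LD thinning (r2 threshold of 0.8) is intentionally left out.

MAF_THRESHOLD = 0.005     # 0.5%
GWAS_P_THRESHOLD = 5e-8   # genome-wide significance
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}  # A/T and G/C


def passes_basic_filters(variant):
    """Apply the MAF, multi-allelic and ambiguous-variant filters."""
    if variant["maf"] < MAF_THRESHOLD:  # assumption: rarer variants are dropped
        return False
    if len(variant["alt"]) != 1:        # more than one alternate allele
        return False
    alleles = frozenset((variant["ref"], variant["alt"][0]))
    return alleles not in AMBIGUOUS_PAIRS


def is_genome_wide_significant(variant):
    """Keep variants whose GWAS p-value is below 5e-8 for the trait."""
    return variant["pvalue"] < GWAS_P_THRESHOLD


# Hypothetical variant records for one trait.
variants = [
    {"id": "v1", "ref": "A", "alt": ["G"], "maf": 0.20, "pvalue": 1e-12},
    {"id": "v2", "ref": "A", "alt": ["T"], "maf": 0.30, "pvalue": 1e-20},      # ambiguous
    {"id": "v3", "ref": "C", "alt": ["T", "G"], "maf": 0.10, "pvalue": 1e-9},  # multi-allelic
    {"id": "v4", "ref": "G", "alt": ["A"], "maf": 0.001, "pvalue": 1e-10},     # below MAF threshold
    {"id": "v5", "ref": "C", "alt": ["T"], "maf": 0.05, "pvalue": 1e-3},       # not significant
]

kept = [v["id"] for v in variants
        if passes_basic_filters(v) and is_genome_wide_significant(v)]
print(kept)
```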

How were traits selected for genetic score development on each platform? We selected traits that have at least one genetic variant with a p-value < 5e-8 in their GWAS (based on the INTERVAL training samples), so that the genetic scoring method had variants to run on.


Genetic score validation

How was the internal validation done? The INTERVAL training samples of a trait were randomly partitioned into five equal portions. Any four portions were used to learn a genetic score model of the trait with Bayesian ridge regression, and the model’s performance was then tested on the remaining portion (20% of the INTERVAL training samples) by calculating the r2 score and Spearman correlation coefficient between the predicted genetic scores and the actual levels of the trait for these samples.
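
The five-way partition can be sketched with the standard library alone. The function name, the fixed random seed and the sample IDs below are assumptions for illustration; model fitting and scoring are left as a comment, since they depend on the genotype data and the Bayesian ridge implementation used.

```python
import random


def five_fold_splits(sample_ids, seed=0):
    """Yield (train_ids, test_ids) pairs for five-fold cross-validation.

    Samples are shuffled once, split into five portions whose sizes differ
    by at most one, and each portion serves as the test set exactly once.
    The seed is a hypothetical choice made only for reproducibility.
    """
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::5] for i in range(5)]
    for k in range(5):
        test_ids = folds[k]
        train_ids = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train_ids, test_ids


samples = [f"sample{i}" for i in range(10)]  # hypothetical sample IDs
for train_ids, test_ids in five_fold_splits(samples):
    # Fit the genetic score model on train_ids, then compare its
    # predictions with the measured trait levels of test_ids.
    print(len(train_ids), len(test_ids))
```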

How was the external validation done? The genetic score model trained with the INTERVAL training samples for a trait was used to calculate genetic scores for the validation samples (external cohorts or withheld INTERVAL samples). The r2 score and Spearman correlation coefficient were then calculated between the predicted scores of these samples and their actual trait levels.
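
The two performance measures can be sketched in plain Python as below. The sketch assumes that "r2 score" means the coefficient of determination (1 minus the ratio of residual to total sum of squares); if the squared Pearson correlation is meant instead, the value will generally differ. Function names and the example numbers are hypothetical.

```python
def coefficient_of_determination(actual, predicted):
    """r2 score, assumed here to be 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(actual, predicted):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Hypothetical measured trait levels and predicted genetic scores.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.0, 3.5, 3.0]
print(coefficient_of_determination(actual, predicted), spearman(actual, predicted))
```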