

4 PCA Plots

4.1 Input data format

Genesis takes as input one mandatory file, the PCA data file, and one optional file, the phenotype file.

We provide scripts that convert other popular PCA formats (PLINK, flashpca) to a format that Genesis understands. These scripts are discussed in the section Advice on data formats below. We hope that future versions of Genesis will handle these formats natively.

4.2 Data input

To input Eigenstrat files, click File, then click New PCA, or click the New PCA button on the toolbar. On the screen that opens, click Import Data File and navigate to the PCA data file produced by the Eigenstrat software. Then, optionally, click Import Phenotype File and navigate to the phenotype data file.

To input SNPRelate data, click File, then click New PCA, or click the New PCA button on the toolbar. On the screen that opens, click Import Data File and navigate to the PCA data file produced by the SNPRelate package. The SNPRelate file includes the phenotype information in the data file itself.

In the drop-down menus, select the two or three principal components to plot as the axes, and select the column of the phenotype file that will be used to group the data. To draw the graph, click Finish, or click Next to access the Appearance Options menu (See below...).

4.3 Annotating the Graph

4.4 Hiding Subjects and Population Groups from the Graph

4.5 Searching for a Subject

To search for a subject in the graph by name, click the Search for individual button in the toolbar. In the dialog, enter the Name (first, last or both) of the individual you wish to find and click Ok. If the individual was found in the data, it will be selected and the subject dialog for that individual will open. If the individual was not found, a message will display.

4.6 Rotating the Graph

To rotate a 3D PCA plot, click the Show/Hide 3D PCA Rotate Panel button in the toolbar. This opens the rotate panel, which contains a slider that can be dragged to rotate the graph about the z-axis.
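
As an illustration of what a rotation about the z-axis does to the plotted points, here is a minimal Python sketch. It is not Genesis code: the function name rotate_about_z is hypothetical, and which principal component Genesis maps to the z-axis is not stated here, so the third coordinate is simply treated as z.

```python
import math

def rotate_about_z(points, degrees):
    """Rotate (x, y, z) points about the z-axis by the given angle in degrees."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    # x and y are mixed by the standard 2D rotation; z is left unchanged.
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

# Usage: a quarter turn moves a point on the x-axis onto the y-axis.
rotated = rotate_about_z([(1.0, 0.0, 5.0)], 90)
```

Only the first two coordinates change, which is why a single slider is enough to control this rotation.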

4.7 Advice on data formats

Eigenstrat is directly supported by Genesis.

4.7.1 SNPRelate

The SNPRelate R package of Zheng et al. [2012] can be used to perform principal component analysis. Because it is an R package, its output is fully programmable in R and there is no default SNPRelate file format. We support the following output: a file that contains the eigenvalues, followed by the eigenvectors, produced using the following R commands.

# Run the PCA; genofile and snpset must already be defined.
pca <- snpgdsPCA(genofile,snp.id=snpset)
# Write the eigenvalues first.
write.table(pca$eigenval,"pca.rel",sep="\t",quote=FALSE)

# Build a table of sample ID, population label and the first ten eigenvectors.
# pop_code and sample.id must already be defined.
tab1 <- data.frame(sample.id = pca$sample.id,
     pop = factor(pop_code)[match(pca$sample.id, sample.id)],
     EV1 = pca$eigenvect[,1],
     EV2 = pca$eigenvect[,2],
     EV3 = pca$eigenvect[,3],
     EV4 = pca$eigenvect[,4],
     EV5 = pca$eigenvect[,5],
     EV6 = pca$eigenvect[,6],
     EV7 = pca$eigenvect[,7],
     EV8 = pca$eigenvect[,8],
     EV9 = pca$eigenvect[,9],
     EV10 = pca$eigenvect[,10],
     stringsAsFactors = FALSE)

# Append the eigenvector table to the same file.
write.table(tab1,"pca.rel",sep="\t",quote=FALSE,append=TRUE)
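
To make the resulting file layout concrete, the following Python sketch splits such a file back into its two parts. It is not part of Genesis, and the function names parse_pca_rel and to_float are hypothetical. The layout it assumes (a one-column header line, row-numbered eigenvalues, then a tab-separated table whose header starts with sample.id and whose data rows carry a leading R row name) is what R's write.table writes for the commands above when row and column names are left at their defaults.

```python
import io

def to_float(text):
    """Convert an R-printed number; R writes missing values as NA."""
    return float("nan") if text == "NA" else float(text)

def parse_pca_rel(stream):
    """Split a pca.rel-style stream into (eigenvalues, sample rows)."""
    eigenvalues = []
    samples = []
    header = None
    for raw in stream:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if header is None:
            if fields[0] == "sample.id":
                # Header of the second (eigenvector) table.
                header = fields
            elif len(fields) == 2:
                # Eigenvalue row: R row name, then the value.
                eigenvalues.append(to_float(fields[1]))
            # Otherwise: the one-column header of the first table; skip it.
            continue
        # Eigenvector row: drop the leading R row name, then pair with the header.
        samples.append(dict(zip(header, fields[1:])))
    return eigenvalues, samples

# Usage with a tiny made-up file (two eigenvalues, two samples).
example = (
    "x\n1\t5.5\n2\t3.25\n"
    "sample.id\tpop\tEV1\tEV2\n"
    "1\tS1\tCEU\t0.1\t-0.2\n"
    "2\tS2\tYRI\t-0.3\t0.4\n"
)
eigenvalues, samples = parse_pca_rel(io.StringIO(example))
```

Note that the population label travels in the pop column of the second table, which is why no separate phenotype file is needed for this format.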

4.7.2 flashpca

FlashPCA is designed to perform PCA on very large data sets. It takes as input a PLINK BED and BIM file and produces eigenvectors or principal components. We provide a script, flashpca2evec, which converts the data into a format that Genesis can read. Because the flashpca output has no information about the sample IDs, flashpca2evec also needs the fam file as input. This script requires Python 2.7.

By default, flashpca calls its output files eigenvalues.txt and eigenvectors.txt, and this is (by default) what flashpca2evec expects. For example:

flashpca2evec --fam data.fam --out data.evec

However, if the files have other names, the appropriate flags can be used:

flashpca2evec --fam data.fam --eigenval file1.evals --eigenvec sample.csv --out data.evec
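
The reason the fam file is needed can be seen in a small Python sketch of this kind of conversion. This is not the source of flashpca2evec: the function names fam_ids and to_evec_lines are hypothetical, and the output layout (a "#eigvals:" header line, then one line per sample with an FID:IID identifier, the components, and a trailing label taken from the sixth fam column) is an assumption modelled on Eigenstrat-style .evec files. The sketch also assumes that the eigenvector rows are in the same order as the fam file.

```python
def fam_ids(fam_lines):
    """Return (family ID, individual ID, phenotype) for each fam line."""
    ids = []
    for line in fam_lines:
        fields = line.split()
        if fields:
            # PLINK fam columns: FID IID father mother sex phenotype.
            ids.append((fields[0], fields[1], fields[5]))
    return ids

def to_evec_lines(fam_lines, eigenvalue_lines, eigenvector_lines):
    """Attach sample IDs to ID-less eigenvector rows (assumed .evec layout)."""
    ids = fam_ids(fam_lines)
    vectors = [line.split() for line in eigenvector_lines if line.strip()]
    if len(ids) != len(vectors):
        raise ValueError("fam file and eigenvector file differ in sample count")
    eigenvalues = [line.strip() for line in eigenvalue_lines if line.strip()]
    out = ["#eigvals: " + " ".join(eigenvalues)]
    for (fid, iid, pheno), vec in zip(ids, vectors):
        out.append(" ".join([fid + ":" + iid] + vec + [pheno]))
    return out

# Usage with made-up data for two samples and two components.
lines = to_evec_lines(
    ["F1 A 0 0 1 2", "F2 B 0 0 2 1"],
    ["4.0", "2.0"],
    ["0.1 0.2", "-0.3 0.4"],
)
```

The sample-count check matters because the rows are matched purely by position; a mismatched fam file would otherwise silently mislabel every sample.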

4.7.3 plink2evec

PLINK 2 [Purcell and Chang 2014] (and its alpha release, PLINK 1.9) supports PCA directly. Genesis can read these files natively, but assumes that PLINK's default naming convention is used (e.g., a .eigenvec suffix). If the convention is not followed, Genesis will not be able to recognise the file type; plink2evec is bundled for that case.

plink2evec converts the plink output files into the format that Genesis can read.

By default, PLINK calls its output files plink.eigenval and plink.eigenvec, and this is (by default) what plink2evec expects. For example:

plink2evec --out result.pca.evec

However, if the files have other names, the appropriate flags can be used:

plink2evec --eigenval file1.evals --eigenvec sample.csv --out data.evec

If, as is common in PLINK usage, the eigenvector and eigenvalue files were named using the PLINK --out flag, then plink2evec's --bfile flag can be used:

plink --bfile sample --pca --out sample
plink2evec --bfile sample --out sample.pca.evec
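
For readers who want to inspect a PLINK eigenvector file before converting it, the Python sketch below reads one into memory. It is not the source of plink2evec, and the function name read_plink_eigenvec is hypothetical. It assumes whitespace-separated columns with the sample identifiers first (FID and IID when there is no header line) followed by the components, and it treats any header line as optional, using it only to count the identifier columns.

```python
def read_plink_eigenvec(lines):
    """Return a list of (sample ID, [components]) from .eigenvec lines."""
    samples = []
    id_columns = 2  # Assumed default when no header is present: FID IID.
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#") or fields[0] == "FID":
            # Header line: every column not named PC<n> is an identifier.
            id_columns = sum(1 for name in fields if not name.startswith("PC"))
            continue
        ids = fields[:id_columns]
        components = [float(value) for value in fields[id_columns:]]
        samples.append((":".join(ids), components))
    return samples

# Usage: a headerless file, and a file with a one-ID-column header.
plain = read_plink_eigenvec(["F1 A 0.5 -0.25", "F2 B 0.0 1.5"])
headed = read_plink_eigenvec(["#IID PC1 PC2", "A 0.5 -0.25"])
```

Joining the identifier columns with a colon is only a convenience for display here; it is not a statement about the identifiers that plink2evec writes.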