The lab’s broad research goals are twofold: (1) the development of computational methods capable of leveraging very large datasets, and (2) the analysis of large-scale genetic data to better understand human genetic history, evolution, and the genetic basis of human disease. Much of the lab’s research centers on haplotypes: series of genetic variants that exist on the same chromosome copy in an individual. Ongoing projects in the lab include studies of recombination (including meiotic non-crossover gene conversion), computational algorithms for inferring haplotypes in large datasets, local ancestry inference in large sample sets of admixed individuals, and methods for identity-by-descent detection in human subjects.

Recombination: Non-crossover gene conversion

Figure 1c from Williams et al. (2015): locations of de novo gene conversion events plotted on the genome. Blue and red arrows indicate male and female transmissions, respectively.

Non-crossover gene conversion events (also called simply “gene conversion”) are one of the key factors involved in haplotype evolution, and are estimated to occur ~10 times more frequently than crossovers. In recent work, now published and available here, we performed a genome-wide analysis of de novo gene conversion in human pedigree data and identified 107 sites affected by gene conversion events. Using these data, we estimated the rate at which a base is involved in a gene conversion event (5.9×10⁻⁶ per base pair per generation), found evidence for extreme GC bias in gene conversion (68% of sites transmit G or C alleles rather than A or T), found wide variation in tract lengths (from ≤124 bp to over 1 kb), and, strikingly, observed several examples of distinct gene conversion events that cluster in ~20-30 kb intervals.
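As an illustration of how a GC-bias figure like the one above can be computed, the following minimal Python sketch counts how often a G or C allele is transmitted at sites where one allele is G/C and the other is A/T, and compares the count to an unbiased 50% expectation with an exact binomial test. The function names and the toy input are hypothetical and are not the data or code from the published analysis.

```python
from math import comb

STRONG = {"G", "C"}  # alleles favored under GC-biased gene conversion
WEAK = {"A", "T"}


def gc_transmission_counts(events):
    """Return (gc_transmitted, informative_sites).

    `events` is a list of (transmitted_allele, untransmitted_allele) pairs.
    Only sites with one G/C allele and one A/T allele are informative.
    """
    informative = [
        (t, u) for t, u in events
        if (t in STRONG and u in WEAK) or (t in WEAK and u in STRONG)
    ]
    gc = sum(1 for t, _ in informative if t in STRONG)
    return gc, len(informative)


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


# Hypothetical toy input: the A/T site is uninformative and is skipped.
toy_events = [("G", "A"), ("C", "T"), ("A", "G"), ("G", "T"), ("A", "T")]
gc, n = gc_transmission_counts(toy_events)
print(f"{gc}/{n} informative sites transmit G or C")
print(f"two-sided p-value vs. 50%: {binomial_two_sided_p(gc, n):.3f}")
```

With real data the same calculation would be applied to all informative gene conversion sites; the toy input here is far too small to show a significant bias.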

Part of Figure 4a from Williams et al. (2015): gene conversion events shown in red cluster within relatively short intervals. Plot spans ~30 kb.

Ongoing work in the lab centers on estimating the number and tract length of non-crossover gene conversion events, as well as deep analyses of GC bias and the clustering of gene conversions using whole-genome sequence data from human pedigrees. We are also pursuing questions related to the effects of such extreme GC bias on genome evolution.

Inferring haplotype phase in large genotype datasets of unrelated individuals and trios/duos

Figure 3a from Williams et al. - Switch error rates of HAPI-UR 3x, HAPI-UR, and other methods decrease with sample size.

Haplotype data are essential to numerous studies in population and medical genetics, including phylogenetic tree inference, genotype imputation (useful in genome-wide association studies), analyses of natural selection, and many others. Most available genetic data do not provide haplotypes, so computational and statistical methods have been developed to infer haplotypes from genotypes. Haplotype inference is computationally intensive, and, at the same time, benefits from the large sample sizes now available.
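To see why genotypes alone do not determine haplotypes, the short Python sketch below enumerates every pair of haplotypes consistent with one individual's genotype. It assumes a hypothetical encoding in which each site holds the count of alternate alleles (0, 1, or 2); an individual with k heterozygous sites then has 2^(k-1) possible phasings, which is the ambiguity that phasing methods must resolve.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    `genotype` is a sequence of alternate-allele counts (0, 1, or 2).
    Homozygous sites are fixed; each heterozygous site can be phased two ways.
    """
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("genotype values must be 0, 1, or 2")

    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # 0 -> allele 0, 2 -> allele 1

    if not het_sites:
        return [(tuple(base), tuple(base))]

    pairs = []
    # Fix the first heterozygous site on haplotype 1 so that mirror-image
    # pairs (h1, h2) and (h2, h1) are not counted twice.
    for rest in product((0, 1), repeat=len(het_sites) - 1):
        hap1, hap2 = list(base), list(base)
        for site, allele in zip(het_sites, (0,) + rest):
            hap1[site] = allele
            hap2[site] = 1 - allele
        pairs.append((tuple(hap1), tuple(hap2)))
    return pairs


# Hypothetical genotype with three heterozygous sites: 2**(3-1) = 4 phasings.
for hap1, hap2 in consistent_haplotype_pairs([1, 2, 1, 0, 1]):
    print(hap1, hap2)
```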

We developed the software HAPI-UR to infer haplotype phase in large datasets of unrelated and/or trio or duo samples. This work is described in an available paper. The model is computationally efficient while capturing many of the same properties used in models that address datasets with smaller sample sizes. HAPI-UR runs more than 18 times faster than other phasing methods when analyzing ~16,000 samples, and it achieves comparable or greater accuracy. The ability to leverage and benefit from large dataset sizes will be essential in the years ahead as sample sizes continue to grow.
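Phasing accuracy is commonly summarized by the switch error rate named in this section's figure caption. The Python sketch below shows one common way to compute it for a single individual, assuming the true phase is known (for example, from trio data): among consecutive heterozygous sites, it counts how often the inferred phase flips relative to the truth. The function and the toy haplotypes are illustrative and are not taken from HAPI-UR.

```python
def switch_error_rate(true_hap, inferred_hap, genotype):
    """Fraction of consecutive heterozygous-site pairs with a phase switch.

    `true_hap` and `inferred_hap` each give the alleles (0/1) on the first
    haplotype of the same individual; `genotype` gives alternate-allele
    counts (0, 1, or 2). Only heterozygous sites carry phase information.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    if len(het_sites) < 2:
        raise ValueError("need at least two heterozygous sites")

    # True where the inferred haplotype agrees with the true one at that site.
    agrees = [true_hap[i] == inferred_hap[i] for i in het_sites]

    # A switch error occurs whenever agreement changes between neighbors.
    switches = sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
    return switches / (len(het_sites) - 1)


# Hypothetical example: all five sites are heterozygous.
genotype = [1, 1, 1, 1, 1]
truth = [0, 1, 0, 1, 1]
inferred = [0, 1, 1, 0, 1]  # phase flips after site 2 and flips back after site 4
print(switch_error_rate(truth, inferred, genotype))
```

Because only changes in agreement are counted, an inferred haplotype that is the exact complement of the truth scores zero switch errors, reflecting that the labeling of the two haplotypes is arbitrary.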

Inferring haplotype phase in family datasets

Haplotype transmissions in a nuclear family with 11 children. The father's haplotypes are on the left in 11 colored columns, and the mother's transmissions are on the right. Single columns represent individual haplotype transmissions to one child from one parent, and switches in color (from blue to red and vice versa) are recombination events.

Most methods for inferring haplotype phase in family data have runtime that scales exponentially in the number of individuals in the family. These methods cannot readily apply to very large family datasets. We developed the software HAPI for inferring haplotypes in family data, as described here. HAPI uses a novel state formulation that leverages the fact that real genetic data contain relatively few recombination events and, in so doing, obtains polynomial runtime on real genetic data. The problem of inferring haplotypes in family data has been shown to be NP-hard, but in practice the state formulation that HAPI uses enables it to merge an exponential number of states down to a small number of states for realistic inputs.
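The observation that real transmissions contain few recombination events can be made concrete with a small Python sketch. It assumes a hypothetical encoding in which each transmission from one parent to one child is a string recording, marker by marker, which of that parent's two haplotypes ("A" or "B") was passed on; a recombination is any change between adjacent markers. This only illustrates the property HAPI exploits and is not HAPI's state formulation.

```python
def count_recombinations(origins):
    """Count changes of transmitted parental haplotype along the markers.

    `origins` is a string such as "AAAABBBB": the parental haplotype
    ("A" or "B") transmitted at each successive marker.
    """
    return sum(1 for a, b in zip(origins, origins[1:]) if a != b)


# Hypothetical paternal transmissions to three children over ten markers.
transmissions = {
    "child1": "AAAAABBBBB",  # one recombination
    "child2": "BBBBBBBBBB",  # none
    "child3": "AABBBBBBAA",  # two
}
for child, origins in transmissions.items():
    print(child, count_recombinations(origins))
```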

When analyzing a dataset containing 103 nuclear families, HAPI ran more than 300 times faster than other methods. Notably, when applied to a family with 11 children, HAPI used an average of 4.2 states per marker, with a maximum of 48 states at any marker. In contrast, other methods use 2^(2c-2) states per marker, where c is the number of children in a nuclear family; for an 11-child family, these methods therefore build about 1.0 million states per marker (2^20 = 1,048,576).
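The state-count comparison above is simple arithmetic, sketched below in Python. It reads the state count for other methods as 2^(2c-2), which is the reading consistent with the quoted figure of about 1.0 million states for c = 11; the function name is hypothetical.

```python
def exponential_states_per_marker(num_children):
    """States per marker, 2**(2c - 2), for a nuclear family with c children."""
    if num_children < 1:
        raise ValueError("a nuclear family needs at least one child")
    return 2 ** (2 * num_children - 2)


states = exponential_states_per_marker(11)
hapi_average = 4.2  # average states per marker reported for HAPI on this family
print(f"exponential formulation: {states:,} states per marker")
print(f"ratio to HAPI's average: {states / hapi_average:,.0f}x")
```

Each additional child multiplies the exponential count by four, which is why such methods cannot readily handle very large families.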

HAPI currently only handles nuclear families, but an ongoing project in the lab is extending the method to apply to general pedigrees so that haplotype-based genetic analyses of family data will not be computationally limited.

Local ancestry inference in Latinos

Figure 1a from Fejerman et al. (2012). The x axis shows physical position in the genome; the y axis shows the -log10 P-value for association at the site based on local ancestry information. Significant association occurs in the 6q25 region.

Local ancestry inference is the process of determining the continental ancestry group from which any position in an individual’s genome descends. Admixed populations such as African Americans and Latinos inherit segments of distinct ancestry, and methods have been developed for inferring ancestry across positions in the genome.
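The output of local ancestry inference can be pictured as a set of labeled segments along a chromosome copy. The Python sketch below uses a hypothetical representation, half-open (start, end, label) intervals with made-up coordinates and illustrative ancestry labels, to look up the ancestry at a position and to summarize the ancestry proportions over the covered region. It shows the data being inferred, not an inference method.

```python
from bisect import bisect_right


def ancestry_at(segments, position):
    """Return the ancestry label covering `position`, or None if uncovered.

    `segments` is a list of non-overlapping (start, end, label) tuples,
    sorted by start, each covering the half-open interval [start, end).
    """
    starts = [start for start, _, _ in segments]
    i = bisect_right(starts, position) - 1
    if i >= 0 and segments[i][0] <= position < segments[i][1]:
        return segments[i][2]
    return None


def ancestry_proportions(segments):
    """Fraction of the covered length assigned to each ancestry label."""
    totals = {}
    for start, end, label in segments:
        totals[label] = totals.get(label, 0) + (end - start)
    covered = sum(totals.values())
    return {label: length / covered for label, length in totals.items()}


# Hypothetical segments on one chromosome copy of a 3-way admixed individual.
segments = [
    (0, 40, "European"),
    (40, 70, "Native American"),
    (70, 100, "African"),
]
print(ancestry_at(segments, 55))
print(ancestry_proportions(segments))
```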

To infer local ancestry in Latinos, who are 3-way admixed, we developed an extension to HapMix (which only applies to 2-way admixed groups). This extension is described in the supplement to the 1000 Genomes Phase I paper. Using this extension, we contributed to a breast cancer study in Latinas that identified a risk factor in the 6q25 locus; the paper describing this work is available here. The development of methods for inferring local ancestry is an area of ongoing study in the lab.