Research

Human population genetics seeks to understand the evolution and distribution of genetic elements within and between groups of individuals. Our lab’s research focuses on this analysis and includes the development of novel methods for making such inferences. Much of our recent work centers on inferring the ancestry of individuals and their relationships to others. We also study the processes that shape genomes, such as recombination. Several of the methods we have developed are designed for the very large datasets now available, making possible analyses that were previously computationally infeasible. A few of our completed projects are described below.

Relatedness inference: understanding current approaches

Close and even distant relatives share long stretches of their genome that are identical through inheritance from one of their common ancestors. These regions are termed identical by descent (IBD) segments. IBD segments give information about the likely pedigree structure that a set of individuals is part of. At the same time, this pedigree often cannot be precisely identified, since, for example, a person shares the same average amount of IBD (12.5% of the genome) with their first cousin as with their great-aunt/uncle. As a result, methods generally classify relationships into degrees of relatedness, each of which represents a set of relationships with the same expected amount of IBD sharing.
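
The arithmetic behind these shared expectations can be sketched in a few lines. This is a minimal illustration, not code from any of the methods discussed: the table of relationships, the function names, and the convention of summing (1/2)^m over the paths that connect a pair through each common ancestor (with m the number of meioses on the path) are assumptions made for the example.

```python
import math

# Hypothetical table: for each relationship, the number of meioses on each
# path that links the two individuals through one common ancestor.
PATHS = {
    "parent-child": [1],
    "full siblings": [2, 2],
    "half siblings": [2],
    "aunt/uncle-niece/nephew": [3, 3],
    "first cousins": [4, 4],
    "great-aunt/uncle": [4, 4],
}


def expected_ibd(paths):
    """Expected fraction of the genome shared IBD: sum of (1/2)^m over paths."""
    return sum(0.5 ** m for m in paths)


def degree(fraction):
    """Degree of relatedness, taken here as -log2 of the expected sharing."""
    return round(-math.log2(fraction))


if __name__ == "__main__":
    for name, paths in PATHS.items():
        share = expected_ibd(paths)
        print(f"{name}: expected sharing {share:.4f}, degree {degree(share)}")
```

Under these assumptions, first cousins and a great-aunt/uncle pair both come out at 0.125 and therefore fall into the same (third) degree, which is why expected IBD sharing alone cannot tell them apart.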

We analyzed 12 methods for classifying relationships between pairs of samples. This revealed that methods based on IBD segments infer relatedness more accurately than those that use allele frequencies. We also found, somewhat intuitively, that characterizing close relatives, such as parent-child pairs and full siblings, can be done accurately (>99% correct) by nearly all methods.

Improved relationship inference using IBD sharing of sample ancestors—DRUID

Figure 1A from Ramstetter et al. (2018). DRUID infers the IBD sharing of an ungenotyped ancestor.

DRUID (Deep Relatedness Utilizing Identity by Descent) is a method for inferring relatedness between individuals. DRUID takes advantage of the fact that characterizing relatedness between more closely related individuals is more accurate than doing so between more distant relatives. It combines IBD segments from a set of siblings to infer the IBD profile of a parent for whom no data are available. It can also leverage siblings together with their aunts/uncles to infer the IBD sharing profile of a grandparent of a set of samples. DRUID’s methodology is far more accurate than pairwise approaches, and we also found that it identifies relationships among samples better than PADRE, another method that combines information from multiple samples to characterize relationships.
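
The idea of pooling siblings' IBD segments can be illustrated with a toy interval union. This is a simplified sketch and not DRUID's actual algorithm: it assumes a single chromosome, segment coordinates in arbitrary units, and that the union of the segments the siblings share with a distant relative approximates what their ungenotyped parent would share. The function names and the data are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def union_length(segments_by_sibling):
    """Total length covered by any sibling's IBD segments with the relative."""
    pooled = [seg for segs in segments_by_sibling.values() for seg in segs]
    return sum(end - start for start, end in merge_intervals(pooled))


# Hypothetical IBD segments that three siblings share with one distant relative.
toy_segments = {
    "sib1": [(10, 30), (50, 60)],
    "sib2": [(20, 40)],
    "sib3": [(55, 70)],
}

if __name__ == "__main__":
    for sib, segs in toy_segments.items():
        print(sib, sum(end - start for start, end in segs))
    print("pooled union:", union_length(toy_segments))
```

In this toy data each sibling alone covers less than the pooled union, which mirrors the intuition that a parent shares at least as much with the relative as any single child does through that parent.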

Recombination: Non-crossover gene conversion

Figure 1c from Williams et al. (2015): locations of de novo gene conversion events plotted on the genome. Blue and red arrows indicate male and female transmissions, respectively.

Non-crossover gene conversion events (also called simply “gene conversion”) are one of the key factors involved in haplotype evolution, and are estimated to occur ~10 times more frequently than crossovers. In recent work, we performed a genome-wide analysis of de novo gene conversion in human pedigree data and identified 107 sites affected by gene conversion events. Using these data, we estimated the rate at which a base is involved in a gene conversion event (5.9×10⁻⁶ per base pair per generation), found evidence for extreme GC bias in gene conversion (68% of sites transmit G or C alleles rather than A or T), found wide variation in tract lengths (from ≤124 bp to over 1 kb), and, strikingly, observed several examples of distinct gene conversion events that cluster in ~20-30 kb intervals.

Part of Figure 4a from Williams et al. (2015): gene conversion events shown in red cluster within relatively short intervals. Plot spans ~30 kb.

Ongoing work in the lab centers on estimating the number and tract length of non-crossover gene conversion events, as well as deep analyses of GC bias and the clustering of gene conversions using whole-genome sequence data from human pedigrees. We are also pursuing questions related to the effects of such extreme GC bias on genome evolution.

Inferring haplotype phase in family datasets

Haplotype transmissions in a nuclear family with 11 children. The father's transmissions are on the left in 11 colored columns, and the mother's transmissions are on the right. Single columns represent individual haplotype transmissions to one child from one parent, and switches in color (from blue to red and vice versa) are recombination events.

Most methods for inferring haplotype phase in family data have a runtime that scales exponentially with the number of individuals in the family, so they cannot readily be applied to very large family datasets. We developed the software HAPI for inferring haplotypes in family data, as described here. HAPI uses a novel state formulation that leverages the fact that real genetic data contain relatively few recombination events and, in so doing, obtains polynomial runtime on real genetic data. The problem of inferring haplotypes in family data has been shown to be NP-hard, but in practice the state formulation that HAPI uses enables it to merge an exponential number of states down to a small number of states for realistic inputs.

When analyzing a dataset containing 103 nuclear families, HAPI ran more than 300 times faster than other methods. Notably, when applied to a family with 11 children, HAPI used an average of 4.2 states per marker, with a maximum of 48 states at any marker. In contrast, other methods use 2^(2c-2) states per marker, where c is the number of children in a nuclear family, and thus for an 11-child family, other methods build 1.0 million states per marker.
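
The growth in the state count for other methods is easy to check numerically. The sketch below reads the count as 2 raised to the power 2c - 2, which matches the figure of about 1.0 million states quoted for c = 11. The function name is hypothetical, and the comparison against HAPI's reported average of 4.2 states per marker is only a back-of-the-envelope ratio, not a runtime measurement.

```python
def naive_states_per_marker(num_children: int) -> int:
    """States per marker for a nuclear family, assuming 2^(2c - 2) states."""
    if num_children < 1:
        raise ValueError("a nuclear family needs at least one child")
    return 2 ** (2 * num_children - 2)


HAPI_MEAN_STATES = 4.2  # average reported in the text for the 11-child family

if __name__ == "__main__":
    for c in (2, 4, 8, 11):
        print(f"c = {c:2d}: {naive_states_per_marker(c):,} states per marker")
    ratio = naive_states_per_marker(11) / HAPI_MEAN_STATES
    print(f"ratio to HAPI's mean state count at c = 11: {ratio:,.0f}")
```

Each additional child multiplies the count by four, which is why exhaustive state enumeration becomes impractical for large sibships.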

HAPI currently only handles nuclear families, but an ongoing project in the lab is extending the method to apply to general pedigrees so that haplotype-based genetic analyses of family data will not be computationally limited.

Local ancestry inference in Latinos

Figure 1a from Fejerman et al. (2012). The x axis shows physical position in the genome; the y axis is the -log10 P-value for association at the site based on local ancestry information. Significant association occurs in the 6q25 region.

Local ancestry inference is the process of determining the continental ancestry group from which any position in an individual’s genome descends. Admixed populations such as African Americans and Latinos inherit segments of distinct ancestry, and methods have been developed for inferring ancestry across positions in the genome.

To infer local ancestry in Latinos, who are 3-way admixed, we developed an extension to HapMix (which only applies to 2-way admixed groups). This extension is described in the supplement to the 1000 Genomes Phase I paper. Using this extension, we contributed to a breast cancer study in Latinas that identified a risk factor at the 6q25 locus; the paper describing this work is available here. The development of methods for inferring local ancestry is an area of ongoing study in the lab.