Human population genetics seeks to understand the evolution and distribution of genetic variants within and between groups of individuals. Our lab’s research focuses on these questions, with an emphasis on developing novel methods capable of handling large samples. Much of our recent work centers on characterizing relatives, which are abundant in large datasets, including several recently generated datasets and those from ongoing studies. We also consider the processes that shape genomes, such as recombination. A few projects we have completed are below.
Relatedness inference: understanding current approaches
Close and even distant relatives share long stretches of their genome identically through inheritance from one of their common ancestors. These regions are termed identical by descent (IBD) segments. IBD segments give information about the likely pedigree structure that a set of individuals belongs to. At the same time, this pedigree often cannot be precisely identified, since, for example, a person shares the same average amount of their genome IBD (12.5%) with their first cousin as with a great-aunt/uncle. As a result, methods generally classify individuals’ relationships into degrees of relatedness, each of which represents a set of relationships with the same expected amount of IBD sharing.
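The 12.5% figure can be checked with a short calculation. The sketch below assumes the standard expectation for non-inbred pedigrees, in which each shared common ancestor contributes (1/2)^m to the expected sharing, where m is the number of meioses on the path connecting the two individuals through that ancestor. It also assumes that degree d corresponds to an expected sharing of 2^-d. The function names are hypothetical and chosen only for illustration.

```python
import math


def expected_ibd_fraction(meioses_per_ancestor):
    """Expected sharing between two non-inbred individuals.

    meioses_per_ancestor holds one entry per shared common ancestor:
    the number of meioses on the path linking the pair through that
    ancestor. Each ancestor contributes (1/2)**m (an assumption stated
    in the lead-in, not a method from the text).
    """
    return sum(0.5 ** m for m in meioses_per_ancestor)


def degree_of_relatedness(fraction):
    """Map an expected sharing fraction to a degree, assuming that
    degree d corresponds to an expected fraction of 2**-d."""
    return round(-math.log2(fraction))


# First cousins: two shared grandparents, 2 meioses up + 2 down each.
first_cousin = expected_ibd_fraction([4, 4])

# Great-aunt/uncle: two shared ancestors (the person's great-grandparents),
# 3 meioses up from the person + 1 down to the great-aunt/uncle.
great_aunt = expected_ibd_fraction([4, 4])

print(first_cousin, great_aunt)             # both 0.125
print(degree_of_relatedness(first_cousin))  # third degree
```

Both relationships involve the same number of meioses through the same number of ancestors, which is why expected sharing alone cannot tell them apart.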
We analyzed 12 methods for classifying degrees of relatedness between pairs of samples. This study revealed that the most accurate methods for inferring relatedness are those based on IBD segments, while methods that use only allele frequencies are less precise. We also found, somewhat intuitively, that characterizing close relatives, such as parent-child and full sibling pairs, can be done accurately (>99% correct) by nearly all methods.
Improved relationship inference using IBD sharing of sample ancestors—DRUID
DRUID (Deep Relatedness Utilizing Identity by Descent) is a method for inferring relatedness between individuals. DRUID takes advantage of the fact that characterizing relatedness between more closely related individuals is more accurate than doing so between more distant relatives. It combines IBD segments from a set of siblings to infer the IBD profile of a parent for whom no data are available. It can also leverage siblings together with their aunts/uncles to infer the IBD sharing profile of a grandparent of a set of samples. DRUID’s methodology is far more accurate than pairwise approaches and we also found that it better identifies relationships of samples than PADRE, another method that combines information from multiple samples to characterize relationships.
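As a rough illustration of the idea of combining siblings' IBD segments, the sketch below takes the union of the segments that each sibling shares with some other relative on a single chromosome. This is a simplified stand-in and not DRUID's actual implementation: the function names, the interval representation, and the example coordinates (in cM) are all hypothetical, and the sketch ignores the part of the parent's genome that was not transmitted to any sibling.

```python
def merge_segments(segments):
    """Merge overlapping (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_length(sibling_segments):
    """Total length covered by the union of all siblings' segments.

    sibling_segments is a list with one entry per sibling; each entry
    is a list of (start, end) IBD segments shared with the same
    relative.
    """
    all_segments = [seg for segs in sibling_segments for seg in segs]
    return sum(end - start for start, end in merge_segments(all_segments))


# Hypothetical segments (cM) that three siblings share with one relative.
sib1 = [(10.0, 30.0), (80.0, 95.0)]
sib2 = [(25.0, 40.0)]
sib3 = [(85.0, 100.0)]

print(merge_segments(sib1 + sib2 + sib3))  # [(10.0, 40.0), (80.0, 100.0)]
print(union_length([sib1, sib2, sib3]))    # 50.0
```

The union covers more of the genome than any single sibling's segments do, which is the intuition behind using several siblings to approximate the sharing of an ungenotyped parent.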
Recombination: Non-crossover gene conversion
Non-crossover gene conversion events (also called simply “gene conversion”) are one of the key factors involved in haplotype evolution, and are estimated to occur ~10 times more frequently than crossovers. In recent work, we performed a genome-wide analysis of de novo gene conversion in human pedigree data and identified 107 sites affected by gene conversion events. Using these data, we estimated the rate at which a base is involved in a gene conversion event (5.9×10^-6 per base pair per generation), found evidence for extreme GC bias in gene conversion (68% of sites transmit G or C alleles rather than A or T), found wide variation in tract lengths (from ≤124 bp to over 1 kb), and, strikingly, observed several examples of distinct gene conversion events that cluster in ~20-30 kb intervals.
Ongoing work in the lab centers on estimating the number and tract length of non-crossover gene conversion events, as well as deep analyses of GC bias and the clustering of gene conversions using whole genome sequence data from human pedigrees. We are also pursuing questions related to the effects of such extreme GC bias on genome evolution.
Inferring haplotype phase in family datasets
Most methods for inferring haplotype phase in family data have runtime that scales exponentially in the number of individuals in the family. These methods cannot readily be applied to very large family datasets. We developed the software HAPI for inferring haplotypes in family data, as described here. HAPI uses a novel state formulation that leverages the fact that real genetic data contain relatively few recombination events and, in so doing, obtains polynomial runtime on real genetic data. The problem of inferring haplotypes in family data has been shown to be NP-hard, but in practice the state formulation that HAPI uses enables it to merge an exponential number of states down to a small number of states for realistic inputs.
When analyzing a dataset containing 103 nuclear families, HAPI ran more than 300 times faster than other methods. Notably, when applied to a family with 11 children, HAPI used an average of 4.2 states per marker, with a maximum of 48 states at any marker. In contrast, other methods use 2^(2c-2) states per marker, where c is the number of children in a nuclear family, and thus for an 11 child family, other methods build 1.0 million states per marker.
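The 1.0 million figure can be reproduced with a one-line calculation. The sketch below assumes the state count for other methods is 2^(2c-2) per marker, which is the reading of the expression that matches the stated value for c = 11; the function name is hypothetical.

```python
def naive_states_per_marker(num_children):
    """States per marker under the assumed 2**(2c - 2) formula,
    where c is the number of children in a nuclear family."""
    return 2 ** (2 * num_children - 2)


# Growth of the state space with family size.
for c in (2, 5, 11):
    print(c, naive_states_per_marker(c))

# For 11 children this is 2**20 = 1,048,576, about 1.0 million.
states_11 = naive_states_per_marker(11)
```

Each additional child multiplies the count by four, so the gap relative to the handful of states HAPI reports per marker widens quickly as families grow.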
HAPI currently only handles nuclear families, but an ongoing project in the lab is extending the method to apply to general pedigrees so that haplotype-based genetic analyses of family data will not be computationally limited.
Local ancestry inference in Latinos
Local ancestry inference is the process of determining the continental ancestry group from which any position in an individual’s genome descends. Admixed populations such as African Americans and Latinos inherit segments of distinct ancestry, and methods have been developed for inferring ancestry across positions in the genome.
To infer local ancestry in Latinos, who are 3-way admixed, we developed an extension to HapMix (which only applies to 2-way admixed groups). This extension is described in the supplement to the 1000 Genomes Phase I paper. Using this extension, we contributed to a breast cancer study in Latinas that identified a risk factor in the 6q25 locus; the paper describing this work is available here. The development of methods for inferring local ancestry is an area of ongoing study in the lab.