I have often lamented the lack of genetic genealogy studies based on French Canadians. As a group, they are a remarkable example of a founder population, and the detailed records kept by the Catholic priests mean that it is a population where genealogy and genetics meet. A recent paper published in the European Journal of Human Genetics on a study of French Canadian DNA has implications for all genetic genealogy, but especially for those from inbred populations, including Colonial Americans. The researchers, working jointly from the University of Montreal, the University of Quebec at Chicoutimi, and the University of Ottawa, take advantage of that rare combination. Written by Héloïse Gauvin et al., the paper, titled “Genome-wide patterns of identity-by-descent sharing in the French Canadian founder population,” was pointed out to me by CeCe Moore and can be found in its entirety here.
The paper’s abstract explains why the populations of New France are perfect for testing the quality of phasing engines and their support for FastIBD software.
In genetics the ability to accurately describe the familial relationships among a group of individuals can be very useful. Recent statistical tools succeeded in assessing the degree of relatedness up to 6–7 generations with good power using dense genome-wide single-nucleotide polymorphism data to estimate the extent of identity-by-descent (IBD) sharing. It is therefore important to describe genome-wide patterns of IBD sharing for more remote and complex relatedness between individuals, such as that observed in a founder population like Quebec, Canada. Taking advantage of the extended genealogical records of the French Canadian founder population, we first compared different tools to identify regions of IBD in order to best describe genome-wide IBD sharing and its correlation with genealogical characteristics. Results showed that the extent of IBD sharing identified with FastIBD correlates best with relatedness measured using genealogical data. Total length of IBD sharing explained 85% of the genealogical kinship’s variance. In addition, we observed significantly higher sharing in pairs of individuals with at least one inbred ancestor compared with those without any. Furthermore, patterns of IBD sharing and average sharing were different across regional populations, consistent with the settlement history of Quebec. Our results suggest that, as expected, the complex relatedness present in founder populations is reflected in patterns of IBD sharing. Using these patterns, it is thus possible to gain insight on the types of distant relationships in a sample from a founder population like Quebec.
As the researchers point out, phasing engines have rarely, if ever, been tested against a real population with known ancestry. Instead, simulations are normally run, and while simulations are good, they aren’t the same as real data. Working with a real population, the study was able to identify the more accurate phasing engines and ended up selecting FastIBD. The authors also liked Germline and IBDLD, but FastIBD lived up to its name and ran the data faster.
These are my take-aways from the paper:
1. Beagle (upon which 23andMe has built its phasing engine, Finch, for Ancestry Composition) does a poor job of correlating to actual French Canadian heritage, and FastIBD or Germline would be better. This disparity likely extends to any inbred or founder population, including early Colonial Americans, Puerto Ricans and Ashkenazi Jews. Unfortunately, it likely also includes people from isolated pockets such as the mountainous regions of Germany, France and Italy. It is possible that these implications extend beyond inbred populations and show that FastIBD is better overall for DNA pseudophasing. Please note that 23andMe does not currently use phased DNA for DNA-Relatives, and their modifications to Finch were to allow it to incorporate new information on a rolling basis as their database size increased. Ancestry DNA does phase with a Beagle-derived engine before giving matches. They have modified it, but I can’t assess whether that modification is sufficient to improve the matches they give – I’m sure that’s a trade secret! They said at a conference in January 2014 that they had a way to improve the accuracy of smaller matches. Could they be switching to a new phasing engine? If so, I hope they read this paper, which was published in December of 2013, shortly before their announced plan to eliminate many of the ‘very low’ category false matches.
2. Using the best phasing engine (FastIBD), the researchers were able to use 2 cM segments productively, but with the poorer phasing engines they had to stay above 3.8 cM. This reinforces that FTDNA’s use of unphased sub-5 cM segments is problematic. I actually use only segments of 4 cM and larger when considering cousinhood from the FTDNA matches, which still seems a bit dicey on unphased results, as 4 cM is only 0.2 cM above the 3.8 cM floor the researchers needed with the poorer phasing engines. Not only does FTDNA use these smaller segments, it also has a mandatory 20 cM threshold that these questionable segments must exceed before it reports someone as a cousin. I find this methodology Goofy with a big hat! This segment size limitation might also have implications for Ancestry Composition (AC) at 23andMe, since they are using a poorer phasing engine. Perhaps only segments above 3.8 cM should be taken seriously. However, looking forward, if any of these companies were to adopt a more accurate phasing engine, such as FastIBD, genetic genealogists could then look even further back in time with more confidence by looking at smaller segments. Of course, that assumes that enough SNPs per cM are being tested to ensure confident segment identification.
3. The high incidence of relatedness between matches in the HLA region of Chromosome 6 is reinforced by this study, and matches in that area should be given less weight. These are the regions that relate to immune function and are likely preserved because natural selection provides pressure to do so. These regions are scattered across the 6th chromosome, and it’s easier to merely be wary of the closeness of all Chromosome 6 matches. One could suss out all the exact regions currently known if determined, but how far on each side of each HLA area is protected isn’t known. Perhaps the center portion of the large arm of Chromosome 6 is more normal, and matches there might be treated the same as all other matches. This effect might also affect the X, as it is known for immune function, but that might be hard to figure out considering the already odd nature of the X inheritance pattern.
4. Total cM shared between two matches is a good indicator of total relatedness in inbred populations (at least as far back as 8 or 9 generations, which is the limit of the founding of Quebec) but isn’t great for determining the MRCA (or LCA, as the authors call it). Segment length might be a better measuring stick, but only on phased samples; otherwise segment overlap could be very misleading. The number of segments doesn’t seem to be useful for inbred populations, as each segment could come from a different LCA. Unfortunately, we currently don’t have access to match information that is both phased (Ancestry DNA only) and gives segment length information (23andMe and FTDNA). FTDNA is considering creating a phasing engine… but I don’t know if they have a strong enough programming department or a hefty enough load of scientists working for them to create one independently. Also, they do not have a database that’s large enough to do the heavy lifting required for testing such a piece of software, as their recent ‘New and slightly improved’ ethnicity calculator, “MyOrigins,” demonstrates. [To read more on MyOrigins, please see this article: New FTDNA Ethnicity Calculator: myOrigins – They Bunted]
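For readers who keep their match segments in a spreadsheet, the take-aways above can be turned into a small script. What follows is only a rough sketch in Python, not the method used by the paper or by any testing company. The names `Segment` and `summarize_match` are made up for illustration, and the 4 cM default is simply my own personal cutoff mentioned above. It drops segments below a minimum length, then reports the total cM, the longest segment, and how much of the total sits on Chromosome 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One shared segment with a match (hypothetical structure for illustration)."""
    chromosome: str   # e.g. "1", "6", "X"
    length_cm: float  # segment length in centimorgans


def summarize_match(segments, min_segment_cm=4.0):
    """Summarize a match after discarding segments shorter than min_segment_cm.

    min_segment_cm is an assumption: 4.0 is a personal cutoff for unphased
    data, while the post reports 3.8 cM and 2 cM as the floors the paper
    reached with poorer and better tools respectively.
    """
    kept = [s for s in segments if s.length_cm >= min_segment_cm]
    total_cm = sum(s.length_cm for s in kept)
    # Longest segment is the better (still imperfect) hint about the MRCA.
    longest_cm = max((s.length_cm for s in kept), default=0.0)
    # Chromosome 6 sharing is reported separately so it can be given less weight.
    chr6_cm = sum(s.length_cm for s in kept if s.chromosome == "6")
    return {
        "segments_kept": len(kept),
        "total_cm": total_cm,
        "longest_cm": longest_cm,
        "chr6_cm": chr6_cm,
    }


if __name__ == "__main__":
    # Made-up example data, not taken from any real match list.
    example = [
        Segment("1", 12.5),
        Segment("6", 6.0),
        Segment("3", 2.5),
        Segment("X", 3.9),
    ]
    print(summarize_match(example))                      # personal 4 cM cutoff
    print(summarize_match(example, min_segment_cm=2.0))  # the paper's best-case floor
```

Running it with both cutoffs on the same match shows how much of a reported total depends on small segments, which is the heart of my complaint in point 2.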
It is currently possible to greatly improve both ethnicity estimates and cousin matches at any company willing to convert to one of the three pseudophasing engines the study identified as good: Germline, IBDLD, or FastIBD. The authors also mentioned that adjustments might be possible to other software. But tweaking software to match real-life conditions instead of simulation runs is challenging and a lot of work.
In the future, we may be able to look at 2 cM segments with confidence, but they are currently below the accuracy level of the phasing engines being used.
And finally, descendants of inbred populations still have the largest challenges for identifying our ancestors and correctly categorizing our cousins. Our best bet, although still uncertain due to possible segment overlap, is to go by segment length in cM to identify MRCA (or LCA) which is what we’re all interested in. [To read more on cM vs Mb please see this article: AncestryDNA – The cM Mb Disparity – Updated]