GUEST POST – What a Difference a Phase Makes

The following is a guest post by Ann Turner, founder of the of the GENEALOGY-DNA mailing list at RootsWeb and co-author (with Megan Smolenyak) of “Trace Your Roots with DNA: Using Genetic Tests to Explore Your Family Tree.” Thank you Ann for this terrific post!

Genetic genealogists use autosomal DNA testing to locate people who share some DNA, enough to point to a relationship in a genealogical time frame. We’re not impressed by accidental matches that occur simply because all humans share 99.9% of their DNA. We want to be confident that the shared DNA segment is Identical by Descent (IBD) from a particular common ancestor, one who lived some number of generations in the past.

Two practical difficulties stand in the way of definitive confirmation. One is that our pedigrees are not complete, and we cannot test every link in the chain to prove that the segment traveled down the pathway we’ve identified through the paper trail. Indeed, as more data accumulates we frequently discover that a match we attributed to one ancestor must have come through an entirely different line.

The other difficulty is that the testing method has its own limitations. The inner construction of the reported DNA segment is cryptic because the DNA is chopped into tiny pieces before analysis. The test can tell whether a certain allele (variant) is present or absent in the soup, but it cannot ascertain whether the father or mother contributed the allele. The results are reported as a genotype, with the two alleles listed in an arbitrary order – that is, the results are unphased. A marker is counted toward a match if it is at least half-identical, with at least one allele in one party matching at least one allele in the other. That obviously leaves a lot of wiggle room, so many consecutive markers telling the same story are needed to rule out coincidence. Even so, there may be some false positives, where an apparent segment is actually a pseudo-segment, a blend made up of small bits and pieces from the paternal and the maternal side.

Phasing (identifying which alleles all came from one chromosome, whether paternal or maternal) can reduce the number of false positive matches. Phasing is not difficult to accomplish with raw data from a child and both of his parents. Family Tree DNA (FTDNA) is the only company to accept raw data uploads from other companies (AncestryDNA and 23andMe), so out of curiosity, I constructed a synthetic file for my son. It mimicked a transfer file but listed just the alleles he inherited from me, presented as a homozygous genotype. Thus a segment would not receive a helping hand from the paternal allele if the maternal allele was a mismatch.

My son’s regular kit has 318 matches in common with me (excluding known close relatives). Would this new synthetic file reveal some false positives? To my surprise, the number of matches dropped precipitously, down to 32. That would amount to a 90% false positive rate, which beggars belief.

The explanation must lie elsewhere. Closer examination of the results reveals a likely candidate. The 32 matches persevering in the phased file had 56 segments under 5 cM in size, while the same 32 matches in the genotype file had 351 segments. That’s a drop-out rate of 84%, an indication that most small segments are actually pseudo-segments. It’s quite possible that the drop-out rate would be even higher if the other party to the match had phased data.

The disappearance of small segments is not limited to matches in the distant relative category. Figures A and B show an uncle compared to me, my son’s genotype file, and my son’s phased file, the first with the 5 cM threshold and the second with the lower threshold. The larger segments behave much as expected: my son inherits some but not all of the segments where I match my uncle, with a portion of segments being reduced in size due to recombination. In contrast, many of the smaller segments appear only in my son’s genotype file.

FTDNA’s algorithm for declaring a match is proprietary, but from anecdotal reports, it seems to require a total of 20 cM, including small segments. Coincidence or not, the only matches remaining in the phased file had a total of 20 cM or more.

The dependency on small segments may account for some other anomalies. For instance, my son has 115 out of 733 matches (15.7%) that don’t appear in either parent, so they are either false positives in the child or false negatives in the parent. Other people have reported figures of 20% or more. Would this figure be reduced if the algorithm were tweaked a bit to reduce reliance on small segments? (Side note: customers at 23andMe experience a similar effect, but it is mainly due to the cap on the number of relatives. A match in the child’s list may have scrolled off the list of the parent.)

Another example is a failure to triangulate results where expected. FTDNA provides a handy matrix tool for pair-wise comparisons. I have a case where I match nine people on the same segment. They all show as matches to each other in the matrix. But my sister, who is half-identical with me for the segment in question, matches only seven of those. Is it possible that the missing ones actually do share the segment of interest but fail on the secondary requirements to declare a match?

A third scenario might be an effect on admixed populations, such as African Americans, who find very few matches in the FTDNA database. The pseudo-segments may be easier to patch together when both parties are coming out of the same general gene pool with similar allele frequencies. Would African Americans see some new matches if the 20 cM requirement were eliminated? After all, those small segments don’t add much in the way of substantive information if they disappear in a phased file.

Do we even care about false negatives? Most of us have plenty of leads to keep us busy as it stands. Changing an algorithm might very well result in increasing the number of false positives. There is always a trade-off, and the goal of avoiding false positives is greatly to FTDNA’s credit.

Yet it is not just a matter of missing a magical match that would help us break through a brick wall. False negatives can lead to erroneous conclusions. This affects everyone, but it is especially critical for people who can test one parent but are searching for clues to the other parent (e.g. donor-conceived children or adoptees who have found their birth mother). They can’t positively conclude that a match is for the other parent just because it doesn’t show up in the parent they tested.

The best policy (which applies to all testing companies) is to keep an open mind about the possibility of false positives and false negatives.

Uncle_niece_great-nephew_great-nephew-phased_5cM

FIG. A

Uncle_niece_great-nephew_great-nephew-phased_1cM

FIG. B

 

 

 

 

 

.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

d

4 Responses

  1. Dianna Harbin 31 July 2015 / 7:47 pm

    Thanks for addressing issues concerning African Americans and other highly admixed populations as well as false negative result reporting! Both are not addressed often enough.

Leave a Reply

Your email address will not be published. Required fields are marked *