During a phone call with AncestryDNA representatives this week (unfortunately I was not able to attend), numerous genealogists heard two major announcements:
- The AncestryDNA database has hit 18 million test takers (such great news!); and
- There are significant changes coming to our DNA match list.
The announcement started to appear on the DNA match list page yesterday:
Clicking on the link brings up information about the changes:
The changes to the DNA match list comprise the following:
1. The number of shared segments should improve
From the announcement: “The DNA you share with a match is distributed across segments – short segments, long segments, or some combination of both. Our updated matching algorithm may reduce the estimated number of segments you share with some of your DNA matches. This doesn’t change the estimated total amount of shared DNA (measured in centimorgans/cM) or the predicted relationship to your matches.”
In Genetic Genealogy Tips & Techniques, we’ve noted that the number of shared segments is usually inflated. For example, I share 44 and 49 segments with my parents when in fact it should be exactly 23. It will be interesting to see if the number is 23 after the update. We haven’t found the number of segments to be particularly useful evidence anyway, in most cases.
2. Length of the longest shared segment will be provided
From the announcement: “The length of the longest segment you and a DNA match have in common can help determine if you’re actually related. The longer the segment, the more likely you’re related. Segment length is also the easiest way to evaluate the difference between multiple matches that all show the same estimated relationship. Our updated matching algorithm can show you the length of the longest segment you share with your matches.”
This is a surprising change! Knowing the length of the longest segment shared with a match will be enormously beneficial to test-takers with endogamous ancestry. It will allow them to identify, to a degree, matches that share only small pieces of DNA (and thus much older common ancestry) and matches that share at least one large piece of DNA (and thus more recent common ancestry).
Of course, the efficacy of this improvement depends entirely on the previous improvement, namely the ability to identify the true length of segments shared by a match.
3. Matches that share 6 to 7.9 cM will be eliminated
From the announcement: “Our updated matching algorithm will increase the likelihood you are actually related to your very distant matches. As a result, you’ll no longer see matches (or be matched to people) that share less than 8 cM with you – unless you have added a note about them, added them to a custom group or have messaged them. These changes to the matching algorithm will reduce the total number of DNA matches you have and the number of new matches you will receive. It may also affect the number of ThruLines you may see.”
This is the big one, and the focus of the remainder of this blog post!
There is an updated Matching White Paper available from Ancestry.
The (In)Validity of Small Segments
A very large percentage of small segments are not valid shared DNA, and half of the valid small segments are at least 20 generations old.
As many of you have seen before, I call small segments “poison.” I equate them to poison chocolate-covered candies (because I love non-poisoned chocolate-covered candies). If I handed someone a bowl of candies and told them that I had poisoned 50% of them and the poisoned candies looked like every other candy in the bowl, no one would reach in and grab a candy. There are plenty of sources of information about the danger of small segments, including “A Small Segment Round-Up.”
Unfortunately, the answer to that scenario is often “Yes, but…” There are no ‘buts’ when it comes to small segments. There is no test or application that can distinguish between false and valid small segments. There is no evidence that mapping small segments, triangulating small segments, or finding small segment matches in shared match groups increases the likelihood that a small segment is valid.
Phasing (separating DNA from a test into the test-taker’s paternal and maternal contribution) reduces false segments, but typically only one person in a match set is phased and the unphased individual may be matching due to a pseudosegment. Phasing also does not resolve the ‘age of the segment’ issue discussed below.
And yes, there are VERY limited scenarios when we CAN identify small segments as being valid, such as during visual phasing (showing recombination at the ends of the chromosomes, for example). Another scenario is where another family member shares a large segment with the match that encompasses the small segment that I share with the match (in which case we’re working with a large segment, not a small segment).
Part of the trouble with small segments is that they are enticing. We can find what looks like great data in the enormous pool of distant matches. For example, it is estimated that as much as 50% of our matches at AncestryDNA are in the 6-7.9 cM range. I have 83,108 matches in the range of 6 to 20 cM. Even conservatively taking 25% of those matches as being in the 6-7.9 cM range means 20,777 matches. And if half of those are based on false segments, that would be about 10,000 matches. How can I not find genealogical connections among those 10,000 matches? Unfortunately, finding the genealogical connection does not mean the segment is valid.
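The arithmetic in that estimate can be sketched in a few lines of Python. This is only an illustration of the back-of-the-envelope reasoning above: the 25% share and the 50% false rate are the assumptions stated in the text, not measured values, and the function name is hypothetical.

```python
def estimate_small_segment_matches(total_matches, share_in_range, false_rate):
    """Return (matches in the 6-7.9 cM range, matches resting on false segments).

    Both rates are assumptions supplied by the caller, not measured values.
    """
    in_range = int(total_matches * share_in_range)
    false_matches = int(in_range * false_rate)
    return in_range, false_matches


# 83,108 matches between 6 and 20 cM, assuming 25% fall in the 6-7.9 cM range
# and assuming half of those are based on false segments.
in_range, false_matches = estimate_small_segment_matches(83_108, 0.25, 0.50)
print(in_range)       # 20777
print(false_matches)  # 10388, i.e. "about 10,000"
```

Changing either assumed rate shifts the totals, but the pool of potentially false matches stays in the thousands under any plausible values.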
In addition to the validity problem, small segments can be very, very old. The best source of data we currently have on the possible age of segments is from the Speed & Balding paper which is linked and discussed on the Identical by Descent page on the ISOGG Wiki. For example, according to their simulations only 20% of valid 5 megabase shared segments (which is roughly equivalent to 5 cM) are within the past 10 generations. 50% are greater than 20 generations old.
Why is age important? Because when a segment is very old, that means it could have come from a very distant shared ancestor that we may not even know we share. Finding a shared segment and finding a shared ancestor are two separate pieces of evidence that we have to work to combine. That is often impossible to do when working with only small segments, since our trees are all woefully incomplete at the distances from which small segments can come.
But I have a confirmed cousin sharing 6 cM!
Because so many small segments are invalid, and because so many valid segments are so old, the amount of evidence necessary to establish that a small segment shared with a genealogical cousin came from the identified common ancestor is gargantuan.
First, there must be evidence that the small segment is valid. The genealogical connection is not that evidence; that’s circular reasoning (and there’s no science to support that conclusion).
Second, there must be evidence that the small segment came from the identified common ancestor and not from another, potentially unknown and very distant, shared line. This is probably the more challenging of the two issues given the potential age of small segments.
Without evidence for both issues, the genetic conclusion is unsupported.
And see my point above about my hypothetical 10,000 invalid matches. It is impossible for me not to find genealogical cousins among those matches. Believing that I beat the odds is confirmation bias.
But there are valid genealogical connections among those distant matches!
This is absolutely correct. There are very valid genealogical connections among distant matches, including among all the false matches. That the shared DNA is invalid does not invalidate the genealogical connection. This change will result in many genealogical connections being lost. For example, the following ThruLine is based on a small segment and used several trees to find a potential genealogical connection:
I have 69 matches in the range of 6-7 cM that have a Common Ancestor designation, out of 431 total (so 16%).
This information will potentially be lost when the 6-7.9 range is eliminated (although it might still be found among the trees at Ancestry). However, this is for “the greater good.”
Yet another problem with distant matches is that DNA test-takers have no idea that a large percentage of small segments are invalid. Test-takers that know this and use small segments cautiously represent a minuscule portion of the database. That means that most people that see the ThruLine above accept its validity, believe the genetic connection is valid, and conclude that they’ve proven their descent from the ThruLine ancestor. Obviously that’s not the end of the world. But why not prevent these incorrect conclusions when there’s a way to do so?
Losing these genealogical connections is the price we pay to protect current and future test takers from relying on false data. And we have so many other genealogical connections to pursue among our valid DNA matches!
But small segments can identify biogeographical origins!
Small segments have been used to potentially identify very old biogeographical origins. For example, many people with African ancestry have found African matches and this may point to biogeographical origins.
While I hypothesize that it might be possible to point to very high-level biogeographical origins with small segments, it’s important to acknowledge that this hasn’t been demonstrated with any scientific evidence. For example, I haven’t seen an analysis of someone with African ancestry versus someone without African ancestry to see if only one or both can identify African matches. Of course that doesn’t invalidate the approach, it only means we must proceed cautiously.
Indeed, it will be important to consider and study how this might affect people with African or other historically marginalized ancestry. All the available science about small segments indicates that this change will actually improve the method by weeding out false data and preventing incorrect conclusions. Additionally, as more Africans test, the number of larger shared segments will increase. Those of us that have been genetic genealogists for a long time have seen several instances of DNA results provided to people of African descent that were later discovered to be incorrect as testing improved and databases grew. Hopefully this change prevents that from happening.
There is a GREAT blog post from Tracing African Roots about the dangers and potential rewards of this approach: “How to find those elusive African DNA matches on Ancestry.”
The road ahead
Regardless of the known issues with matches in the range of 6-7.9 cM, I know that people want to and will continue to use the genealogical data that is currently among some of these matches (such as Common Ancestor hints). You can retain matches in this range by doing one of four actions:
- Add a note in the match note field;
- Add them to a custom group using the colored dot system;
- Message them; or
- Add a star [this was added by Ancestry on July 18th].
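The retention rule described in the announcement and the list above can be sketched as a simple filter. This is a minimal illustration only, not Ancestry's implementation: the `Match` class, its field names, and the sample data are all hypothetical, and the 8 cM threshold comes from the announcement quoted earlier.

```python
from dataclasses import dataclass

THRESHOLD_CM = 8.0  # matches sharing less than this are slated for removal


@dataclass
class Match:
    # Hypothetical record for one DNA match; field names are illustrative.
    name: str
    shared_cm: float
    has_note: bool = False
    in_custom_group: bool = False
    messaged: bool = False
    starred: bool = False


def is_retained(match: Match) -> bool:
    """True if the match would survive the update under the rule described above."""
    if match.shared_cm >= THRESHOLD_CM:
        return True
    # Below the threshold, any one of the four actions keeps the match.
    return any([match.has_note, match.in_custom_group, match.messaged, match.starred])


matches = [
    Match("Cousin A", 7.2),                 # no action taken: would be dropped
    Match("Cousin B", 6.5, has_note=True),  # kept because of the note
    Match("Cousin C", 7.9, starred=True),   # kept because of the star
    Match("Cousin D", 12.0),                # above the threshold: unaffected
]
print([m.name for m in matches if is_retained(m)])
# ['Cousin B', 'Cousin C', 'Cousin D']
```

Any single action is enough, so the quickest way to preserve a batch of 6-7.9 cM matches is whichever action is fastest to apply in bulk.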
We don’t have an exact date for the match update, only that it will happen in early August.
Blog post round-up
Randy Seaver at Genea-Musings has a round-up of other blog posts about the change: “AncestryDNA Changes Coming Soon – What I’m Doing.”