Sheryl, a member of the Genetic Genealogy Tips & Techniques group (which just broke 50,000 members!), recently commented on a thread about shared DNA outliers, describing a situation within her own family. I thought it would be a great opportunity to discuss outliers and how to deal with them. Sheryl kindly agreed!
For background, we examined an outlier situation once before on this blog, where second cousins once removed (2C1R) did not share DNA (see “Analyzing a Lack of Sharing in 2C1R Relationship”).
Identifying an Outlier
Sheryl indicated that she and her mother Grace appeared to be outliers with Sally, their first cousin (1C) and first cousin once removed (1C1R), respectively. Grace shared 482 cM with her 1C Sally, and Sheryl shared 215 cM with her 1C1R Sally. Not surprisingly, Grace and Sheryl share 3,485 cM, an expected amount for mother/daughter.
If you’ve worked with DNA testing for a while, or you’ve tested many relatives, you start to get a feel for the amounts of DNA that relatives of various relationships should share. For example, hearing someone shares 482 cM with an expected first cousin automatically raises red flags.
However, you don’t need to memorize expected amounts of shared DNA for every relationship. There are online resources and tools that allow you to look up ranges and probabilities for most genealogically relevant relationships.
Using the Shared cM Project
The August 2017 version of the Shared cM Project, for example, has a graphic with the ranges and averages for relationships through 8C based on 25,000+ submissions:
But that’s not all! The Shared cM Project has a PDF with histograms, breakdowns by company, and other analyses that didn’t fit onto the graphic. I cannot encourage you too strongly to READ THE FULL PDF FOR THE SHARED CM PROJECT!
For the August 2017 version of the Shared cM Project, there were a total of 1,512 submissions for 1C relationships. That’s a lot of submissions! The 99th percentile range for 1C was 553 cM to 1225 cM, with an average of 874 cM:
The Shared cM Project PDF also has a histogram for the 1C relationship submissions, which is a graph that shows the distribution of these submissions (I added the red arrow manually):
Along the bottom are various “bins” or buckets, each covering a small cM range (such as the 870 to 915 cM bin in the middle). The height of each bin is the number of submissions that fall within it. The 870-915 bin, for example, has 192 submissions. Because there are so many 1C submissions, the histogram shows a beautiful bell curve distribution, just as we would expect.
As shown by the red arrow, however, sharing of 482 cM by suspected 1Cs Grace and Sally falls far outside the range for 1C shown by this chart. Thus, a result of 482 cM is an OUTLIER according to the Shared cM Project. An outlier is “a statistical observation that is markedly different in value from the others of the sample,” according to Merriam-Webster. In other words, an outlier is a shared cM amount for a genealogical relationship that falls outside an expected range. Indeed, 482 cM falls far outside the range of 553 to 1225 cM.
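The range check described above is simple enough to sketch in a few lines of Python. The function name and the dictionary layout are made up for illustration; the numbers are the 1C figures quoted in this post from the August 2017 Shared cM Project.

```python
# 99th-percentile range and average for 1C, as quoted above from the
# August 2017 Shared cM Project. The dictionary layout is an assumption.
SHARED_CM_1C = {"low": 553, "high": 1225, "average": 874}


def is_outside_range(shared_cm, rng):
    """Return True if shared_cm falls outside the recorded range."""
    return shared_cm < rng["low"] or shared_cm > rng["high"]


grace_sally = 482
print(is_outside_range(grace_sally, SHARED_CM_1C))       # True
print(SHARED_CM_1C["low"] - grace_sally)                 # 71 (cM below the low end)
print(round(grace_sally / SHARED_CM_1C["average"], 2))   # 0.55 (fraction of the average)
```

A result of True only says that the number is unusual for the relationship that was assumed. It says nothing about whether that assumed relationship is correct.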
Using the Relationship Tools at DNA Painter
In addition to the Shared cM Project, there are interactive tools at DNA Painter that allow you to evaluate a shared cM amount (see The Shared cM Project 3.0 tool v4). To use this tool, simply enter a shared cM amount in the empty field (shown by the manually-added red arrow below):
Entering 482 cM, for example, produces a chart of probabilities:
These probabilities were generated by Leah Larkin (The DNA Geek) using Figure 5.2 of the AncestryDNA Matching White Paper published on 31 March 2016. Leah used some elegant analysis to extract these probabilities, and programmer extraordinaire Jonny Perl of DNA Painter converted that information into this interactive probability tool.
As shown in the probability chart, there is a 3.92% probability that 482 cM is a 1C. Due to the slight mismatch between the Shared cM Project and The DNA Geek probabilities, there is a note indicating that “this relationship has a positive probability for 482cM in thednageek’s table of probabilities, but falls outside the bounds of the recorded cM range (99th percentile).”
Looking at The DNA Geek extracted probabilities and ranges, the range for 1C is approximately 440 cM to 1500 cM. Thus, according to The DNA Geek probabilities a result of 482 cM is NOT an outlier. However, it is still extremely low and should be treated with caution; indeed, it should still be treated as if it is an outlier.
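To put the two sources side by side, here is a small sketch. The DNA Geek bounds are the approximate ones mentioned above (440 to 1500 cM) and the 3.92% is the figure from the DNA Painter chart. The function name, the variable names, and the 5% cutoff are hypothetical and only there to illustrate the "treat it as if it is an outlier" rule of thumb.

```python
# Ranges (in cM) for a 1C relationship from the two sources discussed above.
RANGES_1C = {
    "Shared cM Project (Aug 2017)": (553, 1225),
    "The DNA Geek (approximate)": (440, 1500),
}

PROB_1C_AT_482 = 3.92   # percent, from the DNA Painter chart
LOW_PROBABILITY = 5.0   # hypothetical cutoff, not taken from either source


def sources_flagging(shared_cm, ranges):
    """List the sources whose range does NOT contain shared_cm."""
    return [name for name, (low, high) in ranges.items()
            if not low <= shared_cm <= high]


flagged = sources_flagging(482, RANGES_1C)
print(flagged)  # ['Shared cM Project (Aug 2017)']

# Be cautious if any source flags the value or the probability is low.
needs_more_evidence = bool(flagged) or PROB_1C_AT_482 < LOW_PROBABILITY
print(needs_more_evidence)  # True
```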
But is 482 cM for Grace and Sally really an outlier result? Or are Grace and Sally not in fact first cousins? That is the true question!
Is It an Outlier? The Extreme Danger of Confirmation Bias
The danger here, and what I find most people do, is to assume that this is indeed an OUTLIER. However, it is only an outlier if Grace and Sally are in fact first cousins. If they are actually another relationship for which 482 cM falls within the range, then the result is not an outlier.
Proceeding as if Grace and Sally are simply first cousins with an outlier result, or trying to prove they are 1C without trying to disprove it, is confirmation bias. It assumes an outcome and clouds judgment, potentially leading one to ignore or devalue contradictory evidence.
When a shared cM amount is very low or very high, a red flag is raised and we must do our best to resolve that red flag using a combination of documentary research and additional DNA testing. Anything else is confirmation bias.
Are Grace and Sally First Cousins? So Far, We CANNOT Know!
To recap, all we know so far is that Grace shares 482 cM with Sally, that Sheryl shares 215 cM with Sally, and that Grace and Sheryl share 3,485 cM. But we don’t know whether these are outliers or whether there is another explanation such as a different relationship.
Indeed, even if there is a VERY well-documented tree showing that Grace and Sally are first cousins, the fact that they share an unexpected amount of DNA (as demonstrated by a scientifically solid analysis) means that there is potentially a conflict between the tree and the DNA. It is our job as genealogists – professional problem solvers – to resolve this conflict with additional evidence.
How do we resolve the conflict with additional evidence? Easy! We go out and identify or generate that additional evidence.
To examine this possible outlier situation, we can formulate hypotheses to test. We generate a hypothesis by taking the information we have so far, scant as it may be, and formulating some educated guesses to explain the information.
We then try to disprove (NOT prove!) the hypothesis. If we disprove a hypothesis we can discard it. If we FAIL TO DISPROVE a hypothesis, it remains the most likely explanation for the evidence and may ultimately become our standing conclusion.
There are (at least) two competing hypotheses here, which are likely to be mutually exclusive:
- The first hypothesis is that Grace and Sally are indeed 1C and the 482 cM they share is indeed an outlier result.
- The second hypothesis is that Grace and Sally have some relationship other than 1C into which 482 cM more solidly fits, such as half 1C or 1C1R (which had probabilities of 88.88% according to The DNA Geek probabilities).
We could conceivably come up with other hypotheses, although these are by far the two most likely scenarios. Since we don’t have unlimited time, money, and resources, we can’t disprove every possible hypothesis and thus we stick to the most likely hypotheses. For example, a possible hypothesis is that this data was falsely placed by aliens to achieve some otherworldly goal, but generally speaking we are going to ignore that hypothesis for every analysis!
How do we test a hypothesis? To test a hypothesis, we need new evidence (and/or to analyze old evidence in new ways).
In this scenario, we should gather evidence in two ways:
First, we must reexamine the documentary trail. Is there any suggestion or evidence in the documents that Grace and Sally are not 1C? How strong is the evidence that they are indeed 1C? Since we were not present for each of these four conceptions (shown by the red arrows), we don’t know how accurately the records reflect the genetic reality.
Second, we must obtain additional DNA evidence by testing other family members. There are two ways to do this.
The first way is for both Grace and Sally to look for random matches in the database to “Grandfather” and “Grandmother.” If a leading hypothesis is that Grace and Sally might be half 1C, then one of them would be missing matches on one grandparent’s side. Their parents would be half siblings, so Grace and Sally would not share the same Grandmother or (more likely) the same Grandfather.
The second way, and more likely to yield stronger evidence, is targeted testing of close family members to examine the specific situation. For example, family members such as the grandparents, the parents, the siblings of the parents, and others could be very useful. In this particular example, the grandparents and parents are not living; however, there are many other family members who can shed light on the possible outlier situation.
Adding More DNA Evidence to the Analysis
Among other relatives, Sheryl has tested her uncle Earl (brother of Grace) and her great-aunt Ann (aunt of both Grace and Sally). Importantly, Grace and Earl share 2,678 cM and thus are full siblings. Additionally, Earl and Sally share 578 cM.
NOT an outlier (hypothesis #2): If Grace and Sally are half 1C according to the second hypothesis, then their parents would be half siblings. Thus, Ann would likely be a half sibling to one and a full sibling to the other (she could also be a half-sibling to both of them, if there were three different mothers or three different fathers). We would expect Ann to match one niece as a full aunt/niece and the other niece as half aunt/niece. There’s no indication so far of which it might be.
Outlier (hypothesis #1): If Grace and Sally are full 1C according to the first hypothesis, then we would expect Ann to match both Grace and Sally as full aunt/niece. However, we might expect Ann to share less DNA than average with either Grace or Sally in view of the outlier situation.
Below I’ve charted the DNA shared between Ann and everyone else in the family using a McGuire Chart created using the McGuire Method (see “GUEST POST: The McGuire Method – Simplified Visual DNA Comparisons“):
Here, Ann shares 1754 cM with Sally, 1602 cM with Earl, and 1459 cM with Grace (and 698 cM with Sheryl), which is solidly within the full aunt/niece/nephew range. Below, the sharing between Ann & Sally and Ann & Grace is plotted on the histogram for Aunt/Uncle/Niece/Nephew from the Shared cM Project (shown by the manually-drawn red arrows):
When we look at the histogram for Half Aunt/Uncle/Niece/Nephew, we see that 1459, 1602, and 1754 cM all fall outside the range:
If we pop these amounts of shared DNA into the probability tool at DNA Painter, we see the following, namely that while 1754 cM shared with Sally and 1602 cM shared with Earl show a 100% probability of full nibling (i.e., full aunt/niece and full aunt/nephew), the 1459 cM shared with Grace shows a less than 5% probability of being a half aunt/niece:
In addition to Aunt Ann, Sheryl has tested other relatives including another 1C of Sally, Grace, and Earl:
Here, the 1C is indeed a full niece of Ann at 1751 cM. However, for 1C relationships of 790, 789, and 773 cM, there is a small chance (less than 10% according to the DNA Painter probabilities) that they could be half 1C. Interestingly, Grace, Earl, and Sally all fall within a 17 cM range with the 1C, which suggests that it would be this 1C who is the half 1C, if they are half 1Cs rather than full 1Cs.
For me, it’s Grace’s brother Earl that most likely disproves hypothesis #2 (i.e., that Grace and Sally are another relationship other than 1C into which 482 cM more solidly fits, such as half 1C or 1C1R), and fails to disprove hypothesis #1. Let’s recap the important facts:
- Earl and Grace share 2,678 cM and thus are likely full siblings;
- Earl shares 1602 cM with his Aunt Ann; and
- Cousin Sally shares 1754 cM with Aunt Ann.
If either Sally or Grace were a half niece to Ann, we would expect only one of them to share in the half Aunt/Niece range. While Grace’s sharing with Aunt Ann of 1459 cM could conceivably be a half Aunt/Niece relationship (although it would be a very extreme outlier according to the Shared cM Project), Earl would also have to be a half Aunt/Nephew relationship because Earl and Grace are full siblings. However, we see that Earl and Ann are solidly in the full Aunt/Nephew range, and very far outside the possible range for half Aunt/Nephew.
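The elimination reasoning above can be sketched as a short Python check. Two caveats: the full and half aunt/uncle ranges below are not quoted in this post (they are my reading of the August 2017 Shared cM Project chart, so verify them against the PDF), and all function and variable names are hypothetical. The sketch only covers the half 1C variant of hypothesis #2.

```python
# Assumed 99th-percentile ranges in cM. These are NOT quoted in this post;
# they are my reading of the August 2017 Shared cM Project chart, so
# verify them against the PDF before relying on them.
FULL_AUNT = (1349, 2175)
HALF_AUNT = (500, 1446)

# cM shared with Aunt Ann, taken from the McGuire chart discussed above.
ann_shares = {"Sally": 1754, "Earl": 1602, "Grace": 1459}


def fits(shared_cm, rng):
    """Return True if shared_cm lies within the (low, high) range."""
    low, high = rng
    return low <= shared_cm <= high


def side_could_be_half(people):
    """A group of full siblings can only be half niblings of Ann
    if every one of them fits the half aunt/uncle range."""
    return all(fits(ann_shares[p], HALF_AUNT) for p in people)


# Grace and Earl are full siblings, so they are treated as one group.
grace_side_half = side_could_be_half(["Grace", "Earl"])
sally_side_half = side_could_be_half(["Sally"])

# Hypothesis #2 (half 1C) needs at least one side to be half niblings of Ann.
hypothesis_2_survives = grace_side_half or sally_side_half
# Hypothesis #1 (full 1C) needs everyone to fit the full aunt/uncle range.
hypothesis_1_survives = all(fits(cm, FULL_AUNT) for cm in ann_shares.values())

print(hypothesis_2_survives)  # False
print(hypothesis_1_survives)  # True
```

A hard range cutoff is cruder than the probabilities at DNA Painter, so a check like this is best read as a way to organize the evidence, not as a verdict.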
What do you think? Are there other (not crazy) hypotheses you would want to test?
What Does the Existence of Outliers Mean for the Shared cM Project?
Some are tempted to look at a result like 482 cM for a 1C, which is an outlier in the Shared cM Project, and declare the project to be flawed or incomplete. However, the Project utilizes many thousands of relationship submissions and statistical analysis to determine the best ranges. As more relationships are submitted the boundaries of the ranges are likely to change.
However, even with millions of submissions, there can still be outliers. Biology is a very random process (or we wouldn’t be here!), and thus there can always be outliers. Unfortunately, you see, statistics doesn’t care about the individual.
That’s why we, as genealogists, utilize statistics to generate testable hypotheses rather than incorrectly basing any conclusion on statistics alone.
This post is intended to provide helpful insight into how to approach possible outlier situations, including how to examine whether a result is actually an outlier or is an entirely different relationship. When faced with a possible outlier, it is important to formulate and then test several different competing hypotheses (in an attempt to disprove them!) with additional documentary research and DNA testing.
DNA evidence can be powerful, but only when it is used carefully and correctly.
Thus, if you test a first cousin that shares 500 cM with you and you point to this blog post to say that “it’s perfectly fine for 1Cs to share 500 cM,” you should go back and reread this post!