There has been a great deal of conversation in the genetic genealogy community over the past couple of weeks about the use of “small” segments of matching DNA. Typically, the term “small” refers to segments of 5 cM and smaller, although some people include segments of 7 cM or even 10 cM and smaller in the definition.
The question, essentially, is whether small segments of DNA can be used as genealogical evidence, and if so, how they can be used.
While it may seem at first that all shared segments of DNA could constitute genealogical evidence, unfortunately some small segments are identical by state (IBS) rather than identical by descent (IBD), creating “false positive” matches for reasons other than recent ancestry. These segments sometimes match because of lack of phasing, phasing errors, or a variety of other reasons. One thing, however, is clear: there is no debate in the genetic genealogy community that many small segments are false positive matches. There IS debate, however, regarding the rate of false positive matches, and what that means for the use of small segments as genealogical evidence.
Small Segments and Me
Because small segments are prone to be false positives, I personally take a very cautious approach. For example, when I download my list of shared segments from Family Tree DNA, the very first thing I do with the spreadsheet is sort it by size and delete every segment smaller than 5 cM. For me, there is too much uncertainty surrounding small segments to base any conclusion on them.
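For anyone who would rather script that filtering step than do it by hand in a spreadsheet, here is a minimal Python sketch. The column names (including `cM`) and the sample rows are assumptions for illustration only; they are not the actual layout of a Family Tree DNA download, so adjust them to match your own file.

```python
import csv
import io

MIN_CM = 5.0  # the threshold I use; pick your own

def keep_large_segments(rows, min_cm=MIN_CM, cm_field="cM"):
    """Return rows whose segment length is at least min_cm, largest first."""
    kept = [row for row in rows if float(row[cm_field]) >= min_cm]
    return sorted(kept, key=lambda row: float(row[cm_field]), reverse=True)

# Hypothetical sample data standing in for a downloaded segment file.
sample_csv = """match,chromosome,start,end,cM
John Doe,1,1000000,9000000,8.25
John Doe,4,2000000,2900000,1.71
John Doe,7,5000000,6100000,3.3
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
large = keep_large_segments(rows)
for row in large:
    print(row["match"], row["chromosome"], row["cM"])
```

Only the 8.25 cM row survives the filter; the two small rows are dropped before any analysis begins.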
While this may not be the answer for everyone, it is the responsibility of the genealogist to place a VERY high burden on any argument that utilizes these small segments, if only because of the known high rate of false positives. As we’ll see, a “maybe” is not good enough for a genealogist using DNA evidence in a proof argument.
What is the Rate of False Positives?
Because large databases of genomic data are still relatively new, most IBD studies have been performed using simulated data. In a study published earlier this year using real genomes (including those of thousands of genetic genealogists), scientists from 23andMe examined IBD segments in a group of 25,000 individuals, which included 3,000 father-mother-child triads.
In the study, a segment of DNA shared by a child with someone in the database other than their parents was declared to be IBD if that same segment of DNA was shared with a parent (importantly, the IBD segments only had to overlap by 80%, not a full 100%).
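To make the 80% rule concrete, here is a rough Python sketch of how such a check could work. This is not the 23andMe code: the function names are my own, and I am assuming that the overlap is measured as a fraction of the child's segment, with both segments on the same chromosome and shared with the same third person.

```python
def overlap_fraction(child_seg, parent_seg):
    """Fraction of the child's (start, end) segment covered by the parent's segment."""
    c_start, c_end = child_seg
    p_start, p_end = parent_seg
    shared = min(c_end, p_end) - max(c_start, p_start)
    length = c_end - c_start
    if length <= 0 or shared <= 0:
        return 0.0  # empty child segment, or no overlap at all
    return shared / length

def looks_ibd(child_seg, parent_segs, min_overlap=0.8):
    """True if any parent segment covers at least min_overlap of the child's segment."""
    return any(overlap_fraction(child_seg, p) >= min_overlap for p in parent_segs)

# Hypothetical positions: the parent's segment covers 90% of the child's.
print(looks_ibd((100, 200), [(110, 220)]))  # True
# Here the parent's segment covers only half of the child's.
print(looks_ibd((100, 200), [(150, 300)]))  # False
```

A child's segment with no sufficiently overlapping segment in either parent would be counted as a false positive under this rule.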
The researchers found that more than 67% of all reported segments shorter than 4 cM were false-positive segments (see FIG. 2B, below). At least 60% of 4 cM segments were false-positive, and at least 33% of 5 cM segments were false-positive. The number of false-positives decreased fairly rapidly above 5 cM.
To my knowledge, this study is the largest of its kind to examine this question directly. The paper is available online for free (http://mbe.oxfordjournals.org/content/31/8/2212).
EDIT (4 December 2014) – Ann Turner has noted that the genomes used to generate the graph above were phased using the popular BEAGLE phasing program. The segment data we get from 23andMe and Family Tree DNA, on the other hand, is based on unphased genomes. This means that the false positive error rate for that segment data is almost certainly higher – and potentially MUCH higher – than what is reported in the chart above.
As support for this conclusion, John Walden has shared IBD v. IBS numbers from a private study (see http://www.isogg.org/wiki/Identical_by_descent). Now I provide these numbers only because so much of the community is familiar with them. However, because these numbers have never been peer-reviewed and the methodology has never been satisfactorily shared, I strongly recommend use of the 23andMe study above instead. That being said, here is what Walden found in his study, in which he examined whether segments found in children were (IBD) or were not (IBS) found in the child’s parents:
| cM | % IBD | % IBS |
According to Walden’s study, 95% of 5 cM segments were false-positive segments (not found in the parents). Again, I would take Walden’s numbers with a grain of salt; I feel much more confident in the large-scale 23andMe study.
So until another study is published, there can be little debate that the false positive rate for segments of 5 cM or smaller is likely somewhere between 33% and 95%.
If segments of 5 cM and smaller are mostly false positives, how can we use them in our research? Is there any way to salvage these smaller segments?
Perhaps the only way, currently, to use small segments is to compare the small segments with immediate family members (parents), and even this method is extremely limited. The theory here is that if I share a small segment with a cousin, and my mother shares the same segment with the same cousin, then I inherited that segment from my mother and it is most likely a real segment (rather than IBS). But note that this theory is not ironclad!
The table below is segment data from Family Tree DNA for me and my father, showing where we each share segments of DNA with a genetic cousin, John Doe. All of my small segments are highlighted in blue. Of course, my father and I each share one “large” segment of DNA with John Doe, an 8.25 cM segment on Chromosome 1. All other segments that I share with John Doe are 3.3 cM or smaller.
When I compare these small segments to my father’s results, I see that of my eight small segments, only one (12.5%) is shared by John Doe, me, AND my father. (My mother doesn’t match John Doe, although theoretically I could have inherited one or more of these segments from her.) The segment that John Doe, my father, and I share in common is very small, 1.71 cM, and I do wonder if that could be the result of a “pile-up” or some other false positive state.
This process does not, however, safely extend to more distant family members and cousins, for example if my parents were not able to test. After all, John Doe is a cousin, and yet I share many, many false positive segments with him. I’m not certain how far I would comfortably extend this method.
Out of curiosity, I wondered how many other people I shared some of these small segments with, and whether my father shared that segment with any of those matches.
So, I found the first 1.42 cM false positive segment in a sorted spreadsheet of all my matches:
The list goes on and on like this; in fact, I share that same exact small segment with 48 other people. My father shares the segment with 2 people (not counting me), neither of whom matches me. My mother shares the segment with 19 people, 8 of whom share the segment with me. Therefore, my parents share this segment with 21 people, and I share it with 48 people, only 8 of whom share the segment both with me and with a parent (my mother). There is clearly something very wrong with this segment, but I only learned that by conducting this intensive analysis (that I normally would not do!).
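The bookkeeping in that comparison is just set arithmetic, so it can be sketched in a few lines of Python. The match IDs below are made up and are only arranged to reproduce the counts I describe above; in practice they would come from your own match lists.

```python
def parent_backed(mine, father, mother):
    """Split my matches on a segment into those also seen in a parent and those not."""
    backed = mine & (father | mother)
    unbacked = mine - backed
    return backed, unbacked

# Hypothetical match IDs arranged to reproduce the counts in this post.
mine = {f"m{i}" for i in range(48)}      # 48 people share the segment with me
mother = {f"m{i}" for i in range(8)} | {f"mo{i}" for i in range(11)}  # 19, 8 shared with me
father = {"f0", "f1"}                    # 2 people, neither of whom matches me

backed, unbacked = parent_backed(mine, father, mother)
print(len(backed), len(unbacked))        # 8 40
print(f"{len(backed) / len(mine):.1%}")  # 16.7%
```

Seen this way, 40 of my 48 matches on this segment have no support from either parent, which is the pattern that made me suspicious of it.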
In contrast, look at the “real” small segment. The sharing is much more limited, and John Doe is the only person that both my father and I match at that segment.
I do feel a bit better about this particular segment, but even so I could not use it alone as evidence. For the reasons in the next section, I would be concerned about using even this “real” small segment, and would rely on the larger segment instead.
So why isn’t the parent method ironclad? Because sharing of a small segment between a parent and a child does not guarantee that the small segment is IBD. The first and foremost reason that these “confirmed” segments could still be false positives for matching is that a segment could be shared by many members of a population or region (i.e., ancestry outside the genealogically-relevant timeframe), rather than just by closely related cousins. If many members of a population share a segment in common, it’s very easy to confuse that for close relatedness if you’re only looking at a few members. So, as an extreme example, if 75% of people from Upstate New York all carry a 5 cM segment that conveys increased shivering ability (hey, it’s cold up here), then if I test two Upstate New Yorkers it’s going to look like they have recent ancestry with a 5 cM segment, when in fact their common ancestor lived 500 years ago.
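To put a number on that extreme example, here is a tiny Python sketch. It rests on a big simplifying assumption of my own: that the two testers carry the segment independently of each other, each with the made-up 75% chance. It also ignores details such as each person having two copies of every chromosome.

```python
def chance_both_carry(carrier_rate):
    """Probability that two independently chosen people both carry the segment."""
    return carrier_rate ** 2

# The made-up 75% carrier rate from the Upstate New York example.
print(f"{chance_both_carry(0.75):.4f}")  # 0.5625
```

Under those assumptions, more than half of all random pairs of Upstate New Yorkers would appear to “match” on this segment, with no recent common ancestor required.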
The Larger Segment Theory
I have had people tell me that if they share a large segment of DNA with a group of triangulated cousins, and they also share a small segment with the same group of triangulated cousins, then the small segment must be real. Taken to an extreme, it has been argued that people who match those “confirmed” small segments (whether or not they match the large segment) can triangulate to the same family group. Charted out, it might look something like this:
However, there are several real concerns with this, even if there were two or more of the small segments:
- New Cousin could be related through an entirely different line of the family, especially if this New Cousin comes from a similar time and place;
- The small segment sharing with New Cousin could be a false positive. Although more small segments (and/or more matching cousins) may decrease this likelihood, even 2 or 3 small shared segments like this could conceivably be pure chance;
- The segment could be from a “pile-up”, meaning that the small segment is widespread throughout a particular population and is not indicative of recent ancestry;
- And there may be more; I’d be interested to have others chime in here.
Thus, this use of small segments is especially problematic.
The hypothesis above could only work with larger segments, for example if New Cousin and James Johnson shared a large segment. But if that were the case, your argument would be based entirely on large segments and could avoid the problems associated with small segments.
Using Small Segments as Evidence?
So can small segments ever be utilized as evidence in a genealogical proof?
Well, whenever a proof argument uses a small segment as evidence – EVEN IF THE SMALL SEGMENT IS IN CONJUNCTION WITH LARGER SEGMENTS – scrutinize the information VERY closely and ask at least the following questions:
- Is the small segment shared by parents of BOTH matches? If there is no parent to test, or the answer is no, then you should probably discard the small segment. And even if it is shared by the parent, question whether it could be shared for another reason. Further, keep in mind that even if it is a “real” segment that I share with my parent, that is not a guarantee that it is a “real” segment in my match; maybe it is IBS for them, not shared with their parent!
- How many of your other matches share the small segment? This is less clear, but the more people that share the small segment in your match list, the more concerned I would be, particularly if there is no known relationship.
There is no doubt that many small segments are indeed real, but the genetic genealogist’s ability to distinguish between a “real” small segment and a “false” small segment is extremely limited and suspect. Ultimately I think it’s possible to identify segments that are probably real, but currently it is nearly impossible to use them in a meaningful way to support a genealogical hypothesis.
I’m very interested to hear your thoughts. How can we find ways to utilize these small segments? Are they hopeless, or is there a light at the end of the tunnel?