Last week I published “Small Matching Segments – Friend or Foe?” to join in the community’s conversation about the use of “small” segments of DNA, referring to segments 5 cM and smaller (although keep in mind that the term “small,” without a more specific definition, will mean different things to different people).
The question that the community has been struggling with is whether small segments of DNA can be used as genealogical evidence, and if so, how they can be used.
As I wrote in my post, a significant percentage of small segments are false positives, with the number at least 33% and likely much higher. In my examination and in the Durand paper I discuss, a false positive is defined as a small segment that is not shared between a child and at least one of the parents.
However, as CeCe Moore points out in her post “The Folly of Using Small Segments as Proof in Genealogical Research, Part One,” this is an artificial and misleading definition, as even “real” segments (those shared by a child and at least one of the parents) are often false positives because they are not found in three consecutive generations. The percentage of these “real” segments that turn out to be false positives is unclear, but CeCe’s post makes it clear that it is unexpectedly high.
To continue the conversation, I’ve performed some additional analysis using data from myself and my parents. In this post I’ll address a couple of hypotheses the community has discussed over the past week as ways to possibly use small segments in a genealogically meaningful manner.
- Hypothesis #1 – A Small Segment with More SNPs is More Likely to be IBD
It has been hypothesized that the more SNPs a small segment has, the more likely the segment is to be a real, IBD segment. This seems like a very logical and reasonable hypothesis, and I’m not immediately aware of any study that examines the question (if you know of one, please let me know in the comments!). So, to date, positions on one side or the other have been anecdotal.
To examine this hypothesis, I identified all small segments between 4 and 5 cM that I shared with my genetic matches at Family Tree DNA. For each segment, I determined whether it was “real” (shared with a parent) or “false” (not shared with a parent). Now as we’ve learned from CeCe Moore’s recent post (see “The Folly of Using Small Segments as Proof in Genealogical Research, Part One”), a significant percentage of these “real” segments shared by a child and parent are not shared with a grandparent, meaning that they are actually “false” segments. So even many of the “real” segments identified in these experiments are not IBD segments.
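The “real” versus “false” test described above can be sketched in code. This is only a minimal illustration, not the procedure actually used for this post. It assumes each shared segment has been exported as a record with a match name, chromosome, start and end positions, centiMorgans, and SNP count, and it assumes a child’s segment counts as “real” when the same match shares an overlapping segment on the same chromosome with a parent. The `Segment` class, the function names, and the overlap rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One shared segment with a genetic match (hypothetical record layout)."""
    match_name: str
    chromosome: str
    start: int    # start position in base pairs
    end: int      # end position in base pairs
    cm: float     # segment length in centiMorgans
    snps: int     # approximate number of matching SNPs


def overlaps(a, b):
    """True if two segments sit on the same chromosome and share any positions."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def is_real(child_seg, parent_segs):
    """Assumed rule: 'real' if the same match overlaps this region in a parent."""
    return any(
        p.match_name == child_seg.match_name and overlaps(child_seg, p)
        for p in parent_segs
    )


def classify(child_segs, parent_segs, min_cm=4.0, max_cm=5.0):
    """Split the child's segments in [min_cm, max_cm) into real and false lists."""
    real, false = [], []
    for seg in child_segs:
        if min_cm <= seg.cm < max_cm:
            (real if is_real(seg, parent_segs) else false).append(seg)
    return real, false
```

Passing the combined segment lists of both parents as `parent_segs` matches the “shared with at least one of the parents” definition. As noted above, a segment that passes this test can still fail at the grandparent level, so “real” here is an upper bound on IBD.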
Of my 6,514 shared segments of 1 cM or larger, a total of 128 of these segments (1.96%) were equal to or larger than 4 cM and less than 5 cM. Of those 128 segments, about 20% were determined to be “real” (shared with my parent) and 80% were “false” (not shared with my parent):
Family Tree DNA also gives an approximate number of matching SNPs for each segment. Matching segments of the same centiMorgan distance will have a variable number of SNPs depending on a variety of factors. To examine a possible relationship between the number of SNPs and the “real” versus “false” nature of a segment, I compared the number of SNPs for each of the two types of segments.
For my 102 “false” segments between 4 and 5 cM, the average number of tested SNPs was 833, while the average number of tested SNPs for my 26 “real” segments was 903. Although this may at first appear to be an important distinction, based on the data I am still unable to point to a small segment and determine from its number of tested SNPs whether the segment is likely to be real or false. The ranges of SNPs tested in the two types of segments were largely the same, as shown by the max and min.
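The point about averages versus ranges can be illustrated with a short sketch. The function names are hypothetical, and the numbers in the example are placeholders for illustration only, not the data from this post. The idea is that two groups can have different means while their SNP-count ranges overlap so heavily that no single cutoff separates them.

```python
from statistics import mean


def summarize(snp_counts):
    """Return the mean, minimum, and maximum SNP count for one group."""
    return {"mean": mean(snp_counts), "min": min(snp_counts), "max": max(snp_counts)}


def ranges_overlap(a, b):
    """True if the min-max ranges of two summaries share any values."""
    return a["min"] <= b["max"] and b["min"] <= a["max"]


def separable_by_threshold(real_counts, false_counts):
    """True only if one SNP-count cutoff would split the groups perfectly."""
    return (min(real_counts) > max(false_counts)
            or max(real_counts) < min(false_counts))


# Placeholder values, NOT the author's data.
real_counts = [600, 900, 1200]
false_counts = [500, 800, 1300]

print(summarize(real_counts))
print(summarize(false_counts))
print(ranges_overlap(summarize(real_counts), summarize(false_counts)))
print(separable_by_threshold(real_counts, false_counts))
```

When the ranges overlap like this, a higher mean for one group says little about any individual segment.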
Thus, based on this experiment, I know that I cannot determine the “real” versus “false” status of a small 4-5 cM segment of DNA based on the number of SNPs. Indeed, based on my data, I cannot even safely say that a 4-5 cM segment with as many as 3,300 SNPs (more than 3x the average) is more likely to be real.
- Hypothesis #2 – Small Segments Shared with Closer Relatives are More Likely to be Real
It has also been hypothesized that a small segment shared with a close or confirmed relative is more likely to be real than a small segment shared with a distant relative. In other words, if I share a large, confirmed segment with a cousin, then it is hypothesized that the small segments I share with that cousin are more likely to be real. Again, I’m not immediately aware of any study that has examined this question directly (again, if you know of one, please let me know in the comments!).
To examine this hypothesis, I took the first 25 genetic matches in each of three categories that: (1) were shared in common with me and one of my parents; and (2) shared a longest segment of either under 10 cM, 10-15 cM, or 15-30 cM. With only 25 genetic matches per category this isn’t a very complete dataset, but it is at least a peek at what the numbers will look like. Indeed, finding the first 25 matches within the 15-30 cM category took me through the first third of my matching segment list.
In each of the three categories (under 10 cMs, 10-15 cMs, and 15 to 30 cMs) I counted all the small segments under 5 cM, and determined whether they were “real” or “true” (shared with the parent) or “false” (not shared with the parent).
As shown in the next two charts, the number of small segments per match decreased in each category from a high of 9.25 average small segments in the under 10 cM category to a low of 7.85 average small segments in the 15-30 cM category.
Similarly, the average number of “real” (called “true”) small segments per match increased (and the number of “false” small segments decreased) slightly in each category, although the percentages were largely the same between the under 10 cM category and the 10-15 cM category. The percentage of “true” small segments increased noticeably from about 25% when the largest shared segment was 15 cM or less, to about 31% when the largest shared segment was 15-30 cM. So there does appear to be some relationship between the closeness of the relationship and the increased likelihood of a small segment being “real” (“true”), but it is extremely tenuous and should not be relied upon as evidence. Indeed, the increase seen below was only about 6 percentage points.
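For readers who want to repeat this comparison with their own matches, here is a rough sketch of the per-category calculation. It is an illustration under assumptions, not the exact method used here: each match is assumed to be summarized as a tuple of (longest shared segment in cM, count of “true” small segments, count of “false” small segments), the bin edges at exactly 10, 15, and 30 cM are my guess at how to handle boundary values, and the function names are hypothetical.

```python
def longest_segment_bin(longest_cm):
    """Assign a match to a category by its longest shared segment (assumed edges)."""
    if longest_cm < 10:
        return "under 10 cM"
    if longest_cm < 15:
        return "10-15 cM"
    if longest_cm <= 30:
        return "15-30 cM"
    return None  # closer relatives fall outside the three categories


def true_share_by_bin(matches):
    """Return the fraction of small segments that are 'true' in each category.

    matches: iterable of (longest_cm, n_true_small, n_false_small) tuples.
    """
    totals = {}
    for longest_cm, n_true, n_false in matches:
        label = longest_segment_bin(longest_cm)
        if label is None:
            continue
        t, f = totals.get(label, (0, 0))
        totals[label] = (t + n_true, f + n_false)
    # Skip any category with no small segments to avoid dividing by zero.
    return {label: t / (t + f) for label, (t, f) in totals.items() if t + f > 0}
```

The difference between two of the resulting fractions is a change in percentage points. Going from about 25% to about 31% is a gain of about 6 points, which still leaves most small segments in every category “false.”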
Based on this data, I am completely unable to predict the “real” or “false” nature of a small segment under 5 cM based on how closely related I am to the genetic match.
Based on the above I predict that the percentage of “real” or “true” segments will increase as closer relatives are used (above 30 cM), but I also predict that it will never increase to a number that allows me to safely identify a small segment as “real” or “false” based on the degree of the relationship with the genetic match.
I am fully aware that these are not perfect analyses, but they’re a beginning. They bring further awareness to the issues associated with small segments, and provide readers with some additional ways to analyze small segments.
- A Great Point from a Commenter
I received an excellent comment (HERE) from Ann Turner, M.D., co-author of “Trace Your Roots With DNA,” on my post “Small Matching Segments – Friend or Foe?” last week. In the comment, she wisely noted that – counterintuitively – the number of segments shared with relatives remains largely unchanged regardless of the distance of the relationship.
…For me, the average number of segments in the different relationship bins doesn’t change much at all:
3rd to 5th: 11.9 segments
4th to distant: 11.5 segments
5th to distant: 11.4 segments
To me, that’s an indication that it’s too easy to construct short segments just by coincidence….
It’s well worth reading the full comment (HERE).
So once again, the data I’ve seen leads me to conclude that the genetic genealogist’s ability to distinguish between a “real” small segment and a “false” small segment is extremely limited. Ultimately I think it will be possible to identify segments that are probably real, but currently it is nearly impossible to use them in a meaningful way to support a genealogical hypothesis.