The Shared cM Project (ScP) is a collaborative data collection and analysis project created to understand the ranges of shared cM associated with various known relationships. The ScP has been very successful, with more than 60,000 submissions from amazing genealogists like YOU! To add your data, the Submission Portal is HERE. I am always collecting data, and hopefully the next update will have more than 100,000 submissions!
The full PDF for Version 4.0 of the Shared cM Project is here and it is ESSENTIAL that you read the full PDF for all the details from the project: The Shared cM Project Version 4.0 (March 2020).
Today, the most recent version of the ScP, Version 4.0, goes live. I’ve taken nearly 60,000 submissions and analyzed the data for almost 50 different relationships. For each relationship the 100s or 1000s of submissions were analyzed to remove outliers, to provide minimum, maximum, average, and standard deviation values, and to generate a histogram for the distribution of the submissions. Here are some of the other differences between this new Version 4.0 and the previous version (click to enlarge):
Note that there ARE going to be some big changes to the minimum, maximum, and average values for some relationships compared to the previous version, as more data equals better data. From page 54 of the full PDF:
There are many changes to the minimum, average, and maximum values for relationships in Version 4.0 of the Shared cM Project relative to the prior Version 3.0. As the number of submissions for a relationship grows, the distribution of cM values for that relationship is more clearly defined. This allows for improved definition and elimination of outliers for each relationship. In some cases, the very large increase in submissions moved the minimum and/or maximum values further outward for a broader distribution in this version, and in other cases it moved the minimum and/or maximum values inward for a tighter distribution in this version.
There’s a table on pages 55 and 56 that show the percentage change for the minimum, maximum, and average values for the different relationships.
There is of course a brand new Relationship Chart with all of the new ranges and averages (click to enlarge):
Be sure to look for the gray bar at the top of the graphic that says “Version 4.0 (March 2020).” That’s how you’ll know you’re using the most recent version of the chart.
There are many more graphs and charts in the full PDF, so be sure to check it out!
The Interactive Shared cM Project at DNA Painter
In addition to the full report and the updated Relationship Chart, there’s another incredibly valuable aspect of this update. Since September 2017, Jonny Perl has hosted a web-based version of the ScP at DNA Painter. It allows you to put in a cM value and see which relationships that cM value falls into.
Jonny has graciously updated the tool with the new numbers, AND has added new features! If you click a relationship box, the histogram for that relationship will show in a pop-up! The histograms are now available for every relationship in the Relationship Chart, without having to refer back to the full ScP report.
Jonny has a blog post about the update with much more information at “Introducing the updated shared cM tool.”
It looks amazing and contains an incredible amount of information right at your fingertips. Thank you Jonny!
Obviously, this project couldn’t happen without YOU! Every person that has ever submitted even a single relationship has helped create this tool for the benefit of the entire community. And I am will continue to collect data indefinitely, so feel free to visit the Submission Portal HERE. Again, a huge thank you to each and every one of you.
Please also be sure to check out page 4 of the full ScP report for some special thank yous.
I hope you enjoy the update! Thank you!
Great job. You are a rockstar! Thanks for all the hard work which you do.
Thank you for the kind words Paul!
In the midst of the Coronapocalypse, Blaine sends us a huge and eagerly awaited present! Thanks, Doc! I’ve already had a little back-of-the-napkin fun with the standard deviations. I consider adding that info a very big plus. If you want me to table out some actual confidence interval runs from the raw data, just let me know. I contributed to version 3.0, and only 16 of the 32,999 new datapoints in version 4.0. But if we all submit information about the absolutely-for-certain known relationships we’ve found, we could add 60,000 new bits of data for version 5.0…and keep Blaine REALLY busy! 🙂
Haha! I’m planning to take at least a short break, but if I reach 100K tomorrow, I’ll have to do what I have to do! Thank you for the submissions!
Cheers — and a huge thank you!! This is awesome, and I’m eager to dive into the details. I’ve already submitted over 200 data points (yes, I keep my own spreadsheet so I don’t double-submit) and I’ve got dozens more to contribute to the eventual ver 5.0 For now, thank you so much, Blaine, for all the work you’ve put into this (and to all your assistants).
Thank YOU for the submissions!!
This is great, Blaine. Thanks to you and Jonny for your work on this. I have a question though. You have been collecting information on shared cM for relationships involving endogamy and pedigree collapse and I’m wondering if an updated chart will contain that data.
I have been collecting that data, but I don’t yet have enough data to compile. I envision it being a separate chart, but not quite sure yet.
The “The Pedigree Collapse, Double/Multiple Cousin, and ROH Shared cM Project” is here: https://forms.gle/9U8SVsYQXLsoVwLo6
Hi! Any update on this project? I just submitted the Google form for my great-uncle who is the product of a first-cousin marriage. Would love to know how to use the current relationship chart in the meantime, if anyone can direct me to a resource? Thanks!
This is fantastic, Blaine! Thanks for your intensive work and for making this version bigger and better than last time. And for the collaboration with Jonny Perl to bring it alive on the web.
One question – wondering about the slight differences in Parent and Child as well as Aunt/Uncle-Niece/Nephew. Shouldn’t those relationships be based on the same underlying data and be exactly the same as Great Aunt/Uncle-Great Niece-Nephew are?
I’ve fixed the Parent/Child but the graphic hasn’t updated yet. I will probably just leave the Aunt/Uncle/Niece/Nephew since it’s only off by 1 cM (updating everything means there are multiple versions of the chart floating around, and I don’t like that. Thank you for the feedback!
fabulous and love the histograms. The transparency by giving us the distribution charts shows the issue with sample size is small. Most are pretty smooth….well done
Thank you! A nice bell curve is a beautiful thing!
This is the most valuable tool i use in my genetic genealogy. As a medical scientist I have a few minor comments. I cannot see (may be I miss it) some definitions. You mention the mean and expected value. Normally I would say that the arithmetic mean is the expected value. Do You mean the median or modus? You present the standard deviation. However since most of the distributions are skewed and not normal distributed, the standard deviation is not the best parameter. Much better is to use the 95% Confidence Interval (Not calculated as 1.96 * SD but from the actual values). Continue with the good work, I am so thankful and so is my genealogy colleagues.
Thank you for the kind words! I used confidence intervals in the previous version, but they turned out to be too confusing for people so I simplified it in this version. I don’t use the SD, and I’m doubtful that it adds anything useful to the project, but people asked for it so I provided it.
The expected value is what would happen if every segment of DNA got exactly 50% smaller in each generation. So you would share exactly 12.5% with a first cousin, then exactly 6.25% with a 1C1R, and so on. That’s how we used to examine relationship possibilities before the Shared cM Project.
Thank you for all the work on this. Yes, similar to the prior comment, I was wondering if you have tried to determine the probability distribution that describes the histogram curve for different relationship pairs. For example, normal distribution, log normal distribution, Poisson distribution, etc. That could help understand the likelihood of relationships that are within range but may be very very rare (near the bottom or top of the range).
Thank you, Blaine, for your tireless work to make tools understandable and accessible. You are a true leader in the genetic genealogy community!
Dear Blaine, I would like to submit shared cM data for my family, and my husband’s family, but the form provided for submission of this data does not appear to be well-suited for my situation. Would you accept a submission in the form of a table, if it contains the information sought? I am one of six children of my parents, and we all tested on FTDNA and so have detailed segment data. Also, we have a half sister, and numerous cousins have tested, some of them sponsored by me. If you would accept a table containing information on my family and our cousins, and a separate table for the family of my husband, I could email the table to you. Thank you.
Is the full sibling relationship number correct? Might it be 3613 not 2613? 2613 seems a bit low, especially when parents and children share around 3500-3600.
Alec, remember Blaine has arrived at his numbers from actual values of about 60,000 matches that have been provided to him. For instance, the results for the matches each of my two sisters and me are: J) 2,659 cM across 60 segments; T) 2,472 cM across 59 segments.
While I agree that the company breakdown was infrequently used, I tried to always point people with questions about company differences to Table 3. It was the only source for such information. I was looking forward to the 2020 update using more samples, so am disappointed in the company breakdown not being included in the 2020 version. Guess I will still use the 2017 version for such questions, especially in the < 80 or so cM region where it starts to matter.
Great job Blaine! In this version the outliers were removed using the 99th percentile method just like in v3? The v4 PDF does not mention it, I wonder if this version includes or excludes the entries over the 99th percentile.
As I work 2nd through 4th cousin trees via Ancestry, your chart has been invaluable. I would gladly contribute some $ if needed anywhere, even for a well deserved bottle of whiskey!
It’s great to have this update. My big disappointment is the demise of Table 3. I have referred dozens of people that had questions about differences between companies to that often overlooked reference. Many people don’t understand the processing nuances for each company and don’t understand why the same pair of testers don’t have the same cMs at different companies. I forget if company was requested on this collection and if it is just a matter of generating Table 3 from existing data, or if the testing company was not asked for. Anyway, nice to have the update. Kudos.
I remember when we were forced to write an essay in college about the DNA structure. It was very difficult, but it was possible to implement it. There’s nothing super complicated about it. Fortunately, I was able to find people who were able to provide me with professional help on this issue. It was the only way to help me do this difficult work. They provided me with examples and so on. My supervisor was very surprised at this level of work. I recommend that you also think about looking at https://pickthewriter.com/ to see what companies provide such services. There are professionals here who know what they should do and how. So make sure you take a look and get acquainted with them. I think it will be useful for anyone who writes the same work now. It’s easy to make a mistake here, because a professional approach is just necessary. Besides, it will help you to enrich your knowledge with information from different sources.
I’m pretty confused at the moment as I’ve a 3C1R who recently took a DNA test and we share 230.1cM
This is very obviously great information and thank you so much for your efforts…one curveball occurs however when a situation like mine comes up where a couple of first cousins (and their kids and grandkids) are children of my Mom’s identical twin…the chart numbers don’t apply directly…
I did a dna test on my brother and received results from family tree dna and we matched at 2035cm. From the chart does this not mean that we could be half or full sibling as we fall into both categories in terms of min and max ranges? Maybe even 3/4 siblings.
I have a connection in ancestry dna where I have 1827 cms shared over 32 segments. Is this person more likely my Half sister or Niece.
Doesn’t your data suffer from selection bias, as it does not take into account all those matches for which there is no known relationship? E. G., I have many matches between 20 and 70 cM which I do not know the relationship for but I am almost certain it is more distant than 4th cousin.
Blaine; Thank you very much for gathering this data, performing the analysis, and publishing it in an accessible form. I’ve been using it to try to sort out who my GG Grandfather is and it has been very helpful. I think I’ve finally confirmed the person.
I’ve been working primarily with 3C, 3C1R, and 1/2 3C1R relationships and have noticed an issue when using the Average value you show for these distant relationships. I have a dataset that is comparing 7 descendants from my G grandfather to 6 descendants of the acknowledged children of my presumed GG grandfather. I have access to several cousins DNA results on Ancestry and have created a cross matrix of the relationships between the two families and the shared cM. Because most of the relationships are 3C or more distant, several of them share 0 cM. I’m interpreting your note on the shared cM chart to mean that you are not including 0 values in the calculation of your average numbers. IMOP, this could skew the average of your analysis to the high side. Is this correct? Would it be possible to publish a parallel analysis that shows the analysis results when you include the data in the 0 bin?
Also, for the analysis I am doing, it would be helpful to know the odds of getting 0 shared cM for each provider for a known given relationship. It looks like you have a great data set for doing that analysis. If its been done and I haven’t seen it, please send me a link.
Do you have calculated mean and standard deviation for each relationship level?
Rob Flanagan Stieglitz
I am specifically looking for a missing great grandfather and thus am keen to know if DNA Painter perhaps has a special chart just for the HALF relationships. I would like to know the range for HALF 4c and HALF 5C as well as the various removed. In essence i’m wondering if there is a chart with two extra columns on the left hand of the chart.
This would help me in narrowing down the possibilities a lot
I have the same issue as Dave Bennett. I would like those extra columns on the left side of the chart for half 3C and half 4C. Help!
Thank you for your project it looks amazing. The only concern I have is about DNA shared between parent and child.
Could you please explain how it’s possible to have somewhat different from 50% of DNA shared there? Ofc we have mutations including different chromosome abberations, though neither of them can bring such huge deviation from expected.
I recently discovered the Leeds Method for sorting out DNA matches. I have a 1C1R who shares 343cM with me (He is my maternal grandfather’s sister’s son). Then, comes some difficulty. My grandma was the oldest of 3, and my grandpa was in the middle of 10. After they married in 1925, one of my grandfather’s brothers married my grandma’s sister and had a child- a double first cousin of my mother and her siblings. Then, my grandma’s only brother married my grandpa’s youngest sister and they had 2 children. It is to their offspring that I’m having difficulty- I’m looking at the shared matches and I don’t recognize any of them. I don’t think or believe that all of those shared matches would connect to both maternal grandparents. How would I be able to sort it out?
Have you considered updating your Shared CM chart to reflect the amount of DNA data reported by FTDNA since they have reduced reporting match data below 8 cms? On one of my critical matches, the amount of shared DNA data was reduced by almost 50%.
Thanks for all of your great work.
I have some data to enter into a future version of the Shared cM Project. Speaking of DNA matches, I have access to a sister’s 23 and me data, my brother’s Ancestry DNA data, and my own Family Finder family tree DNA results that I have also downloaded to My Heritage. Also, I have identified on paper over 20,000 relatives including in-laws on ancestry.com.
Using those DNA data and my ancestry.com family tree, I have identified 270 people, the closest, a nephew matching 27.66% DNA of my sister’s DNA to a 6th cousin once removed with 59 centimorgans matching my brother. Please tell me how I can start sending you the data.