The genealogical community has a serious issue we need to talk about.
We are amassing one of the largest collections of genealogical information ever created, in the form of DNA match data. As of October 2018, approximately 20 million people have taken autosomal DNA (atDNA) tests, and that number continues to grow rapidly. DNA evidence is being added as an additional record to support existing genealogical conclusions, being used to generate new hypotheses, and helping break down decades-old brick walls.
However, since many genetic matches are either unwilling or unable to respond to communication or provide permission for public use of the genetic data, much of the massive database is potentially locked behind privacy walls such that the information can’t be utilized in scholarship and can’t be publicly shared. Indeed, Standard #8 of the Genetic Genealogy Standards (PDF) mandates the following:
As a result, we can’t name a living match in our writings (such as blog posts, social media posts, or articles) or presentations without written or oral permission from the match. This is potentially a devastating loss.
Is there anything a genealogist can do, or use from the genetic match information, if the match is non-responsive or refuses to grant permission for public use of the match information? Let’s look at some possibilities.
(This blog post is intentionally focused on AncestryDNA matches. Since AncestryDNA provides no segment data (only a total shared cM amount), its matches are potentially the most difficult to use in these types of genealogical conclusions.)
Using Semi-Anonymous Genetic Match Information
First, there is at least one maxim that is NOT up for debate in this conversation: we cannot provide the name or username of a LIVING genetic match in a public writing or presentation without the permission, written or oral, of the living genetic match. Period, non-negotiable.
Beyond that maxim, what can we do to utilize matching data such as that provided by AncestryDNA, where the match doesn’t respond or provide permission?
In the following figure I’ve provided a real-life example. The subject (Allen Mullin) of this study is a great-great-grandchild of Hamilton and Susanna (Steen) Colwell, and he shares DNA with three of his third cousins (Match #1, Match #2, and Match #3). The amount of DNA each shares with Allen Mullin is provided in the figure. However, the name of each match and the name of the parent are anonymized for privacy.
Should we anonymize just the parent (as in this example)? Or should we privatize the parent? Or more?
To play devil’s advocate, doesn’t anonymizing the parent still technically reveal the identity of the three matches, in contravention of the Genetic Genealogy Standards? After all, we can form a reasonable hypothesis about who they are.
Maybe not. A small technicality is that I’m not identifying the matches; instead, the reader must go out and intentionally identify or try to identify the name of the test taker.
However, the biggest mitigating factor is that the universe of possible matches is large, and the figure doesn’t identify which people in that universe of possible matches have undergone DNA testing. Indeed, the argument that we can hypothesize who they are may even weigh against the argument that we cannot use this information. In this example, Elizabeth Colwell has 10 grandchildren, Hamilton Henry Colwell has 12 grandchildren, and Jane Viola Colwell has 17 grandchildren. Thus, there are many people that Match #1, Match #2, and Match #3 could be, at least in this example. In other examples, there may be fewer possible people in the universe (potentially even just a single person).
Here’s another version, showing the initials for the parent:
Is that private enough?
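For readers who keep their match lists in a spreadsheet or script, the two privacy levels discussed above (parent hidden entirely, or parent shown only by initials) could be sketched roughly as follows. This is only an illustration, not a recommended standard: the `Match` record, the `anonymize` function, the `"Private"` placeholder, and every name and cM value in the sample data are hypothetical and do not come from the figures in this post.

```python
from dataclasses import dataclass


@dataclass
class Match:
    name: str         # real name of the living match (never published)
    parent: str       # real name of the match's parent
    shared_cm: float  # total shared cM reported by the testing company


def initials(full_name: str) -> str:
    """Reduce a full name to dotted initials, e.g. 'Ann Lee' -> 'A.L.'."""
    return ".".join(part[0].upper() for part in full_name.split()) + "."


def anonymize(matches, parent_mode="hidden"):
    """Return publishable rows: a numbered label, a masked parent, and shared cM.

    parent_mode="hidden"   replaces the parent with the placeholder 'Private'.
    parent_mode="initials" shows only the parent's initials.
    The match's own name is never copied into the output.
    """
    rows = []
    for number, match in enumerate(matches, start=1):
        if parent_mode == "hidden":
            parent = "Private"
        elif parent_mode == "initials":
            parent = initials(match.parent)
        else:
            raise ValueError(f"unknown parent_mode: {parent_mode!r}")
        rows.append(
            {
                "label": f"Match #{number}",
                "parent": parent,
                "shared_cm": match.shared_cm,
            }
        )
    return rows


# Made-up sample data for illustration only.
sample = [
    Match("Jane Example", "John Quincy Example", 52.0),
    Match("Sam Sample", "Ann Sample", 31.5),
]

for row in anonymize(sample, parent_mode="initials"):
    print(row)
```

Whether initials are "private enough" is still the judgment call raised above; the sketch only makes the choice explicit and repeatable.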
We Already Knew the Universe of Possible Matches
Indeed, to build on that last point, I already knew the universe of possible matches, and so did you, even before DNA testing! We can use traditional genealogical research to identify all the descendants of Hamilton and Susanna (Steen) Caldwell/Colwell using publicly available information. All we’ve added to this figure is the amount of DNA shared by several people we’ve already identified in the tree.
It may be different for cases where a family or relationship was not known prior to a conclusion that pieced the family or relationship back together. However, once an ancestral couple is identified, a genealogist can typically identify the descendants of that ancestral couple using traditional genealogical research.
Obviously, genealogical conclusions that involve a misattributed parentage event or other potentially emotional result will likely need to be handled differently, potentially with greater privacy. Additionally, the issue of using genetic data from matches that are responsive but refuse to provide permission is another important consideration.
One of the problems faced by a genealogist using matching information such as that provided in the figure above is that there is no replicability. Every genealogical conclusion should be replicable, meaning that the same conclusion should be reached if the research is repeated by another genealogist.
This is one of the important reasons why we cite our sources (in addition to, for example, providing information about a source such that we can evaluate its ability to serve as a source, the strength/quality of the source, and so on).
However, as a scientist, I know there are issues with the replicability of the genetic match information, since I do not – and should not – have access to the test taker’s account. For example, providing the name of the match (with permission) does not give me any greater ability to replicate the data, since I can’t access a test taker’s account as a third party. Suggesting that a researcher independently re-test a test taker as a method of replicability is not realistic.
Transferring the raw data of two matches to GEDmatch, so that GEDmatch profile information (“kit numbers”) can be shared with researchers for independent comparisons, is a good way to deal with replicability, but even then it does not guarantee that there is or will be replicability. Was the proper data transferred to GEDmatch? Will GEDmatch be available in the future? Can’t GEDmatch be gamed with data? Additionally, although I very much laud the inclusion of GEDmatch profile information in a genealogical writing to enable readers to replicate the research, this puts in place an even higher barrier and reduces the amount of usable genetic data by several orders of magnitude. True replicability likely requires access to the raw data of the two test takers, or even saliva samples from the two test takers, neither of which is possible.
Providing Supporting Documentation
One way to address the replicability issue is to provide information that essentially recreates the match information seen by the two matching individuals. This information could be provided in the genealogical writing or presentation, or included as supplementary information.
In the images below, for example, I’ve provided screenshots of the three genetic matches (Match #1, Match #2, and Match #3) from the image above with the names blurred for privacy, seen with respect to the subject of the research question (Allen Mullin):
As suggested by someone in the GGT&T Facebook group, the screenshots are probably better if they include the name of the test taker, shown by the “Member Matches for Allen Mullin” in this image:
The conclusion that we cannot use the genetic data of non-responsive matches is potentially too far-reaching. There may be ways we can utilize this DNA evidence in public writings and presentations to support our genealogical conclusions.
What are your thoughts?