Starting today, customers of Ancestry.com’s DNA service can download their raw data. As predicted by yours truly (over the objections of the DNA police), it’s build 37 and phased! What does this mean for you?
First, go to ‘Your DNA Homepage’ on the DNA drop-down at Ancestry.com. Once there, click ‘Manage Test Settings’ to the right of the orange ‘View Results’ button. Then click ‘Get Started’ on the right-hand side (see the arrow in the picture above for where to find it). A pop-up will appear that asks for your password. Once you enter it, Ancestry automatically sends a verification email to your registered email account; you can update your email address in your Ancestry.com settings if you need to. Click the link in that email that says ‘Confirm Data Download’ (it took about 15 minutes to get mine; I imagine they are swamped today, as it’s the first day). Whoosh, off you go to Ancestry, where a big green button on the upper left side of your screen says ‘Download Raw DNA Data’. Push that, and a pop-up appears asking how you want to open the file. Personally, I clicked ‘save’, moved the file to its own folder and then unzipped it. I had to open it in Notepad, but I might be able to move it to Excel.
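For anyone who would rather not fight with Notepad, here is a minimal Python sketch that turns the unzipped text into comma-separated rows that Excel can open. It assumes the data rows are whitespace-separated and that any explanatory lines at the top of the file start with '#'; both are assumptions about the file layout, and the function name raw_text_to_csv is made up for this illustration.

```python
import csv
import io


def raw_text_to_csv(raw_text):
    """Convert whitespace-separated raw-data text into CSV text.

    Assumption: lines starting with '#' are comments and are skipped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        writer.writerow(line.split())
    return out.getvalue()


# Two rows in the same shape as the excerpt shown later in this post.
print(raw_text_to_csv("# a comment line\n1 887162 T T\n1 888659 C C\n"))
```

To use it on a real download, read the unzipped file into a string, pass it to the function, and write the result to a file ending in .csv.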
Build 37 is the latest major build, but it’s not the latest release. Currently, the NCBI is on build 37.p11. Still, this is the biggest release. Build 36, which only 23andMe remains on, has more gaps in the genome than build 37 does. Build 37 has only about 250 gaps in it, and presumably build 38 will have fewer, but each build corrects fewer and fewer mistakes. The initial human genome release had over 150 thousand gaps, which scientists have since filled in using various methods. Build GRCh38 won’t be released until the summer of 2013.
According to the NCBI:
Build 37, also known as GRCh37, includes updates for all human chromosomes, closes 25 sequence gaps, corrects over 150 problems in build 36, and adds nine alternate loci.
So it’s more than just unknown information being filled in. There were five to ten problems per chromosome in the ordering of the DNA in build 36. Clearly, if the SNPs aren’t in the order scientists previously thought they were, moving the segments around is likely to change the lengths of the segments you share with your cousins. (Although that’s not the only reason things changed at FTDNA… it seems clear they were cleaning up other shenanigans as well.) So build 37 has a lot of SNPs in different locations than previously believed.
Third Party Sites
Some have questioned why this is troublesome to third-party sites. One reason is that, generally, they aren’t set up to read the RS number (the SNP’s name). Instead, they read straight down the string of bases without looking at the location or the RS number. So where you see this in your Ancestry Raw DNA:
1 887162 T T
1 888659 C C
1 891945 G G
1 893981 C C
1 894573 A A
1 903104 C C
1 904165 G G
1 910935 G G
1 918384 T T
1 918573 G G
1 924898 A A
1 927309 C C
1 928836 T T
Some third-party sites only see this:
Well, that selection was pretty boringly homozygous but you get the idea.
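To make the idea concrete, here is a small Python sketch of that "bases only" view. It takes the rows from the excerpt above, throws away the chromosome and position, and keeps just the two allele letters from each row. The function name bases_only is invented for this illustration, and this is only my guess at how such a site might read the file, not any particular site's actual code.

```python
# Rows copied from the excerpt above: chromosome, position, allele1, allele2.
SAMPLE = """\
1 887162 T T
1 888659 C C
1 891945 G G
1 893981 C C
1 894573 A A
1 903104 C C
1 904165 G G
1 910935 G G
1 918384 T T
1 918573 G G
1 924898 A A
1 927309 C C
1 928836 T T
"""


def bases_only(raw_text):
    """Return the allele pairs in file order, ignoring every other column."""
    pairs = []
    for line in raw_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or line.lstrip().startswith("#"):
            continue  # skip blank lines and assumed comment lines
        allele1, allele2 = fields[-2], fields[-1]
        if len(allele1) != 1 or len(allele2) != 1:
            continue  # skip anything that is not a single-letter call
        pairs.append(allele1 + allele2)
    return " ".join(pairs)


print(bases_only(SAMPLE))
# prints: TT CC GG CC AA CC GG GG TT GG AA CC TT
```

Because nothing in that output says where each pair sits, a reader like this depends entirely on the rows arriving in the order it expects, which is exactly what a new build can change.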
Chromosomes 23, 24 and 25
OK, so the first 22 chromosomes we know are the autosomal ones. After checking some RS numbers, it’s clear that the 23rd chromosome is the X. So what are the 24th and 25th? The 24th is the Y – mine is almost completely blank. Come to think of it, why isn’t mine completely blank? And that pesky 25? It appears to be X information as well. Not duplicates, but additional SNPs on the X that were tested. Rather odd, but perhaps there’s a reason? [Editor's Note: Ann Turner caught that chromosome 25 consists of SNPs that exist on both the Y and the X! Good eye, Ann!]
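If you want to check your own file, a rough Python sketch like the one below maps the numeric codes to the labels worked out above and counts how many SNPs actually have a call on each chromosome. Treating "0" as the marker for a blank (no-call) result is an assumption on my part, so it is a parameter you can change. The example rows for chromosomes 23 to 25 are made up purely to show the function working.

```python
from collections import Counter

# Labels as worked out in this post; 25 holds SNPs found on both the X and the Y.
CHROMOSOME_LABELS = {"23": "X", "24": "Y", "25": "X and Y (shared)"}


def label_chromosome(code):
    """Translate a chromosome code from the file into a readable label."""
    return CHROMOSOME_LABELS.get(code, code)


def count_calls_by_chromosome(rows, no_call="0"):
    """Count rows with a real call on both alleles, grouped by chromosome label.

    rows: iterable of (chromosome, position, allele1, allele2) strings.
    no_call: assumed placeholder for a blank result.
    """
    counts = Counter()
    for chromosome, _position, allele1, allele2 in rows:
        if allele1 != no_call and allele2 != no_call:
            counts[label_chromosome(chromosome)] += 1
    return counts


# Hypothetical rows, for illustration only.
example_rows = [
    ("1", "887162", "T", "T"),
    ("23", "100", "A", "G"),
    ("24", "200", "0", "0"),
    ("25", "300", "C", "C"),
]
for label, count in count_calls_by_chromosome(example_rows).items():
    print(label, count)
```

A woman running this on her own data would expect the Y count to be at or near zero, which is a quick way to see how blank "almost completely blank" really is.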
Notice in the picture at the top of the post that the columns are aligned by alleles. This is very exciting because it would seem to indicate that the data was phased before release, meaning one allele came from your mom and one from your dad. It would be dynamite if we could use this information to ‘phase’ our results from other companies. Actually, we can do it manually, but it would take a lot of time. I’m sure it would be fairly easy to program. It would be interesting to see which posers we can drop from our cousin lists. (Little joke.) The rest of the article assumes the ‘Alleles’ are phased. If they aren’t, then this doesn’t apply. [Editor's note: CeCe Moore has reported from a convention that the results aren't phased. Too bad. I leave the rest for posterity, or perhaps to inspire them to someday release the phased data.]
Note that the alleles are not labeled mom or dad. So allele1 in column 4 is likely from one parent and allele2 in column 5 is likely from the other parent but we don’t know which one (except for males who can ID their mom and dad’s X and Y chromosomes.)
What next? Well, one thing that is possible to do is to try to assign mom and dad to each of your alleles. Remember that the assignment will change from chromosome to chromosome. So let’s say you could determine that allele1 in chromosome 1 is from your mom. That doesn’t mean allele1 in chromosome 2 is from your mom as well; it could be your dad’s. BUT! It is possible to figure that out, although surely tedious without serious programming. Let’s say you have a two-segment cousin that you’re reasonably sure is only related on one side of your family. (Ashkenazi Jews and French Canadians – go have a seat, this probably doesn’t help you. SORRY!)
So, you have your FTDNA matching data (or 23andMe matching data, once the build gets fixed) and a two-segment cousin; let’s call him Fred. Fred matches you on chromosomes 3 and 21. You can ID the matching region on your phased chromosomes and label those alleles Fred. Now you also match George on one of those chromosomes, and George also matches Fred in the same area. You probably all three share the same sequences. (The best thing would be to get them to share those sequences with you so you can check, because it IS possible for all three of you to match each other on the same segment and yet have the matches be on different alleles! Yikes.) For convenience, we’ll assume George also matches you on the 2nd chromosome. If you ID which allele he matches you on, you can now say that the Fred alleles on chromosomes 3 and 21 are from Parent A, and so is the George allele on chromosome 2. Slowly, you could sort out your half from your mom and your half from your dad without a single close cousin, brother, aunt or parent.
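Here is a minimal Python sketch of that bookkeeping, using Fred and George as the example. It rests on the same assumptions as the rest of this article: the allele columns are phased along each whole chromosome, and each cousin is related on only one side of your family. The function name assign_parents and the (cousin, chromosome, column) record format are invented for this illustration. Column 1 means allele1 and column 2 means allele2. The sketch does not try to detect contradictions, so a cousin related on both sides would quietly produce wrong labels.

```python
def assign_parents(matches):
    """Label (chromosome, column) pairs as Parent 'A' or 'B'.

    matches: list of (cousin, chromosome, column) records, column being 1 or 2.
    The first cousin listed is arbitrarily called Parent A's side.
    """
    side = {}                             # (chromosome, column) -> "A" or "B"
    cousin_side = {matches[0][0]: "A"}    # cousin name -> "A" or "B"
    changed = True
    while changed:
        changed = False
        for cousin, chromosome, column in matches:
            key = (chromosome, column)
            other = (chromosome, 3 - column)  # the opposite allele column
            if cousin in cousin_side and key not in side:
                # A placed cousin tells us which parent this column came from.
                side[key] = cousin_side[cousin]
                side[other] = "B" if side[key] == "A" else "A"
                changed = True
            elif key in side and cousin not in cousin_side:
                # A labeled column tells us which side this cousin is on.
                cousin_side[cousin] = side[key]
                changed = True
    return side


# Hypothetical match records for the Fred and George example.
MATCHES = [
    ("Fred", 3, 1),
    ("Fred", 21, 2),
    ("George", 3, 1),   # same area and same column as Fred on chromosome 3
    ("George", 2, 2),
]
for key, parent in sorted(assign_parents(MATCHES).items()):
    print(key, parent)
```

With these records, George is pulled onto Parent A's side through his shared match with Fred on chromosome 3, so column 2 of chromosome 2 ends up labeled A as well. Every extra cousin who overlaps an already labeled column extends the labels to another chromosome in the same way.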
Now that you’ve sorted yourself into that which you got from mom and that which you got from dad, why stop there? In theory, you don’t have to although the methods would change. Instead of using cousins to sort alleles, you’d use them to define segments. Areas that are never crossed by your cousins are areas where the crossover happened between your mom’s or dad’s sets of chromosomes. Those could, in theory, be used to define the quarter you got from each grandparent. Without a good tree, you may not be able to define which parent contributed what, but you can say that this is parent A and that parent B. That this is grandparent C, that segment comes from grandparent D, the bit over there from grandparent E and that tiny bit on chromosome 11 is from grandparent F.
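As a rough sketch of the "areas never crossed by your cousins" idea, the Python below takes the matching segments you have already assigned to one parent on one chromosome and reports the stretches that no segment spans. The function name uncrossed_gaps and the segment positions are made up for illustration. A gap is only a candidate crossover location: it could just as easily mean that nobody who matches you there has tested yet.

```python
def uncrossed_gaps(segments):
    """Return the (start, end) stretches that no matching segment covers.

    segments: list of (start, end) positions on one parental copy
    of one chromosome. Overlapping segments are merged first.
    """
    gaps = []
    covered_to = None  # right edge of the block of overlapping segments so far
    for start, end in sorted(segments):
        if covered_to is None:
            covered_to = end
            continue
        if start > covered_to:
            gaps.append((covered_to, start))  # nothing bridges this stretch
        covered_to = max(covered_to, end)
    return gaps


# Hypothetical segment positions, in arbitrary units.
segments = [(10, 40), (30, 55), (60, 90), (85, 120)]
print(uncrossed_gaps(segments))
# prints: [(55, 60)]
```

In this made-up example the cousins fall into two overlapping clusters with nothing bridging 55 to 60, so everything left of the gap would be a candidate for one grandparent and everything right of it a candidate for the other.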
All that would be easier if you had the segment info from Ancestry, where everyone has thousands of matches. But it is theoretically possible to extract that phased knowledge and apply it to our known segment matches on 23andMe (build 37 PLEASE!) and FTDNA.