In preparation for future posts, I thought it might be a good idea to give an introduction to the more technical side (to the extent that bio-ignorant me is capable) of genetics as it relates to deCODEme and its cousins. I have only a vague sense of what words like ‘haplogroup’ mean, so this exercise will be as much for my own education as it is for anyone else’s. I’m sure Jon (and you readers) will jump in to correct any severe errors and offer more serious explanations.
Taking a cue from one of Jon’s unpublished posts, I’ll start with deCODE’s explanation of what they do.
What is deCODEme?
deCODEme will help me:
- Discover my gene profile
- Discover the origins of my ancestors
- Compare two genomes
- Compare my physical attributes and genetics
Says deCODE:
Over the past decade, we at deCODE have analyzed the genomes of hundreds of thousands of people. [...] We will analyze your genetic information, store it securely, and provide you with updates on your genetic profile as new knowledge becomes available in the field.
Analyze?
The tissue sample that you provide to deCODE is analyzed using the “Human1M DNA Analysis Beadchip.” According to the press release, the BeadChip is “the Industry’s first single-chip DNA analysis solution to contain more than one million SNPs.” More unfamiliar words: “BeadChip.” “SNP.” I gather that the “1M” refers to the “1 million SNPs”, whatever those are.
“Beadchips”
BeadChip is Illumina’s brand name for the bead array technology. The ‘beads’ in a bead array are microscopic things treated with dye and attached to an optical fiber. Depending on the treatment of the bead, it reacts differently to different sequences of genes, and this reaction can be optically measured. The bead arrays can be tuned to detect certain parts of DNA, which in Illumina’s case, have been worked out in co-operation with deCODE.
SNPs?
Wikipedia tells me that ‘SNP’ means “single nucleotide polymorphism.”
A single nucleotide polymorphism (SNP, pronounced snip), is a DNA sequence variation occurring when a single nucleotide – A, T, C, or G – in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). For example, two sequenced DNA fragments from different individuals, AAGCCTA to AAGCTTA, contain a difference in a single nucleotide.
A ‘snip’ seems to be a single change of letter in the familiar strings of As, Cs, Gs, and Ts that many of us will recognize as representing a DNA sequence. So, for example, if a portion of your DNA reads “GATTACA”, and someone else’s says “CATTACA” at the same place, then this may be a SNP. There’s also a statistical component to the definition. A SNP is only a SNP if the less common variation (or ‘allele’) occurs in at least about 1% of a population. (This raises the question, “Which population?”, which is addressed below.) Presumably, if I am the only one in the world with “CATTACA”, then it’s not a SNP, but if (maybe) 60 million also have “CATTACA”, while the rest of the world’s humans have “GATTACA”, then it’s a SNP.
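To make the two halves of that definition concrete, here is a minimal sketch in Python. The function names and the toy populations are my own inventions for illustration, and the 1% cutoff is simply the rule of thumb described above, not anything taken from deCODE or Illumina.

```python
from collections import Counter

def differing_positions(seq_a, seq_b):
    """Return the 0-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

def minor_allele_frequency(alleles):
    """Frequency of the second most common letter seen at one position."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0  # everyone has the same letter
    return counts[1][1] / len(alleles)

def looks_like_snp(alleles, threshold=0.01):
    """Apply the rough 'at least 1% of the population' rule of thumb."""
    return minor_allele_frequency(alleles) >= threshold

# The two examples from the text: one differing letter in each pair.
print(differing_positions("GATTACA", "CATTACA"))
print(differing_positions("AAGCCTA", "AAGCTTA"))

# Hypothetical populations, one letter per person at the same position.
common_variant = ["G"] * 980 + ["C"] * 20   # 2% carry the C
lone_variant = ["G"] * 9999 + ["C"]         # only one person carries the C
print(looks_like_snp(common_variant), looks_like_snp(lone_variant))
```

The first function answers “where do these two people differ?”, and the second part answers “is that difference common enough to count?”. Which population you count over is left entirely to whoever supplies the list.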
This presentation explains that there are probably between 1.6 and 12 million SNPs in the human genome, out of approximately 3.2 billion base pairs (the letters). These SNPs are supposed to represent one of the main sources of genetic variation among humans. The Illumina chip measures variations at 1 million of these points, and produces a genetic profile. So what deCODE gives you is not a ‘complete’ genome of several billion As, Gs, Cs, and Ts, as some people (me) might have expected, but more a listing of answers to 1 million genetic ‘questions’ that deCODE and Illumina thought were worth asking.
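For a sense of scale, the figures quoted above can be turned into percentages. This is just arithmetic on the numbers in the text (1.6 to 12 million SNPs, 3.2 billion base pairs, 1 million points on the chip); the variable names are mine.

```python
GENOME_BASE_PAIRS = 3_200_000_000
SNP_ESTIMATE_LOW = 1_600_000
SNP_ESTIMATE_HIGH = 12_000_000
CHIP_SNPS = 1_000_000

def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

# Share of the genome's letters that are estimated to be SNPs.
snp_share_low = percent(SNP_ESTIMATE_LOW, GENOME_BASE_PAIRS)
snp_share_high = percent(SNP_ESTIMATE_HIGH, GENOME_BASE_PAIRS)

# Share of the genome's letters that the chip actually reads.
chip_share_of_genome = percent(CHIP_SNPS, GENOME_BASE_PAIRS)

# Share of the estimated SNPs that the chip reads, at either end of the range.
chip_share_if_few_snps = percent(CHIP_SNPS, SNP_ESTIMATE_LOW)
chip_share_if_many_snps = percent(CHIP_SNPS, SNP_ESTIMATE_HIGH)

print(f"SNPs are {snp_share_low:.3f}% to {snp_share_high:.3f}% of the genome")
print(f"The chip reads {chip_share_of_genome:.5f}% of the genome")
print(f"That is {chip_share_if_many_snps:.1f}% to {chip_share_if_few_snps:.1f}% of the estimated SNPs")
```

In other words, the chip looks at roughly three letters in every ten thousand, but that may still be a sizeable fraction of the positions where people are expected to differ.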
These 1 million points were selected by deCODE and Illumina using their powers of intelligent SNP selection (pdf link). deCODE and Illumina start with data that associates the risk of disease or frequency of a physical trait with particular SNP variations. They then do some statistics using data derived from more sophisticated equipment to determine how these variations can be most efficiently measured. Out of the millions of potential SNP choices, some are designated as “tag SNPs”, which are meant to be useful stand-ins for other SNPs. From what I can gather, tag SNPs are determined through some kind of statistical analysis that shows Illumina and deCODE which differences matter the most (“high information content”), and which are likely to mean the same thing (maximum “genomic coverage”).
The ‘differences that matter’ part was determined by setting a threshold for the occurrence of the most common allele (the more common letter) at 95% within a certain population. Illumina used four populations based on ethnicity (“Caucasians in Utah, Han Chinese in Beijing, Japanese in Tokyo, and Yoruba in Ibadan, Nigeria”) to find out which SNPs fit this criterion. (This begins to answer the ‘which population?’ question for the SNP definition above. The answer is surprisingly close to “whoever happens to be around!”)
“Genomic coverage” is achieved, again, through statistical analysis that looks for any SNP variations that occur together often enough to be considered one group, or representative of each other. These are both essentially strategies for deriving useful genetic information using a simpler and more economical chip design.
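The “stand-in” idea can be sketched in a few lines of Python. To be clear, this is my own guess at the general shape of such an analysis, not Illumina’s or deCODE’s actual procedure: I measure how often two SNPs vary together with a squared correlation, and then greedily pick tags. The SNP names, the toy data, and the 0.8 cutoff are all assumptions made up for illustration.

```python
from statistics import mean

def r_squared(x, y):
    """Squared correlation between two SNPs coded as 0/1 per chromosome."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a SNP that never varies tells us nothing about another
    return cov * cov / (var_x * var_y)

def pick_tag_snps(snps, min_r2=0.8):
    """Greedily choose tags; returns {tag: [SNPs that the tag stands in for]}."""
    remaining = list(snps)
    tags = {}
    while remaining:
        # Pick the SNP that varies together with the most still-uncovered SNPs.
        best = max(
            remaining,
            key=lambda s: sum(r_squared(snps[s], snps[t]) >= min_r2 for t in remaining),
        )
        covered = [
            t for t in remaining
            if t == best or r_squared(snps[best], snps[t]) >= min_r2
        ]
        tags[best] = [t for t in covered if t != best]
        remaining = [t for t in remaining if t not in covered]
    return tags

# Hypothetical data: four SNPs read on eight chromosomes (0 = one letter, 1 = the other).
toy_snps = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp_b": [0, 0, 0, 0, 1, 1, 1, 1],  # always matches snp_a
    "snp_c": [1, 1, 1, 1, 0, 0, 0, 0],  # always the opposite of snp_a
    "snp_d": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the others
}
print(pick_tag_snps(toy_snps))
```

Here measuring snp_a alone is enough to know snp_b and snp_c, so a chip designer could spend only two ‘questions’ (snp_a and snp_d) to learn four answers. That is the economy the tag SNP approach is after, whatever the real statistics look like.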
2. Discover the origins of my ancestors
deCODE will provide you with a data file that contains information about all 1 million of your SNPs, but their main service is in giving you easy access to what the data ‘really means.’ deCODE updates their database with ‘reliably confirmed’ findings to refine their interpretation of your SNP profile. One part of their service promises to allow me “to virtually reconstruct the geographical distribution of your ancestors back hundreds or even thousands of generations.”
What gene chip information do they use to determine ancestry and geographic distribution? There’s a hint on the same page.
Through an intuitive interface, you will be able to view how your mother and father and their ancestors contributed to your genome and how much of your genome is derived from people from Africa, Europe, or Asia.
If we think back to the Illumina press release, we might think that by “Africa, Europe, or Asia”, they mean “Ibadan, Utah, Beijing, and Tokyo”, but let’s leave this aside for now. Their FAQ says:
You will be able to use the results to tell yourself about your ancestry in the broad sense of ethnicity, but not specifically to trace your exact geographical origin. Where other services are generally only using sex specific transmission factors (mitochondrial and Y chromosome markers) our results take the whole genome into account. We compare your genetic similarity with over 50 ethnicity groups and we also analyze your ancestry down to individual chromosomal regions.
So their analysis of ancestry is not based simply on the four-way Utah-Ibadan-Beijing-Tokyo ethnic distinction, but on more than 50 ethnicities. In fact, if you log in to deCODEme and search for friends, alongside people claiming to be Nature Genetics editor Myles Axton and deCODE CEO Kári Stefánsson, you’ll come across long-lost buddies like “Adygei, A reference individual for the Adygei population” and “BiakaPygmy, A reference individual for the Biaka Pygmy population.”
As both Jon and I have mentioned in previous posts, there are interesting online communities made up of people enthusiastic about genetic genealogy and using services like deCODEme to trace their roots. deCODE’s website doesn’t do a great job of explaining exactly how they figure out ancestry based on a person’s genetic profile, but I suspect that it’s largely based on measuring similarities in SNP variation patterns with ‘reference individuals’. Maybe they send people out to gather samples from all over the world. This post at the RootsWeb genealogy site suggests that deCODEme uses data from the International HapMap Project.
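Since this is only my suspicion, the following Python sketch should be read as an illustration of what “measuring similarities with reference individuals” could mean, not as deCODE’s method. It counts how many letters two people share at each SNP they were both measured at, and ranks the references by the average. The SNP identifiers, genotypes, and reference names are invented.

```python
from collections import Counter

def shared_alleles(genotype_a, genotype_b):
    """How many letters (0, 1 or 2) two genotypes such as 'AG' and 'AA' share."""
    return sum((Counter(genotype_a) & Counter(genotype_b)).values())

def similarity(profile_a, profile_b):
    """Average fraction of shared letters over the SNPs both profiles contain."""
    common = profile_a.keys() & profile_b.keys()
    if not common:
        return 0.0
    total = sum(shared_alleles(profile_a[s], profile_b[s]) for s in common)
    return total / (2 * len(common))

def rank_references(me, references):
    """Sort reference individuals from most to least similar to 'me'."""
    scores = {name: similarity(me, profile) for name, profile in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical profiles: SNP name -> the two letters a person carries there.
me = {"snp_1": "AG", "snp_2": "CC", "snp_3": "TT"}
references = {
    "reference_1": {"snp_1": "AA", "snp_2": "CC", "snp_3": "TT"},
    "reference_2": {"snp_1": "GG", "snp_2": "CT", "snp_3": "CC"},
}
for name, score in rank_references(me, references):
    print(name, round(score, 3))
```

With a million SNPs instead of three, and dozens of reference individuals instead of two, a ranking like this would at least be one plausible way to say which reference populations a profile most resembles.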
The International HapMap Project is a “multi-country effort to identify and catalog genetic similarities and differences in human beings.” Groups from Japan, the UK, Canada, China, the US and Nigeria are involved in various capacities (“community engagement”, “sample collection”, “analysis”, “genotyping”, etc.) to put together analyses of ‘haplotypes’. The groups that are participating in the project seem to explain why Illumina and deCODE chose the ethnicities they did to select tag SNPs for their DNA chip. This project is probably also where they are getting their reference individual data from.
What is a ‘haplotype’?
The HapMap website explains:
The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A. These regions of linked variants are known as haplotypes.
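A toy example in Python may help picture the A-versus-G scenario in that quote. The eight ‘chromosomes’ below are made up: each string is the set of letters one chromosome carries at the same four neighbouring SNPs. The function names are mine as well.

```python
from collections import Counter, defaultdict

# Hypothetical data: eight chromosomes read at the same four nearby SNPs.
chromosomes = ["ACGT", "GTAC", "ACGT", "GTAC", "ACGT", "ACGT", "GTAC", "GTAC"]

def haplotype_counts(chroms):
    """Count how often each combination of letters (haplotype) appears."""
    return Counter(chroms)

def haplotypes_by_allele(chroms, position):
    """Group the observed haplotypes by the letter found at one SNP position."""
    groups = defaultdict(set)
    for chrom in chroms:
        groups[chrom[position]].add(chrom)
    return {allele: sorted(found) for allele, found in groups.items()}

counts = haplotype_counts(chromosomes)
possible = 2 ** 4  # four SNPs with two letters each could combine 16 ways
print(len(counts), "haplotypes observed out of", possible, "possible")
print(haplotypes_by_allele(chromosomes, 0))
```

In this toy region, everyone with an A at the first SNP carries the same letters at the other three, and likewise for G. Only two of the sixteen possible combinations ever show up, which is what it means for the variants to travel together as a haplotype.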
This concept sounds very similar to the “tag SNPs” that deCODE and Illumina use in their DNA chip design, and the two are probably equivalent for some purposes. However, where ‘tag SNP’ was a concept derived from statistics, ‘haplotype’ is perhaps best understood in the context of heredity. The HapMap site goes on to explain that haplotypes are the product of histories of sexual reproduction. In short, haplotypes are the bits of genes that are passed from parent to child, and are therefore shared between siblings, and over time, by local interbreeding populations. Haplotypes can then be clumped together in ‘haplogroups’ (“a group of similar haplotypes that share a common ancestor with a single nucleotide polymorphism (SNP) mutation”) which can be associated with certain ethnic groups, based on historical migration data and other non-genetic information. It’s important to note that haplogroups don’t define ethnic groups using genetic information, but associate genetic profiles with certain modern-day ethnic groups. deCODEme is not clear on this point, but thankfully, competitor 23andMe explains things a bit better in this white paper. The important bit is their definition of “Ethnic group”.
Ethnic group: Any set of individuals affiliated with a socially-constructed label, where that label indicates sharing of a set of cultural characteristics. Individuals may have more than one ethnic identity, and an individual’s ethnic identity may change through time. Because ethnic identity has developed in the context of human migrations, ethnicity is sometimes correlated with patterns of genetic variation.
It’s an interesting thing they’ve done here, which is to de-emphasize any causal relationship between genetics and ethnicity. The definition acknowledges that ethnic identities can be multiple and changing, and that they are not biologically determined, but it still maintains a certain, albeit indistinct, relationship between ethnicity and genetics. The connection between haplogroup and ethnicity (or culture) is made using historical patterns of migration and segregation as a proxy.
What are the haplogroups that people are interested in?
This is maybe where things start to get really interesting, and I’ll continue writing about this in a few days.

