Study Designs: Genetic Association Studies

[Dr. Teri Manolio]
So, I’d like to cover
case-control and cohort studies. I won’t talk to a large
degree about the design issues involved in these,
because I realize I’m, you know, sort of preaching
to the choir talking to a group of epidemiologists, but will
make a couple of comments about that. And then candidate gene studies,
genome-wide association and a little bit about randomized
and experimental designs. You may have seen this, it
came from Francis Collins in “Nature”
of 2004. This is actually shortly after
we had started working together. He had asked me to come over and
just help out with a workshop they wanted to have on
the possibility of maybe doing a very large,
and when genomicists say very large, you know,
remember, they think in terms of three billion base
pairs, so he was thinking of a 500,000-person cohort study
of genes and environment. This piece came out after a
yearlong effort to sort of map out what such a study
might look like. And a panel of experts that
we had brought together recommended that obviously a
study would need to be large, because you want to capture lots
of different diseases and the diversity of the
U.S. population, you want it to fully represent
the U.S. minority groups, have a broad range of ages,
broad ranges of genetic backgrounds and environmental
exposures. You’d probably want to have
some family-based recruitment for at least part of the study
to account for population stratification that Tom will
be talking about tomorrow, lots of data to be collected on
these people if you go to all the trouble of recruiting them,
you’d want to have very technologically advanced
measures, you’d want to collect and store biologic
specimens, you’d need a sophisticated data management
system, a whole number of things that were sort
of recommended as the, you know, the next big cohort
study — big, big, big cohort study, as it were. And this received sort of
a variable response. This is another Gary Larson:
“Now, stay calm. Let’s hear what they
said to Bill.” So, there were those who said
this would cost way too much, it was way too big,
it wasn’t necessary, it was already being done,
there — you know, a variety of responses that
one often gets. And we then went back and
sort of tried to lay out in more detail what some of the
problems are with case-control studies, because particularly
in the genetics literature this has been, you know, really
the darling of this — of the approach, primarily
because it’s easy — you know, people
think it’s easy. It’s hard to do well,
but it’s easy to collect cases and compare them
to a bunch of controls. And then there was this kind
of back and forth with Walt Willett’s group
suggesting that, well, if we needed to come up with a
new cohort study because we all recognize the strengths of
cohorts, maybe we could just merge the cohorts we
have rather than come up with a whole —
an entirely new one. And we were asked to kind of
write, you know, kind of a point-counterpoint
to that, saying, yes, that’s all well and good,
but if you were to do that, you would end up with studies
that were kind of a conglomeration of existing
cohorts recognizing that, just a simple example,
the age distribution of existing cohorts,
which is shown here in a survey that we had done
of cohorts in preparation for this planning process, did not at all approximate
the age distribution of the U.S. population, which is
shown here according to the U.S. census. And a variety of other things
that would be shortcomings of existing cohorts recognizing
that there are many strengths and we should make use
of those strengths, but in addition should also
do a large cohort study. And we’ve talked a fair amount
with geneticists about the pros and cons of case-control
studies, recognizing that a major strength of them is that
it’s probably the only way to study rare disease or
those of long latency, again, not something I need to
emphasize with this audience. Existing records can be used,
you can study multiple etiologic factors
simultaneously. They may be and often are
less time consuming, often they are less costly and
if the assumptions are met, the inferences are reliable. And then one tries to sort of
communicate, but there are some real challenges with these:
you’re relying on recall or records for information on past
exposures, it’s very difficult to validate that information
at times, selecting the appropriate comparison group can
be difficult and no matter what group you select, reviewers or
colleagues or competitors will criticize it. Multiple biases can give
spurious evidence of an association, you usually can’t
study rare exposures and temporal relationships
between exposures and disease can be very difficult
to define. So, when I sell these things
to my genetics colleagues, they say, “But this is genetics,
you dumb epidemiologist. This is different. Genes are measured the same way
in the cases and controls, not a problem. So, information on the key
exposure is really very easy to validate. There’s no recall or reporting
involved and the temporal relationship, the genes were
present since conception, so, you know, that’s
a piece of cake.” My response to that is often
that bias-free ascertainment of cases and controls is
still a major concern, and we’ll talk more about —
about some [unintelligible], especially about some of
the biases involved in collecting these. And cases in most clinical
series are highly unlikely to be representative and that
assessment of risk modifiers or gene-environment interactions
is highly likely in a case-control study to be
incomplete or flawed for a variety of reasons. And so, so we go back and forth
on this and, as you can imagine, sometimes being the only
epidemiologist in the room, it’s a challenge. That’s why it’s been nice
having Tom around. So, so weaknesses of —
appreciating weaknesses of case-control studies. I think at times epidemiologists
tend to view them as this Larson cartoon, the monster
climbing in the window, where the geneticists tend
to view us as kind of Chicken Little, you know,
complaining that the sky is falling and, you know,
can’t we look at some of the successes that there have
been, and there have been some major successes
in this area. And probably the truth lies
somewhere in between, and we need to be sure we use
both designs but recognize their weaknesses. So, I thought I’d talk a little
bit about candidate genes, especially in this
election year. Genetics studies prior to 2005
were almost exclusively this kind of work,
candidate gene work. The goal was to characterize
what we would call a candidate based on what we might
know about the genetic — the biologic pathway,
the genetic mechanism, et cetera, what might be
related to the disease. Usually these kinds of studies
weren’t intended to find genes or find variants related
to disease as linkage studies were. Linkage studies, you set up all
the signposts across the genome and you try to figure out which
of the signposts seems to be inherited with the disease and
then maybe something near the signpost is related to — is a
variant related to disease. But these were generally done
after the — the potentially disease-related variants
were identified. One could assess
generalizability of observations in families,
family-based observations, such as the BRCA1 variants,
which showed very, very strong influence of risk in
Ashkenazi Jews but much less of an effect and much lower
prevalence in other groups. Assess the importance of allelic
variation at the population level, again, population
attributable risk, and penetrance, as the geneticists often call it,
which is the likelihood that the gene will be expressed or
the disease will be present in someone who carries
a particular variant. And then identifying
modifications of genetic associations by
environmental factors. So these are all the sort of
things that population-based epidemiologists would tend
to do with candidate genes. One of the first and perhaps
best characterized of the candidate gene studies was
the angiotensin-converting enzyme variants. The ACE enzyme had been
identified long ago, this is just from a textbook,
but this was probably in Guyton’s work long ago,
in terms of the pathway that leads to increased blood
pressure and vasoconstriction. And the gene was identified
through a variety of very elegant steps that were very
difficult at the time, 1989, 1991, et cetera, linked
to elevated blood pressures in rat models, et cetera, and
then finally mapped to the human chromosome 17
by Jeunemaitre. And then it was also shown to be
associated with levels of ACE — the ACE enzyme levels,
which is very nice, but if you think you’ve found
your gene, you would like to see that it’s associated with
differences in levels. And as you can see, this is an
insertion-deletion polymorphism, Tom mentioned
indels before. There was a small segment of
DNA that was either present or deleted in carriers of this
variant, and it was found through RFLP analysis. And again, this is a 250
base pair insertion. And people who were homozygous
for the insertion had higher — sorry, had lower ACE levels than
those who were heterozygous, and much lower than those who
were homozygous for the deletion, and this is true
on a log scale as well. What really sort of shook the
world was this Cambien paper, and some of you may remember
when this came out. This was typed actually
in a prospective study, the ECTIM study — I’ve
forgotten what it stands for; it may be on one of these
slides — but, at any rate, identifying this particular
deletion polymorphism as a potent risk factor for
myocardial infarction from this French group. And what they showed was that
in 1,304 MI cases and a group of matched controls — sorry,
this is the total number of cases and controls, that those
with a DD polymorphism had somewhat greater risk,
the insertion-deletion heterozygotes had a — sorry,
a lower risk and those with the insertion polymorphism
were basically not changed. This was a relatively weak
association, although it was significant given the numbers
that they studied. But when they stratified on
low risk versus high risk, this association really came
out and people who were homozygous for the deletion
had a much higher risk, an odds ratio of three,
versus those who were heterozygous or carried the
homozygous insertion and those who were already at
higher risk based on other cardiovascular risk factors
really didn’t seem to have much association with
this polymorphism. So, this caused lots
of excitement. Many, many prospective studies
rushed to try to replicate this finding. Not many were able to do so,
and in fact, in hypertension you’d see papers come out
like this, my colleague Chris O’Donnell from Framingham,
showing an association with the DD polymorphism and
hypertension, for example, with the reasonable association,
but really nothing in women, a lot of times inconsistencies
and it really wasn’t clear sort of what was going
on with this. Well, what was probably going on
with it was perhaps a spurious association, an association that
seemed to make sense that was found in one study that,
you know, for whatever reason, luck or lack of luck
or whatever, identified it, but then when we tried
to replicate it, it was not found
subsequently. And actually, Hirschhorn did a
very nice review of this issue in 2002 in “Genetics
in Medicine,” showing the number of
association studies that really started to take off as
the genotyping technology got better, and as we identified
more and more polymorphisms and he had this huge peak,
and this pretty much continued on up. It’s leveled off a
little bit into 2002. And yet, of the 600 papers
that they identified that had reported an association in
more than two or three studies, I believe only six, so 1 percent
of them, were significantly associated in more than
three-quarters of the identified studies. One of them is near and dear
to us in cardiovascular epidemiology, the factor V
Leiden variant with deep venous thrombosis. There are a couple
of others here. The ApoE is one of the strongest
associations that’s ever been found with Alzheimer’s disease,
but these were the only six really that came out as being
robust and replicated. We did a similar sort
of exercise in carotid atherosclerosis. We basically took all of the
variants that had been reported somewhere as being
associated with coronary artery disease and said,
well, we recognize that coronary disease is sort
of a distal phenotype, and there are many things that
lead to coronary disease, one of them is
atherosclerosis. That may be a little bit
closer to the genetic, you know, product, whatever it
might be, and so perhaps as kind of an intermediate
phenotype or a more objective measure it would be
something that would show a stronger association. And actually when we looked
at that with a variety of variants, you notice the
ACE insertion-deletion, there were 13 that showed an —
13 studies showed an association, one of them showed
it with the different allele, 18 showed no association,
so the summary sort of favored none and many
favored no association. And many of these variants then
that had the strongest evidence initially really, you know,
pretty much were equivocal. The only one that kind of
came out was the MMP3, matrix metalloproteinase 3,
probably because there were only four studies at the time
we did this review that had looked at it and as it was
looked at more it sort of dropped out as well. So, candidate gene studies
were not terribly fruitful. Initial enthusiasm has been
markedly damped by failure to replicate findings. And the point has been made that
you can probably find a study or a story or some kind of a
biologic pathway that will fit almost any candidate to
almost any disease or trait. And understanding of the genome
function is — was just too preliminary at this point to
project more than really a handful of plausible
candidates. I want to point out that the
ApoE was never in a million years thought to be related
to Alzheimer’s disease. I mean, that was, in a way,
a fluke finding. It was found in a linkage study,
but it was incredibly strong, it was replicated again and
again and again and again, and now we have some really good
pathophysiologic reason for that to be there. So, it — this wasn’t
a fruitless effort, but it came up with a lot
of false positives. So, a paradigm change was needed
much like these fellows are having here. “Hey, they’re lighting
their arrows. Can they do that?” And we needed perhaps to burn
the barn down a little bit, and the way we did that was,
as we talked about last time and Tom has mentioned,
through genome-wide association studies. So, here’s a karyogram with all
the various bands of the genome shown here. These bands tend
to stain. They’re just GC rich regions,
if you’re interested in cytogenetics, there are people
who spend their entire careers doing this kind
of work. But, in 2005 we had basically
no genes for common diseases, common complex diseases,
and we refer to complex diseases as the ones that,
you know, you had hoped were single-gene Mendelian but
you didn’t find the Mendelian gene so it must
be complex. But, at any rate, in 2005 there
was one variant identified for a complex disease. It was here on chromosome 1
for complement factor H in age-related macular
degeneration. In 2006, there were two others
both on chromosome 1 for QT interval and — sorry,
QT interval is down here, and inflammatory bowel
disease and then another on chromosome 10 for age-related
macular degeneration. 2007 really was the year of
genome-wide association studies. We’ve sort of broken
it up into quarters, and just to kind of page through
this, the second quarter was incredibly productive. Several studies simultaneously,
prostate cancer, breast cancer, diabetes, the Wellcome Trust
Case-Control Consortium, all kinds of findings,
all of them replicating; really very, very exciting. Third quarter, fourth quarter
and then just the first quarter of 2008. So, this has been really
an incredible progression. I think when genome-wide
association started there was a prediction that, you know,
within five years we’d have identified, you know,
five variants for maybe 10 diseases and we’ve done
far better than that. So, 2007 was sort of dubbed
the year of genome-wide association studies. This was “Science,” I believe,
that dubbed it. Its breakthrough of the year was
the measures of human genetic variation made possible through
the HapMap and similar kinds of genetic tools. And just the number of
studies in our catalog, which I’d encourage you to look
at and let us know if there are ways you think it can be
improved or things that you find wrong with
it or whatever. But there have been 53 traits
now that have had published genome-wide studies,
some of them in things you wouldn’t think would be,
you know, of a particular public health importance,
although many are. Restless legs is one that’s
often identified as, gee, why do we have genome-wide
studies of that and we don’t have it of, you know, fetal
malformations and that? And if you’ve ever dealt with
somebody who has restless leg syndrome, it’s actually a very
common and very troublesome kind of condition. And here’s our catalog here,
so just to give you sort of a screen shot of it. If you Google “GWAS catalog,”
I think it comes right up. And these are the pieces
of information that we’re pulling out. We decided we had to track this
anyway for some work that we were doing and we thought maybe,
you know, in sort of the genome way we would make this
information available to the scientific community,
so we collect all of this information. The stuff in white is kind of
the easy stuff to pull out. It’s the things in blue,
the gene region, the genes that are in that
region, the strongest SNP and the risk allele,
risk allele frequencies, p-values and odds ratios are a
little bit harder to pull out. We want to make sure that
that information is right, so it tends to lag
a little bit. We’ll identify a study and then
say this stuff is pending. And we’ve kind of committed to
doing this through the end of 2008, and if we’re all still
standing at that time we may continue, we may not. But — and then this is just
an example of sort of the full catalog and shows you
disease, trait, replication, sample size, region gene,
strongest SNP, et cetera. Some studies had more than one
SNP identified and we tried to show them there. We kind of picked the top five
just as a starting point and we may try to go back and
pull out more of those. And we tended to pull out the
top five sort of new ones. So, this isn’t a real good
resource for telling what’s been replicated, and that’s
been a criticism, and one we’re trying
to address, okay?

[Male Speaker]
I was just curious
[inaudible].

[Dr. Teri Manolio]
Strongest, I’m sorry.

[Male Speaker]
[Inaudible] I wonder what —
what percent of variation actually that’s explained
by these —

[Dr. Teri Manolio]
Right, so your question is,
for the strongest risk allele or risk alleles, what percent
of variation —

[Male Speaker]
Or alleles. I mean, you identify a
bunch of them, I mean —

[Dr. Teri Manolio]
Yeah, and it’s very small. It’s probably less than
5 percent of the genetic variation.

[Male Speaker]
[Unintelligible]

[Dr. Teri Manolio]
I mean, I showed you a couple
of examples where the authors are saying that it’s explaining
more and I’m, you know — I question that, but it’s very
little, 5, 10 percent at most.

[Male Speaker]
[Unintelligible].

[Dr. Teri Manolio]
Yeah, yeah — often,
often it’s not. Yeah, there’s no —
sort of no estimate. So, as I say, we’ll do this as
long as we hold out anyway. And then just to remind folks
what a genome-wide association study is, it’s a way
of interrogating all 10 million variable points
across the genome, recognizing that since it’s
inherited — that variation is inherited in groups,
you don’t have to test everything, all
10 million points. You can test just a few of them,
300,000 to 1,000,000 for example. And what one could do,
taking the Samani study of coronary disease from the “New England
Journal,” just their most strongly
associated — and when I say strong I’m coming
from the genome world, which is actually the p-value,
so we’re talking about the significance level. And these folks calculate
significance levels to three digits — I mean, three digits
on the exponent, so, you know, 10 to the minus 100th. When I was a baby epidemiologist
they taught us, you know, nothing less than 10 point —
10 to the minus fourth was even worth reporting. You just report those all
as 0.0001 and move on, but times have changed. So, their SNP with
the most significant p-value that sort of survived replication,
et cetera, was this one. I was taught by a friend and
I found something very useful was just to sort of look for
SNPs by their last four — you know, you get asked
your last four of your social security number,
so I know this one fondly as 3049. But, at any rate, you find that
it’s much easier when you’re looking through papers and
trying to find these things. And the 3049 C allele
was found in 55 percent of the cases, only 47 percent
of the controls, and the G allele conversely,
giving an allelic odds ratio that is, for carrying the
allele, either one or two copies, it doesn’t matter,
just what’s the allelic odds ratio, 1.4, a very high
chi-square and a very small p-value associated with that. One can also look at the three
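The allelic odds ratio quoted for the 3049 C allele can be checked with a few lines of Python. This is only a sketch: it assumes the 55 and 47 percent figures are C-allele frequencies among case and control chromosomes, and the function name `allelic_odds_ratio` is illustrative, not something taken from the study.

```python
def allelic_odds_ratio(freq_cases: float, freq_controls: float) -> float:
    """Odds of the allele among cases divided by the odds among controls."""
    if not (0.0 < freq_cases < 1.0 and 0.0 < freq_controls < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Frequencies as quoted in the talk (assumed to be allele frequencies).
or_c = allelic_odds_ratio(0.55, 0.47)
print(round(or_c, 1))  # rounds to the 1.4 quoted in the talk
```

The unrounded value is about 1.38, which is consistent with the quoted 1.4.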
genotypes, so you could either say do you have C or not,
or you could say are you a CC, CG or GG. And in this study again,
31 percent of the cases were the CC genotype, versus
23 percent of the controls and then conversely in
the GG genotypes, very high chi-square now with
two degrees of freedom because you have three risk
groups, 10 to the minus 14th. You can calculate a
heterozygote odds ratio, which would be the risk of the
heterozygote group to the ancestral allele, and a
homozygote odds ratio, the risk of the homozygote
variant to the ancestral allele. I might mention we’ve kind of
gotten away from the terms wild type and mutant,
wild type because one could imagine, you know,
Martha and George sitting in a doctor’s office and the
doctor saying, you know, “George, you have the wild type
of such and such,” and Martha saying, “I never knew you were
such a wild type myself.” But, at any rate — and mutant
because it carries certain connotations, and so we prefer
to refer to either the ancestral or common allele and
the variant allele. And what one does then is
just calculate these — all of these chi-square values
or whatever association statistic you have across
all of your SNPs, so, you know, 300,000,
500,000 times. And as I may have mentioned
earlier, because DNA is a linear molecule, you can
just start with the P end of chromosome 1. P — remember the chromosomes
have two arms, P is the part above the centromere,
it’s usually smaller than the part below
the centromere. P is — these were named by
the French, so P for petite, and then Q for
the long arm. But anyway, you just kind of
line these up and here are your p-values in this particular
study by Bierut. They just plotted p-values
across the genome and you can plot them in many
different ways. This one, as I mentioned before,
the negative log of the p-value. This is the Klein study of
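The negative-log transform used in these genome-wide plots is simple to reproduce. The sketch below is illustrative only: the p-values are made-up examples apart from 10 to the minus 14th, which is the genotype-test figure quoted earlier, and `neg_log10` is a hypothetical helper name.

```python
import math


def neg_log10(p_value: float) -> float:
    """Convert a p-value to -log10(p), so smaller p-values plot higher."""
    if not (0.0 < p_value <= 1.0):
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


# Example p-values; only 1e-14 comes from the talk, the rest are illustrative.
p_values = [0.05, 1e-4, 1e-14]
heights = [neg_log10(p) for p in p_values]
for p, h in zip(p_values, heights):
    print(f"p = {p:g} plots at height {h:.1f}")
```

On this scale a p-value of 10 to the minus 14th sits at a height of about 14, which is why strong signals stand out as tall spikes.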
macular degeneration, which sort of really got
this whole field going in March of 2005. And here’s, you know, another
somewhat more colorful way, even more colorful way
of plotting them. You can plot them, you know,
multiple studies on a single page. In fact, the Wellcome Trust,
I think, was a — I don’t know, a $15 million exercise and this
has been called the $15 million plot, so at
any rate. And as Tom mentioned, this sort
of deluge of data, not only just data but
also positive findings, has been alluded to drinking
from a fire hose. David Hunter and Peter Kraft
at Harvard wrote this very nice summary on statistical
issues in genome-wide association studies, and they
conclude that there have been few if any similar bursts of
discovery in the history of medical research,
and I think that’s probably true over this
short a period of time. Lessons that we’ve learned from
kind of the initial burst of genome-wide association studies:
probably the biggest lesson has been we really don’t know much
about disease pathophysiology or biologic pathways. And these are just a few of
the signals in genes that nobody would have suspected
as being related to these particular diseases. Macular degeneration I already
mentioned related to complement factor H, which is part of
the inflammatory pathway. Macular degeneration was thought
to be an ischemic disease until this finding. Coronary disease related
to CDKN2A and 2B. These are cell cycle variants
that are actually related to cancer, wouldn’t have been on
anybody’s candidate gene list. Childhood asthma related to
ORMDL3, which I’ll go into a little bit more. Type II diabetes with another
cell cycle variant. QT interval prolongation,
Tom mentioned related to a nitric oxide
synthase variant. There have been a number of
variants found in places where there really aren’t
any genes at all. And in the past when linkage
signals were found in these areas they were discounted. People — like, just like
finding linkage signals in introns, people said,
oh, that can’t be right or it must be a false positive,
we know there are lots of those and that, but the 8q24
association in particular has been found over and over
and over and over again, and it’s, you know, it’s there,
it’s not going to go away, and it really is going to change
our understanding of the biology of the genome to understand how,
you know, a variant in a place that doesn’t have anything to
do with protein coding is so related to cancer. Crohn’s disease has several
variants strongly replicated that are in areas without
any known genes. And then signals in
common across diseases, so diseases people would not
have thought were quite too terribly related. Maybe diabetes and CHD you
might have thought were, but even when you control for
diabetes as a risk factor for CHD, these associations
remain. These variants, as I mentioned,
are related to cancer, particularly familial
invasive melanoma. Again, not something that
one would have expected. They’re also associated with
frailty, in a study from the U.K. Prostate cancer, breast and
colorectal cancer are all associated with this
8q24 region. Crohn’s disease and psoriasis
do have some characteristics in common, so maybe that’s
not a big surprise, but I think Crohn’s disease and
type I diabetes were never expected to be related any more
than just in the immune — so, you know, the major
histocompatibility complex where they do share a locus,
but they also share one in this phosphatase, PTPN2, and
rheumatoid arthritis and type I diabetes share one in
another phosphatase. So, lots of new things learned
from these kinds of studies that really are in many ways
kind of setting biology on its ear. Unique aspects of these studies
are that they permit an examination of variability
really at an unprecedented level of resolution, so much —
you know, down to the five to 10 kb region or maybe
even tighter than that. As Tom has mentioned a couple
of times, they permit sort of an agnostic genome-wide
interrogation, so you don’t have to be able to
identify your candidate genes in the places you
want to look. One of the nice things about
this is once you’ve measured the genome, you can relate
it to just about any trait. So, once you have your
genome-wide association study, if you have a cohort
study or a group that’s been characterized in extensive ways,
you can then relate those things to anything, whereas with
candidate gene studies you sort of had a separate set
of candidate genes for each trait. And as I mentioned, most of
the robust associations have not been in genes
previously suspected, association and not in regions
even known to harbor genes. But, as Hunter and Kraft point
out, the chief strength of the new approach is also
its chief problem. With more than 500,000
comparisons, the potential for false positives really
is unprecedented. So, we were worried about false
positives with candidate genes. Here, you know, it’s really
a huge, huge, huge problem. And, again, Gary Larson knew
this and said, “God, Collings, I hate to start a Monday
with a case like this,” and here you see this knife
sticking out of the back and the butlers of the world
annual convention. So, the challenge is, how do
you find, you know, the murderer from all of
these false positives. And there are a number of
ways of dealing with this multiple testing problem. Probably the most familiar,
the easiest to grasp and the one most commonly used
is the Bonferroni correction, which is simply dividing your
alpha level by the number of tests performed. There are a couple of
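The Bonferroni correction just described can be sketched as follows. The overall alpha of 0.05 is an assumption (the talk does not state one), the 500,000 tests come from the number of comparisons mentioned above, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold: the overall alpha divided by the number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def passes_bonferroni(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if a single test's p-value clears the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Assumed alpha of 0.05 spread over 500,000 SNP tests.
threshold = bonferroni_threshold(0.05, 500_000)
print(f"per-SNP threshold: {threshold:.1e}")
print(passes_bonferroni(1e-14, 0.05, 500_000))  # a 10^-14 result clears it
print(passes_bonferroni(1e-4, 0.05, 500_000))   # a 10^-4 result does not
```

Under these assumptions each SNP has to reach roughly 10 to the minus 7th, so a p-value of 0.0001 that once looked impressive no longer counts.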
others that have been in the literature. Probably the second most common
would be the false discovery rate, which is the proportion
of significant associations that are truly false positives,
so it gives you a different denominator. The false positive report
probability of [unintelligible] at the NCI, the probability that
the null hypothesis is true given a significant finding. Logically, I don’t see much of
a difference between these two at least in the way they are
described by their authors, but they are different
mathematically, and are said to give you
somewhat different results. But, probably the best
approach is replication of a finding and replication
many, many, many times. So, in order to address this,
we recognized that there was a need to define sort of what
replication consisted of. This is a report from an NCI
NHGRI working group that was convened in November 2006. And, yeah — and there are
a number of ways of going about replication. This is sort of the design of
a study to do replication given an initial study and
wanting to have sort of your replication sets
all lined up. So, this was proposed by the
group who were studying prostate cancer
at the NCI. Bob Hoover may be a name that
some of you recognize. They were going to start with
1,200 cases and 1,200 controls but test, as you know,
500,000 tag SNPs, so using the Illumina
platform a very wide — a dense genome array. Then in their replication study,
they were going to actually expand the number of cases
and controls that they did and do a smaller number,
but still a fairly significant, a fairly large
number of SNPs, about 5 percent of them in their
first replication study. In their second replication
study, almost similar number of cases and controls, but
a smaller number of SNPs and you notice this funnel kind
of narrowing down in a third replication study down
to maybe 200 or so, and then perhaps ending up with
maybe 25 to 50 loci when they’re at the end
of it. When they actually did the
study they ended up with about five or
six loci. So, replication
is key. It’s important that
the initial study — one of the things we recognized
in the working group was you have to have enough description
in the initial study to be able to replicate it and many
times these genome-wide studies are very, very
poorly described. One of the studies of Crohn’s
disease famously described its cases as Belgian
and that’s all, you know, and so it’s very
difficult then to know how really to sort of try to
replicate such a thing. Participation rates and flow
charts of selection would be very useful to have in order for
an epidemiologist or others to know how selected
a population is. Methods for assessing affected
status very often are not described or are just sort
of it was a clinician — you know, clinician’s diagnosis;
table one sort of describing cases and controls,
how they compare and other factors,
rates of missing data, assessing population
heterogeneity, genotyping methods and
quality control metrics. Very often these — more and
more they are coming to be included in genome-wide reports,
particularly because for our replication working group we
had four journal editors as coauthors, so that
helped a lot. And then the replication study
should have a similar population, a similar phenotype,
the same genetic model, the same SNP in the same
direction — it’s amazing how some replication studies,
and I’ll show you some tomorrow, didn’t even find sort
of the same allele or the same direction of association and
still claimed replication — and then adequately powered to
detect the postulated effect. So, how was this done in some of
the genome-wide studies that are out there? One of the first and the —
or one of the earliest very big ones was this breast
cancer study from the U.K., where they actually started with
a very small — relatively small number of cases and controls,
about 400 cases, 400 controls, but they selected them to be
strongly familially loaded. So, I believe these women had to
have at least two relatives — first-degree relatives
with breast cancer. They tested 260,000
SNPs in them. Their stage two was
10 times as large, so 4,000 cases,
4,000 controls. 5 percent of their SNPs were
carried forward into stage two. Stage three was four times as
large — six times, sorry, as large again, brought forward
only 30 SNPs, and then finally came out with six SNPs that
were significant across all of these studies. And these are all of the cohorts
that they used to get to their 40,000 — 50,000 some subjects,
so a huge, huge, huge interaction, really a
global collaboration. One of the things that’s
important to screen out, too, or make sure you don’t miss
are the false negatives, shown here, “And now
Edgar’s gone. Something’s going
on around here.” So, you want to be sure that
you’re not missing even subtle false negatives and the
way this was approached in the CGEMS prostate cancer study
was to take a larger number of cases and controls and more
SNPs because they wanted to, you know, catch as
many as they could, but then basically took,
you know, 4,000 cases and 4,000 controls, so they
modified their design a little bit. They brought forward 5 percent
of their SNPs, as the Easton breast
cancer study had done, and they selected everything
at p less than 0.068. I’ve forgotten how they came
to this; it was some fancy false positive report
probability parameter. But, at any rate,
what’s interesting is when they then did their —
when they compared their sort of first and second stages,
and the way these studies are analyzed, they’re often
analyzed together in a joint analysis correcting your
p-values for that joint analysis, here are the SNPs that
came out and the genes that were associated, the p-values
from stages one and two, but what’s neat to look at is
the initial rank from the stage one study. This particular SNP was ranked
24,000, so it was way, way, way, way down, you almost didn’t
pull it up into the 26,000 that were tested, and similarly
these other SNPs that were strongly positive. You know, one of them was just
above — sorry, just below that and several of them were
not in — certainly in your top 100 and the initial
p-values even, you know — only this one was even probably
on anybody’s radar screen. So, recognizing that it is easy
to miss important associations as well and, again, these,
you know, these need further replication and further
investigation, but it is sort of a harsh lesson, I think,
that if you just carry forward your top 100 SNPs, you may be
carrying forward all your genotyping error and
not a whole lot else. Tom asked me to comment a bit
about genome-wide association cohort studies. This is beginning to be done
in cohorts that have been prospectively collected,
people either free of disease or population-based
samples with and without disease, however, you know,
they may come, and exposures measured
over time. Particularly in the Framingham
study, you may have seen this report in “BMC Medical
Genetics.” This is sort of the cover
paper that describes the 100K SNP genotyping study
resource, and then there were 17 phenotype working group
reports published in the same issue. And Framingham has undergone
500K genotyping and that’s available in the Framingham,
it’s called the SHARe Resource. I’ve forgotten — SNP something
— SNP Health Association Resource, I believe,
is what it stands for. Those data are also available
through dbSNP — I’m sorry, dbGaP through a controlled
access process. And the Women’s Health Study,
which is one of the Harvard women’s cohorts, has also had a
genome-wide association study done in 25,000. These are — these women
were nutritionists, dieticians and physical
therapists, I think. They were health professionals
but not nurses. So — and these data will also
be made available through a controlled access process. This is the dbGaP sort
of entrance page. The database of genotype and
phenotype was developed by the National Center for
Biotechnology Information in sort of recognition that
genome-wide association was coming, that these data needed
to be made available in a way that could be managed
responsibly and still be accessible, and so it was
developed basically as the Framingham resource and the GAIN
study, which I was involved in, Genetic Association
Information Network. We’re moving forward, we kind of
developed together, developed policies and
developed the resource. And you can see that,
this may not be the latest screenshot from here,
but Framingham SHARe is certainly on this if
you click on it. And I don’t have a screenshot
from Framingham SHARe, but going down to the ADHD
study, which I know a little bit better, there’s a
description of the study that actually can go on
for quite some time. This is a syllabus
or a summary of it. You can search within
for various things. You can also look at particular
variables, details are — Vs are for variables,
D, I think, is for data that’s available,
et cetera. You can also ask for information
on how to apply for individual level data and if you have an
ERA access number at NIH, you’re eligible to apply. Anyone who has submitted a grant
should have an ERA number. You have to get certain
credentialing through your institution in order
to get such a number, and we felt that that was kind
of the best way of credentialing a requester so that we
didn’t have, you know, kids in garages requesting
data and that. It’s not that hard to
get an ERA number. There are some steps that
you have to go through, and so if you don’t have one and
you want access to these data it is possible to get one
but it is a bit more of a credentialing step. And then there are sort of,
you know, how one gets to these and what the use
restrictions might be on a given dataset so that
you’re aware of those. And then maybe just to finish
up, and we’ll let you out a little bit early
on this lovely day, genetic association
and clinical trials. There has not been as much work
done in clinical trials on genes and genetic associations,
certainly not as much as has been done in observational
studies, even though most of that was candidate
gene work. Some interesting stuff
that came out in 2000, the year 2000, on beta
adrenergic receptor polymorphisms in response
to albuterol in asthma showing a really pretty
profound association with a particular variant,
the 16 arginine glycine variant associated
with actually worsening pulmonary function if you
had the variant allele. TCF7L2 polymorphisms,
this is that variant in diabetes
that I mentioned earlier, has been typed in the Diabetes
Prevention Program. Diabetes Prevention Program
was that clinical trial that looked at incidence of diabetes
in pre-diabetics, so people who had elevated
blood sugar or, you know, impaired glucose tolerance
or obesity. In three arms they were
randomized to physical activity, I believe, increased
physical activity and diet control, Metformin, and then
sort of health advice and showed that the increased
physical activity and weight control actually was the
most effective there, and in fact the TCF7L2 and a
couple of other genes have been genotyped in that study
showing basically that the interventions work differently
in some genotypes, not for TCF7L2, but for
some of the others, and it’s a nice way actually
to make use of clinical trial information to see if you can
find variants that affect or interact with various
treatments. And this was also done
in the ALLHAT study, a paper from Boerwinkle and
Arnett and their colleagues looking at a variant in —
I’ve forgotten what NPPA is, but at any rate, this was
just recently published in January of 2008 and it is,
again, a way of looking at — making use of clinical trials
that have DNA available to test genetic associations. There really have been only two
that we could find genome-wide association studies
in clinical trials, one being this very small one,
looking at hepatic adverse events, elevated transaminases
in people who received an anti-clotting agent called
ximelagatran, which reminds me of sort of a Transformer
name, but at any rate, was associated with one
of the MHC DRB1 alleles. But when you look at what their
data consists of it seems to me that, boy, this is just begging
to be a false positive, but, you know, it needs
to be replicated. And then a second one,
a little bit more robust perhaps, looking at response
to interferon beta therapy in multiple sclerosis and showing
an association with a particular variant. So, to sum up then, candidate
gene association studies have been enormously prone to
spurious association and have, I think, received some
appropriate skepticism because of that. Genome-wide association really
provides a new paradigm unconstrained by our current
imperfect understanding of the genome. Initial findings have been
really surprisingly positive. You know, sometimes we sort
of pinch ourselves, are we awake, but they are
— they’re robust, they’re replicated and now,
you know, some important aspects of biology are
coming out of these, so they’re probably real. It’s beginning to be applied
in cohort studies; more needs to be
done in that. And very little work has been
done in clinical trials and treatment response. So, I think one of the key
lessons from this is that we need to get epidemiologists,
clinical trialists and geneticists together. And as my closing Gary Larson,
“What have I always said? Sheep and cattle
just don’t mix,” and you can see they’re
having trouble here. So, I think I’ll stop at that
point and be happy to take any questions. Why don’t you
go first? Oh, yes, you need
your microphone. Linda, we’re going to give you
an alternative career here. So, are we — can we turn
the microphone on or — [Female Speaker]
[Unintelligible] [Dr. Teri Manolio]
Just a second —
oh, there you go. [Male Speaker]
I just want to make sure
I understood you correctly about genome-wide associations
and twin cohorts. Is it that — you know,
in a co-twin design you sort of toss out the exposure
concordant twins and the only thing that lends to the
association of the exposure discordant twins? So, if you take exposure
concordant twins and then they’re disease discordant,
does that speak to a genetic difference? [Dr. Teri Manolio]
It may, if you’re talking
monozygotic twins? [Male Speaker]
Monozygotic, yes,
absolutely. [Dr. Teri Manolio]
Okay, so exposure
concordant — [Male Speaker]
Disease discordant. [Dr. Teri Manolio]
Disease discordant. [Male Speaker]
Yeah, so now it’s like —
it sort of adds an interesting spin to the
co-twin design because, like, usually we think
about monozygotic twins as, you know, the association is
adjusted for genes, right, in monozygotic twins because
they’re matched, but now I’m not sure you’re
saying that with this epigenetic stuff they’re looking
for changes in the genetics. You know, if you have —
if you have exposed — in a twin pair if they’re both
exposed one gets the disease, one doesn’t, there must be some
genetic change in the other. [Dr. Teri Manolio]
Well, no, I mean,
it could be — if you have exposure
concordant twins — [Male Speaker]
Right, right. [Dr. Teri Manolio]
And you assume that the,
yeah, the exposure’s exactly the same in both of them,
is that exposure then doing something to their genomes
in a different way — [Male Speaker]
Right, right. [Dr. Teri Manolio]
— I guess is what
we’re asking. And you know, I’m not sure I’d
necessarily put the constraint of matching on the exposure
since it’s so difficult to do anyway, it’s sort of ideal,
but epigenetic changes happen in, you know — they generally
are methylation changes and so they’re thought to be related to
sort of dietary folate and that, but nobody knows really how
they come about and where they go. So, maybe it’s that, you know,
both twins smoke and one of them, you know, has an
epigenetic modification from somewhere else — [Male Speaker]
Something else, yeah. [Dr. Teri Manolio]
Right, that would cause them —
maybe their [unintelligible] transferase gene is turned on in
one twin, which is important in metabolism of nicotine,
it’s turned on in one twin and turned off
in another twin. So, that might be a design,
you’re right, for finding some of those genetic variants
that, you know, are not on the — it’s not the sequence
itself but something related to the sequence, but there’s
also a lot of other variability in there to try
to tease out. [Male Speaker]
Okay, thank you. [Dr. Teri Manolio]
Yeah, twin studies are
challenging but there are real opportunities there and
we shouldn’t just discount them, recognizing, too, that even
dizygotic twins, you know, they’re matched for age,
which is great, they’re matched for a lot
of exposures and, you know, half the time they’re matched
for sex, so that’s cool. Sir? [Male Speaker]
Two questions. One is a really stupid question
to show my ignorance — [Dr. Teri Manolio]
There are no stupid questions. [Male Speaker]
But then let me ask a
non-stupid question. Well, recently we received
some, well, sort of a question from the NHLBI
project officer to reconsider — [Dr. Teri Manolio]
I’m sure they were
brilliant questions. [Male Speaker]
No, no, no, this
isn’t brilliant. [laughs] To reconsider — you just showed
that for these genome-wide association studies or
different cohort studies, you know, like SHARe or
CARE [phonetic sp], they plan to put on the Web,
I mean, so people can apply to do the analyses
and so on. So, one of the things
that concerns a lot of investigators is the
confidentiality, and in the past we received
a letter from that office saying this is
not a human — [Dr. Teri Manolio]
Not human subjects research. [Male Speaker]
— their issue, but now it
sounds like there are some different considerations on
that, that they consider this letter is no
longer valid? [Dr. Teri Manolio]
You know, that letter has
been very controversial from the beginning. So, this was something that
was put out by the Office of Human Research Protections
in August of 2004, and it was their finding that
basically if data are de-identified, and one
can debate as to what de-identified is, but what —
their definition of de-identified was the 18 HIPAA
identifiers, which, you know, age and birth date and
stuff like that. If those were off, then this was
not human subjects research. Now, that was a guidance that
was put out to IRBs and it really was up to an IRB as
to whether or not to accept that guidance. Most IRBs have accepted it,
and have basically said this is not human subjects research. The institutes at NIH or
elsewhere can’t tell an IRB what to think
and how to act. Some institutes have asked
the IRBs, “Are you sure? You know, do you really feel
that this is human subjects research or it’s not human
subjects research or whatever,” but others have just accepted
whatever an IRB says. So, I’m not sure what question
you got back from — [Male Speaker]
Well, ours is saying that
it’s not human subjects so they actually approved
[unintelligible] approval rather than going through
the whole panel. [Dr. Teri Manolio]
I think it’s important, too,
not to perceive or portray what’s going on in dbGaP as
posting the data on the Web, because it’s
not that. You know, what’s posted on
the Web is a description of the study. The data still have to be —
you have to go through a fairly complex process in
order to receive them, and you have to agree to keep
them secure and to maintain confidentiality and not send
them to anybody else, not try to identify anybody
and there are, you know, some fairly significant
sanctions for not doing that that have to do with your
relationship with the NIH. You know, people could sue
you and that sort of thing, but it’s much more a matter of
if NIH were to find out that you had misused these data,
you’d never get any more, that’s for sure. And you’d probably have some
difficulties with other aspects of interactions
with NIH. [Male Speaker]
Now the stupid question. [laughs] Well, with the genome-wide
association studies sometimes you’re just identifying
the region, right, the region that’s associated
with certain diseases. So, I remember last year in
“Science,” I think there were five studies that confirmed —
that showed certain regions that associated with MI,
like the risk increased by 23 percent
or something. [Dr. Teri Manolio]
I thought you were going to
go with the obesity one. I’m not sure about the —
with the — [Male Speaker]
Yeah, I forgot the exact —
but I wonder if you identify a pretty narrow region,
why can’t you just identify the genes? [Dr. Teri Manolio]
Well, you know,
some regions — [Male Speaker]
I mean, just bite the
bullet and do it. [Dr. Teri Manolio]
Oh, sure, and you know,
in some regions that you identify there’s just one gene,
and isn’t that wonderful and you say, “Oh, there’s just
one and it must be,” like I showed you there,
you know, the IL23R. “Oh, that must
be it.” Well, yeah, but, you know,
we also know that there are some regions where there aren’t
any genes and you’ve got an association, so maybe it’s
something about the way that region interferes with the
way something else happens in the genome. Or there are regions that
have two or three genes, and I’ll show you an example
of that tomorrow. And so, so you know, trying to
really figure out which variant it is that’s causing your
phenotype is a real challenge, and so you go through
various steps of trying to, you know, identify what’s
in the neighboring region, what’s it in linkage
disequilibrium with, is it conserved across species,
so did evolution somehow think it was important to
keep it the same? And if it’s heavily conserved
that suggests that it’s pretty important. And then there are other ways
you can knock it down. You can sort of give interfering
RNA and see what effect it has if you reduce the
function of that. You can see if
it’s expressed. There are a variety of
ways of testing that, which I’m not a molecular
geneticist so I can only, you know, kind of skim the
surface of that for you, but we’ll go into that
a little bit tomorrow. [Male Speaker]
I think your
[unintelligible] Korean general
epidemiology study. So, you know, it’s a very big
study and it’s still ongoing. We already collected from
more than 20,000 people but it is still ongoing. And the last several years we
had several debates about study design and most was the
[unintelligible] sample size and all kinds of phenotypes
we had to measure, but I think we had little
attention about the representativeness of
study populations. Some of the centers are
recruiting members from cancer screening centers or
even from some hospitals because it is easier
and costs less, but I’m not sure about their —
about the representativeness of our general
epidemiology study. Can you explain a little
more about the importance of representativeness of the
general epidemiology study? [Dr. Teri Manolio]
Yeah, well, you’re probably in
the center of places who know about representativeness
and the importance of it in epidemiologic studies. I think I can comment on it for
genome-wide association studies. It doesn’t seem to be quite so
critical at least for the genetic variants we found to
date because they seem to — you know, usually what you
expect — either you miss a whole bunch of stuff because
you bias yourself toward the null, or you find a lot of
spurious things because it’s biased. And yet, you know, the one
study that I think we would all have said, “Don’t do it
this way,” was the Wellcome Trust Case-Control study where
they used blood donors basically as part
of their controls, and then the ’58 — 1958 birth
cohort, sort of the survivors of that birth cohort,
as the other half of their controls. None of them, you know, were
in the same cities and the same places as their cases and
they hadn’t been ascertained in the same way and probably
that’s why they didn’t find an association with
hypertension, because hypertension was likely
very common in the controls as well and they hadn’t phenotyped
their controls. And yet the associations that
they found have been replicated again and again
and again. So, maybe if all you’re looking
at is that DNA sequence, maybe it’s not so bad,
but once you want to get out into either
understanding gene function or understanding,
you know, gene environment interaction and how those
associations are modified, how they change over the
lifespan, then I think, you know, you’re sunk
with a study like that. So, that’s where representativeness
helps you. [Dr. Thomas Pearson]
We’re going to talk about
some of the issues of bias, which the non-representativeness
of the cohort could be any one of a bunch of them. But, I agree with Teri
that the current — it’s been amazingly robust
to identify some of the polymorphisms as it was,
but I think if you were to, like you’re doing with
the study that large, you’d have the opportunity
to have the power to look at quite a robust view of the
genetic causation of diseases. Kind of what my reading
of literature is, is that we’ve been kind
of skimming across the causative genes and identifying
perhaps the prevalent and large odds
ratio ones. And usually what the
bias does, as you all know, is bias toward the null,
so your sensitivity to find all of the genes may be
hindered or may be even made spurious by
non-representativeness, so that as we go through
and look for the first cut major genes I agree with Teri
that it’s been amazingly insensitive to bizarre
groups being compared. But I wonder if we really want
to get down to saying this is really what the whole
biology of this disease is, and down to some of the very
small polymorphisms that you’re going to need to get
to some of those more large and elegant and well
represented studies. I mean, I think
it’s a — [Dr. Teri Manolio]
It’s a real sort of
creative tension, and when you talk to the
Wellcome Trust folks, you know, who designed this
study and they did it at a time — this had not
been done before. I mean, they really — you know,
there was the macular degeneration study and
that was basically it, and they just wanted
to find something. You know, they sort of said,
“We don’t want to find it all, we don’t even want
to find a majority, we just want to
find something,” and they, you know,
achieved that.
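
The staged designs described in the talk can be checked with a little arithmetic. What follows is a minimal sketch, not anything taken from the studies themselves: the function names are hypothetical, the SNP counts and the 5 percent carry-forward fraction are the round figures quoted in the talk, and the null calculation assumes independent SNPs with uniformly distributed p-values, which real data with linkage disequilibrium will not satisfy exactly.

```python
# Illustrative only: names are hypothetical; figures are the round
# numbers quoted in the talk, not values from the published papers.


def snps_carried_forward(n_snps: int, fraction: float) -> int:
    """SNPs taken into the next stage when a fixed fraction is kept."""
    return round(n_snps * fraction)


def expected_null_passes(n_snps: int, p_threshold: float) -> float:
    """Expected number of truly null SNPs with p below the threshold.

    Assumes independent tests with uniform null p-values.
    """
    return n_snps * p_threshold


if __name__ == "__main__":
    # Breast cancer study as described: 260,000 SNPs, 5 percent kept.
    print(snps_carried_forward(260_000, 0.05))
    # Prostate cancer plan as described: 500,000 tag SNPs, about 5 percent kept.
    print(snps_carried_forward(500_000, 0.05))
    # If no SNP were truly associated, how many of 500,000 pass p < 0.05?
    print(expected_null_passes(500_000, 0.05))
```

The sketch restates the point made in the talk: a permissive first-stage cut carries forward mostly null SNPs by design and leaves it to the later, larger stages to sort them out, whereas a strict top-100 cut would have dropped a SNP ranked around 24,000.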
