Study Designs: Family-Based Studies


[Dr. Thomas Pearson]
Well, I want to continue
some of the comments that Teri had made on
family-based studies, and really talk about
family-based studies as the preamble, where the ideas
came from that diseases were aggregating in families,
which then of course has led to the rationale for perhaps
less hypothesis-oriented studies like the genome-wide
association studies. So, I want to just talk a little
bit about study designs that generate or test
genomic hypotheses, kind of in a broad more
philosophical sense as we start talking about
genetic associations. We are epidemiologists after
all, and we can associate almost anything. We want to describe the
major study designs, which involve genetically
related individuals, and these are a
couple more. I want to talk a bit more about
twin studies and then get into something called trios that
I introduced with the Crohn’s study that I started
with, and give you some literature examples and then
talk about the advantages and disadvantages of these designs
in disease-gene associations. But just to say, for many
diseases the associations in families obviously
came from the bedside. William Osler talked
about familial aggregation of coronary disease;
William Osler is kind of the father of internal medicine,
certainly around the turn of the century. Hippocrates talked about
familial aggregation, so this is just
something observed. And here’s an observation,
some of you know David Herrington. David is associate dean for
research at Wake Forest and was a fellow, like Teri,
with me a number of years ago, and he’s now doing quite a bit
of genetic epidemiology, and one wonders if this is
where he got interested. And this is a set of my
patients; this paper is actually a series
of several twin pairs. And these two 51-year-olds
were Hungarian refugees. They crawled out from under the
barbed wire in Hungary in 1956 together, they went to
the same universities, they were electrical engineers
at the same defense firm in Baltimore. They smoked the same
brand of unusual Central European cigarettes. Their LDL cholesterol was
the same to the milligram per deciliter. Their blood pressures were
normal and neither had diabetes. And EB here showed up at
Johns Hopkins Hospital with an inferior lateral
myocardial infarction and a little bit of ventricular
tachycardia and underwent cardiac catheterization and was
referred to me in our preventive cardiology clinic. Six months later, AB, his
identical twin brother, on a business trip to Detroit,
came down with an episode of severe chest pain and an
inferior lateral myocardial infarction and underwent
cardiac catheterization at the Henry Ford Hospital,
and I had the films sent to me. One of the striking things about
the films is that we had to put some extra tape on them because
they were identical. In other words, if you mixed
them up you couldn’t tell who was who. Those of you who aren’t
cardiologists might not know right versus left
coronary dominance, but left dominance is found in the
minority of individuals. Both of them had left
dominance, their right and left anterior descending
coronaries were normal, and they both had
a single lesion, a 90 percent stenosis
[phonetic sp], at the same place in the obtuse marginal
branch of the left circumflex coronary. So, what is this? What is this an
example of? Well, this is an example of
publication bias, okay, because obviously this
is very interesting; it’s an anecdotal case. It really doesn’t prove
anything other than the possibility that if
you have identical genes, identical behaviors, and possibly
genetically identical organ structures, you could
have identical structural, physiological, and
clinical disease. And this has been the basis
obviously for interest in families and interest
in family studies, but I still think that
the reason that this was published was somewhat of
a bias of confirming what we already suspected. But it certainly
was interesting and certainly got across the idea that
this did have something to say about the genetics
of coronary artery disease. So we epidemiologists have been
obviously studying disease for a long time, and particularly
relating it to altered physiology like high blood
pressure or gene products like LDL cholesterol,
et cetera. And the whole opportunity
for gene association studies is to go upstream, to look
not only at some of the physiologic or protein
products of genes, but actually at the expression
of those genes and the polymorphisms that lead to
either structural differences or differences in the level
of product. And so the point is that a
lot of our gene association studies obviously start with
phenotype, but then start to explore things along this
pathway, obviously ending up with gene variants with
the new methodologies that Teri’s talked about. But the point is that it is
a logical progression of our epidemiologic activities, and so
this is, again, another reason why I think this course
on population genomics obviously is key to really the
next generation of epidemiology. So there’s a variety of
questions then regarding the genetic etiology
of the disease. Covered here are: does it
aggregate in families? Is it inherited from
parent to offspring? Which chromosomes would
carry the disease gene? Which specific genes are
associated with the disease? Which variants within
those genes are associated with it? And what gene products are
altered as a cause of it? So a variety of questions to ask
to answer the real question, is there a genetic etiology
of the disease? What we’re going to talk about
here are twin studies, linkage analysis, just some
other comments on family-based designs,
but these have all led to the possible identification
of candidate genes, which, as Teri has already
described, have a relatively sobering track record in
terms of reproducibility, although there
are certainly some major success stories. And then these genes were
tested for disease versus no disease, and would
replicate. What we have philosophically
with the genome-wide association studies is different: these are
given the adjective agnostic, because there really is no prior
hypothesis, no religious dictum by which to essentially
say that A causes B, but rather the entire genome is
tested for disease or no disease, and then the exhortations
of experts like Teri Manolio have really required
replication so that we’re not totally confused
by a lot of alpha error. So, just in terms of
genome-wide association studies on a philosophical basis,
we tell our students obviously that hypothesis-driven
research is the way to do things: you come up with your
hypothesis and you test it, et cetera. The genome-wide
association studies take a really different philosophical
approach; I think it jarred us a bit, and there’s a
variety of subtle and non-subtle implications of this agnostic
approach that I think we’re going to continue
to talk about. But given their sway in terms
of state of the art in the whole gene association area,
I think it’s appropriate to just ponder this lack, really, of
saying that we know what’s up here and we’re going to
test a hypothesis. This says we’re going to look
at a million polymorphisms and we’re going to find
out how it sorts out, and we’re going to talk
a lot more about that. Family history obviously is
an independent risk factor in many diseases, and obviously
we teach our students in epidemiology the importance
of defining a positive family history, and obviously some
of these are self-reported versus verified. It’s important to specify
the definitional elements: the age of onset, the degree of
relatedness of the affected relatives, the number
of relatives. Our departed colleague,
Roger Williams, probably, I think, has done
the most elegant work in coronary disease, looking at the number of
relatives, age of onset, et cetera,
and it does show that
the definition of a positive family history is perhaps a
little bit more subtle and complicated than sometimes
we give it credit for. We also have to remember
family information bias, and all of us clinicians
have had a patient where, say, in the coronary care unit,
dad comes in with this myocardial infarction and the
question is whether any other relatives have had this,
and you have the interrogation of the entire
pedigree to an extent that any of us would have loved
to achieve in a field study, whereas someone without that
disease wouldn’t have that. So there is this family
information bias: the flow of family information
about exposures and illnesses is stimulated by, or directed to,
a new case in its midst. So there is, perhaps,
an inflation of reported affected relatives among cases, or
a deflation of reported affected relatives among
controls compared to cases. Teri talked a little bit about
the relative risk ratio, again a measure of the strength
of familial aggregation: the prevalence of disease
in relatives of affected persons over that in the
general population, and here is a list of the
risk ratios for a variety of diseases. One of the recurrent themes
here is if you look at these with pretty sizable risk ratios:
look at autism here, these are the diseases now that
are showing up as the focus of genome-wide association
studies, and the rationale for
targeting these diseases, particularly some of the
psychiatric diseases, for example, has been these
large ratios. It certainly is the preamble to
doing some more sophisticated studies to find the
candidate genes that are causing this. Now, siblings are first-degree
relatives, and obviously, if the parents have two alleles each,
what you have is four chances out of 16 that these
two siblings will share both alleles. You have four that they’ll share
neither of the alleles, they’ll be totally different,
and in the other eight they share one or
the other of the alleles, and obviously this is the
Mendelian inheritance pattern that Teri was talking
about and allows us, obviously then, in family
studies to make all sorts of hypotheses. There’ve been a variety of
studies in epidemiology, as you all know, of nature
versus nurture. Migrant studies, for example,
are another group of studies which would be in this,
and twin studies, obviously, had their own
place in the development of genetic hypotheses. And if you look at the
genome-wide association studies, there’ll be frequent
citations of comparisons of monozygous and dizygous twins as
the rationale for such studies. So about 0.3 percent of births
are monozygous twins, and 0.2 to 1 percent of
births are dizygous. Apparently this is quite
heterogeneous geographically, with Africa having the highest
rate of dizygous twins and Northern Europe being lower. Studies of twins reared apart
obviously test the nature versus nurture, and adopted twin
studies have also been useful. There’s also additional
studies of siblings, and we’ll maybe comment on
that a little bit later, so a variety of studies
in these groups. And there are measures
of familial aggregation: in qualitative traits, the term is
concordance; in quantitative traits, correlation and heritability,
and I want to just make a couple of comments on those before we get
on to some of the study designs. The concordance is the number
of twin pairs in which both twins have the
disease, as a proportion of those twin pairs
with at least one affected twin. One would think that this
should be a two-by-two table, but obviously then, counting the pairs where neither twin is affected, everything
would be almost 100 percent concordance in an
infrequent disease. So you take the number of twin pairs
with both affected divided by the number of pairs with
both affected plus the number with only one affected. If it’s less than 100 percent
in monozygous twins, that suggests you have
non-genetic factors, and if monozygous is
greater than dizygous, obviously, it’s evidence
for genetic factors, and the simplicity of this
whole thing, I think, adds to its being convincing. This is kind of one of your
classic studies where monozygous twins and dizygous
twins were looked at and the proportion of concordant pairs
really wasn’t that different, but when it was then stratified
by age of onset, less than 50 or greater than 50, you had 100 percent
concordance in monozygous twins with early-onset Parkinson’s disease, compared to
monozygous twins with onset after age 50,
and obviously now we recognize early onset
Parkinson’s disease as a distinct disease entity,
and one in which many of the genetic studies have focused —
so just the use of concordance. And again this concordance in
monozygous versus dizygous, obviously you see the large
difference in things like non-traumatic epilepsy,
schizophrenia, bipolar disorder, rheumatoid
arthritis, psoriasis, inflammatory diseases,
structural diseases, lupus. Again, if you go down the
list among the first 109 genome-wide association studies,
virtually all of these have been targeted, and this has been
the rationale for studying those diseases as
having a genetic origin. Another opportunity is to look
at quantitative traits, and
Teri’s going to talk some
more about quantitative traits as well. Obviously you
have your nice bell-shaped curve, or possibly a skewed or
maybe even a bimodal curve, in which you can
study variance, but frequently a number
of genome-wide association studies
have looked at, say, the upper seven and a half percent
versus the lower seven and a half percent, looking for
differences in gene associations with these
quantitative traits, and in this instance
correlation and
heritability would be opportunities there. This is Manning Feinleib’s
study of blood pressure, again, despite systolic blood
pressure having had some tough sledding in the genome
association world in terms of identifying polymorphisms
related to it, there remains this correlation
of blood pressure, showing monozygous twins are much more
strongly correlated in terms of their systolic blood pressure
than dizygous twins, siblings, or parents and
offspring, and certainly more than
you would see between spouses, whose correlation would be a suggestion
of environment. So certainly a way to look
at the twin pairs from a quantitative trait basis. Heritability has been
mentioned; again, in twin studies that’s the variance in dizygous
pairs minus the variance in monozygous pairs, divided by the
variance in dizygous pairs, and it’s the
fraction of the total phenotypic variation of this
quantitative trait that’s attributable to genes. It varies from zero to one,
and if it’s greater than 0.7 or 0.8, it would suggest that
there’s a strong genetic influence, so as you read
the literature you have an idea kind of what
this means. The limitation of twin studies,
obviously, is that environmental exposures may not be identical
even in monozygous twins, or the exposures can be very highly
similar, and maybe that can be almost as confusing
as a reason for association. There can be some differences
in gene expression. There may be some heterogeneity
of the genotype within twin pairs, which has been suggested as
why some of the twins have the differences they have,
and then there’s this concern about an
ascertainment bias, in which a co-twin with the
disease is more likely to participate in a twin study than
the co-twin who is unaffected, and so just a concern that you’d
want to have good participation rates from all the twins asked
to participate in a twins study. We now want to talk a little
bit about linkage analysis, and this is a family-based
approach to the identification of susceptibility genes,
or at least starting to be able to locate them
on the chromosomes, and where they might
be in the genome. Linkage is the tendency for
alleles at loci that are close together
to be transmitted together as an intact unit, or
haplotype, and this has to do with recombination. As Teri mentioned,
the further apart the genes are,
the more likely it is over time that there’ll be a recombination
event after which they end up on different chromosomes. So we try to measure this
frequency of recombination with the recombination fraction, theta.
This varies from 0 to 0.5, with 0 being tightly linked,
no recombination, these two genes are always found together,
and 0.5 meaning they’re unlinked and totally independently
assorting. The distance between them is
then given in centimorgans; this is a map distance rather
than a physical
distance. It’s the
genetic length over which a crossover will
occur in 1 percent of meioses, so it gives a certain
genetic length. And again, as Teri’s mentioned,
there are recombination hotspots, et cetera, that would obviously
distort that, making loci look like they were
even further apart. So as you go down generations,
if each of these is a generation, obviously you
have recombination events, so that the genes with very
little recombination between them obviously end up being very closely
clustered together, and the linkage
disequilibrium suggests that they’re being passed on
together and are physically associated as one goes down
through these generations. So this whole idea of linkage
disequilibrium, obviously, is tied to
the recombination fraction:
the extent of
recombination is a function of the number of
generations and the recombination fraction, so that loci that are
far apart will become
completely dissociated, reaching complete equilibrium
in very few generations, whereas those with a
very low recombination
fraction, that is, theta being almost zero, may over
many thousands of generations never really come even
close to equilibrium, to being basically dissociated. So this is the background
of linkage analysis. So in looking at linkage
in family studies, you would assume a mode of
Mendelian inheritance, [unintelligible] dominant,
et cetera. You would identify markers with
known positions to serve as the references, and then you would
determine the number of first-degree relatives
who show recombination, assuming different
values of theta. And then this LOD score
that Teri had introduced is based on the ratio of the likelihood
of observing the family data that you observed up here with
the various values of theta to the likelihood of observing
the family data if the loci were totally unlinked. So, you take the family
data from up here, make certain assumptions about
what the recombination fraction would be, and then assess
it relative to no linkage, a theta of 0.5, that is,
complete dissociation. So this LOD score, the logarithm
of the odds, or Z, is the log, base 10, of the likelihood of the data
if the loci are linked at a particular level of theta versus
the likelihood of the data if they’re unlinked. The best
estimate of theta, the recombination frequency between
the marker locus and the disease locus, is the value at which
Z is greatest. A LOD score of greater
than three corresponds to odds of more than a thousand to one in favor of the
loci being linked at that level of theta, and LOD scores can
be added across the families. So what you’re trying to do,
essentially, within these
linkage blocks, these blocks of linkage
disequilibrium, is identify the likelihood
that your markers
are linked together, the likelihood of that occurring
versus them perhaps being in another block of markers,
and the log of that ratio of the odds is the LOD score and gives you
an idea of which ones were or were not linked. So I think the thing to remember
in terms of reading about these is that a LOD score
of greater than three would be what you would be
interested in as identifying loci that are physically
linked on the genome. Now, the first example I gave
was talking about trios, and this is a study design
which is a little different from what we frequently use
in epidemiology: it’s the affected offspring
and both of their parents, and basically that’s all,
that’s what the trio is. There are no unaffected offspring
or other individuals, and the phenotypic assessment is only
in the affected offspring. The genotyping is in both
parents and the affected offspring, so you spend your
phenotyping money only on the affected child and
you spend your money on the genotyping of all three,
so it’s relatively efficient in terms
of expenditure of phenotyping resources. These are used in both
discovery and replication GWAS, and you can come up
with examples of both, in which a trio design was used,
say, to identify from, say, 500,000 SNPs
a smaller number, 20,000 or so, that then
could be put into a case-control design. Probably more frequent would
be your typical sequential design of a GWAS,
we’ll say a half a million SNPs, then replicated with one or
two case-control studies, with a trio design
being involved at one of those phases because of some of the advantages that trios
have. It’s kind of a different study
design, but it’s also not susceptible to population
stratification, which we’re going to talk
about tomorrow. This kind of
confounding is due to the sampling of cases and
controls from populations of different ancestries. Well, clearly in a trio you
know who the ancestors are; you’ve got the two parents
and you’ve got the affected offspring, and so this is
not a problem in trios. And so as one component of
a multi-stage genome-wide association study, this could
have some advantages. The test that is done then
tests whether any given allele at a given locus is transmitted
to the affected offspring by parents more frequently
than expected by chance. The chance would be 50 percent:
heterozygous parents would transmit the alleles at a given
locus in equal frequency, so there’s a 50 percent chance
of any given allele being the one
passed to the child. Affected offspring should
receive the disease-associated allele more frequently; the non-transmitted alleles serve as the comparison,
and therefore there’s no need for a
control group. And this is called the
transmission disequilibrium test, TDT. So, here’s a study of type 1
diabetes with this particular allele, and what you have are
offspring who are not affected with diabetes and those families
which have a proband, a child affected with diabetes,
and in these children there’s this many transmitted
and this many not. This should be 50/50,
and you can see it’s almost 56 percent in the affected group; in this
non-affected group you can see it’s basically
one to one, 50/50. These are not significantly different
from 50 percent; these are highly significantly different
from 50 percent, suggesting that this particular
allele is transmitted more frequently to
affected offspring than by chance. So this TDT is again a little
different study design than we have in other parts
of epidemiology, but I think still quite an
efficient and useful one. It gives you very similar data;
this is another type 1 diabetes study from Hakonarson, and
here’s actually three SNPs. Again, with a case control
study done as part of this, here’s the allele that was
looked at and the minor allele frequency. So in this instance what you
have is your case control study, and the minor allele
frequency in cases here is greater than in your controls, which gives
you an observation of .8 in the P-value. In this instance, the controls
have a higher minor allele frequency than the cases,
and so this allele is a protective allele:
gives you an odds ratio of less than one,
et cetera. And if you then look within
a subgroup of the study, a second phase of this study,
again you still had the comparison of the same allele
with, say, the wild-type allele, the major allele,
and you can see the transmission here, rather
than being 50/50, was much different
from that. It’s the other way because it’s
protective, and so you can see in this particular study where
they did both of these in different subgroups a very
similar kind of information from the case control study
to the trio study. So, I think if I were to design
perhaps the ideal genome-wide association study,
it would be nice to have one of the replications
perhaps as a trio because you’d obviate this risk of having
population stratification. Now there are some
limitations. Obviously, one of them is it’s
difficult to assemble trios if there’s a late onset of disease
in the affected child. Obviously you need the parents
and so if you have a late onset of disease, you’re going
to have some difficulty assembling the trios. Secondly, and more subtly,
they’re sensitive to small degrees of genotyping error,
in which the transmission proportions between parents
and offspring get distorted, and among the
109 GWAS that I’ve reviewed,
this study by Kirov on
schizophrenia is an example of
that, where it appeared
that they actually handled the genotyping differently in the
parents than in the proband and came up with all
sorts of distortions, which are described in this
paper, so there are some disadvantages of this. There are some other issues
to talk about with family-based designs. There also have been genome-wide
association studies in affected and unaffected siblings,
and a kind of TDT has been used to
analyze those. An area that I find very
interesting is trying to account for the heritability
or genetic risk. In other words, if you have a
positive family history and you add the genes and the
risk factors to it, can you account
for it? This kind of gets to the
question of when are we done? In other words, how many gene
variants do you have to study before you say I’ve accounted
for the genetic aggregation of this disease? For example, this would look
like, say, if your multiple logistic equation had a term
for positive family history and, say, this gave a
relative risk of, you know,
two or three or something, what would happen if you added
the various polymorphisms to that? And some of
the studies have done this and talk about the percent
of familial risk, which is accounted for
by these gene variants. There also could be, obviously,
the multiple adjustment of intermediary risk factors to
identify risk in first-degree relatives, and this obviously has
been discussed a lot in the Framingham study, in which
the initial analyses showed relatively little
predictive value of family history after the adjustment for
cholesterol and blood pressure, et cetera. This has reemerged with perhaps
more precise risk-factor data from the multiple generations
now of the Framingham study, and so I think
this continues to be an interesting area. This is a study that I’ve been
involved with for a while. This is the sibling
study with Diane and Lou Becker [phonetic sp] at
Johns Hopkins Hospital. We started enrolling siblings,
30- to 59-year-olds, of patients with
coronary disease with onset at less than 60 years of age,
and then following them forward for incidence
of coronary events. Turns out that their 10-year
risk is about 20 percent overall, so it’s a relatively
high-risk group just on the basis of them being brothers
and sisters of early heart disease patients. We also calculated the 10-year
risk from the Framingham risk score
at baseline, and these individuals,
particularly the men, had a 66 percent excess
risk over what would have been predicted by the Framingham
risk score at baseline; women were closer, only about
a 12 percent increased risk; but the suggestion is that,
at least in this group, the Framingham risk score
in siblings really falls short, particularly in men,
and there are some additional things there which
I think we would suggest would be genetic. So, in conclusion, I think
family-based studies have been the cornerstone of
identification and quantification of familial risk
and the heritability of human diseases, and, again, do provide
the rationale for getting into larger and more complex and
more expensive study designs. The linkage analysis identifies
the location of genes with known markers, and we’re going
to hear about the HapMap and other studies from Teri next,
and I did want to talk about this trio as a family-based
design that’s been used both for discovery
and replication in GWAS, and certainly in
candidate gene studies. So the family-based designs,
I think, will continue to be useful. They’ve been, again, I think
incorporated with some of the genome-wide approaches,
but they still form, certainly, an important part
of the genetic epidemiology literature. Questions?   Bill.   [Bill]
So, am I correct if I think
of heritability as sort of roughly attributable risk
for all the genetic exposure? Is that a similar
concept or not?

[Dr. Thomas Pearson]
Heritability is the percent
of variance explained. I think it’s kind of the
discrete versus continuous. I think they’re kind of
apples and oranges, but I think, for me,
the heritability has to do with quantitative traits
and the extent to which the variability in those
quantitative traits can be explained by
genes. Attributable risk is what
proportion of those cases can be accounted for by that
gene, and so I think, just mathematically, they’re
quite different from the get-go.

[Dr. Teri Manolio]
You know what,
I would agree. I mean attributable risk
is the proportion of disease that can be
explained. Really with heritability you’re
looking at the proportion of variability in disease,
whether they have disease or not, or the proportion
of variability in a trait. So, while they’re somewhat
related concepts, I think
they wouldn’t map.

[Bill]
Okay.

[Shelby Wong]
Hi, Shelby Wong,
from Children’s Memorial Hospital. I just wonder, for twins,
besides the utility for estimating heritability, what other
utility could there be for genetic studies, either association
studies or GWA studies?

[Dr. Thomas Pearson]
Well, I think there are many,
many opportunities within twins to study a great
number of things. Obviously the extent
of concordance and heritability is of interest;
also of interest within the monozygous twins is what some
people call discordance, or
lack of concordance,
because then you can start looking at, if the genome
is essentially the same but the phenotype has
some differences to it, the flip side of
the coin: what
could have caused that. So in the same way that you’re
interested in twins because their environment
is very much the same, and then with dizygous twins
versus monozygous twins you can look at the difference
in the genome, I think within monozygous twins
with differences in phenotype you can look at the extent to
which there are a variety of not only environmental
but, I think, some other issues, like epigenetic and other
kinds of post-genomic things that have been going on. And this could get into things
like pharmacogenetics and a whole variety of things,
because you’re basically, you know, stratified by
the genome, so you have a complete control.

[Dr. Teri Manolio]
One other thing that you can
look at is, as Tom said, sort of post-genomic
modification, so epigenetic modifications
caused by the environment. And there have been some nice
studies showing that epigenetic patterns in identical
twins are very similar at young ages and very, very
different, you know, in their 50s or so;
it’s kind of the classic study. Also, there was a recent study of copy
number variants showing that many copy number variants actually
arise somatically, so sort of after, you know,
embryogenesis, and
showing differences
between identical twins in the numbers of copies
and an association with, I think it was schizophrenia,
one of the psychiatric diseases.
