CLINICAL BIOSTATISTICS

Alvan R. Feinstein, M.D.
Professor of Medicine and Epidemiology,
Yale University School of Medicine,
New Haven, Conn.

The C. V. Mosby Company
Saint Louis 1977
Copyright © 1977 by The C. V. Mosby Company
All rights reserved. No part of this book may be reproduced in any manner without written permission of the publisher.
Printed in the United States of America
Distributed in Great Britain by Henry Kimpton, London
The C. V. Mosby Company
11830 Westline Industrial Drive, St. Louis, Missouri 63141
Library of Congress Cataloging in Publication Data
Feinstein, Alvan R
Clinical biostatistics.
Originally appeared as essays in the journal Clinical pharmacology and therapeutics.
Includes bibliographies.
1. Medical statistics -- Addresses, essays, lectures. 2. Medical research -- Statistical methods -- Addresses, essays, lectures. I. Title. [DNLM: 1. Biometry. 2. Epidemiologic methods. HA29 F299c]
RA409.F36 610'.1'5195 77-3703
ISBN 0-8016-1563-1
CB/CB/CB 9 8 7 6 5 4 3 2 1
For my mother, Bella, and my wife, Linda, who have given me roots, wings, and love
PREFACE

When I began writing the series of essays called "Clinical biostatistics" in 1970, I thought I would run out of material in about a year. I knew I had some unorthodox things to say and some unconventional viewpoints to develop, but I believed the development would require only six or seven essays. As I became more deeply immersed in biostatistical ideas, however, I constantly found more to do. At every level of contemplation, ranging from the massive logistics of large-scale clinical trials, to the elaborate complexities of "retrospective case-control" studies, to such apparently simple questions as how (and why) to calculate a standard deviation, the world of biostatistics seemed beset with scientific problems that had received unsatisfactory solutions because the -statistics had been given more attention than the bio-. As I began grappling with those challenges, the essays continued to proliferate, until now almost 40 of them have appeared.
While the essays were making their bimonthly and later trimonthly appearances in Clinical Pharmacology and Therapeutics, readers of the journal were highly complimentary; and many urged me to publish the series as a book. At first I resisted this suggestion, mainly because I have never liked this type of autoanthology. The author of a book, I thought, should prepare a suitably renovated text, not just an unaltered collection of previously published papers. My resistance to the suggestion eventually collapsed, however, under two sets of pressures: time and audience. As I kept wanting to prepare that new text while never finding the necessary time to do so, I realized that the only hope for achieving a book in the imminent future was to preserve the original essays. Furthermore, many readers kept assuring me that the original essays should remain intact (that their "spirit" might be lost in a revision) and that the book would be more enjoyable to read if it preserved the informality of the original prose.

Adding to these incentives for an "anthology" format were the fiscal concerns of The C. V. Mosby Company, publishers of the journal where the "Clinical biostatistics" essays have appeared. The cost of publishing the book could be substantially reduced if the essays were maintained in their original form, with each text and bibliography unchanged.
Accordingly, this book contains a collection of original essays from the "Clinical biostatistics" series. They have been rearranged as chapters, into a logical pattern that differs from the chronologic sequence in which they first appeared. A few of the original titles and many of the identifying numbers of the essays have been changed to conform with the current sequence of chapters. Otherwise, the texts and lists of references for each essay remain intact.

To keep the book from being too large, I have omitted a few of the essays that contained quantitative surveys of the medical and statistical literature, digressions into the ethics of clinical research and the teaching of statistics, or specific critiques of individual investigations. The remaining 29 essays have been divided into an introductory chapter that is followed by five major sections, each preceded by a brief commentary.
The essays are intended for people who have already developed an interest in biostatistical issues. The interest may have arisen spontaneously; it may have been necessitated by the demands or comments of a manuscript reviewer; or it may have been provoked by the efforts needed to understand the many mathematical machinations that are used in published reports of current research. I assume that the reader is aware of rudimentary statistical tactics, but is otherwise not particularly adept in mathematics and is possibly frightened by it. With that assumption, the goal is to enlighten and perhaps to entertain with the style of an essay, not to educate with the formality of a textbook.
Conventional textbooks and courses in biostatistics are usually devoted to the theoretical processes that produce such mathematical calculations as P values, confidence intervals, correlation coefficients, and regression equations. Amid the mathematical emphasis, almost no attention has been given to the basic scientific procedures used for planning research, obtaining data, and analyzing results. The aim of these essays is to provide supplemental reading for the many important topics that are omitted from conventional textbooks, and also some remedial reading for topics that usually receive inadequate consideration.
Because of the way the book has been assembled, it contains three features for which I apologize. The first is that the text regularly contains references to previous or forthcoming essays in the originally published series. Although useful liaisons for essays that were dispersed in time, many of the references will now appear in the wrong places in the rearranged series. Since the current text is identical to what originally appeared in the journal publications, these references could not be changed. The second flaw is that the original bibliographic citations have also, of necessity, been preserved at the end of each essay. This process, while making the citations easy to find, produces frequent redundancy in some of the listings. The third infelicitous feature is that certain ideas are mentioned repeatedly in different locations of the text. The repetition seemed desirable in a succession of individual essays spread over a 6-year period, but may be less appealing if the essays are read contiguously. I hope that readers will find these repetitions instructive rather than irritating.

In discussing the various challenges and imperfections of biostatistics, I have tried to keep the prose lively and have occasionally made it deliberately provocative. Most readers have said they enjoy this approach, but it has sometimes led to the accusation that I am antistatistical. This accusation has probably been [...] by anyone who has ever been discontent with the defects of any status quo. Like the established tenets of clinical medicine and epidemiology, the established creeds of statistics contain many infirmities. In pointing out the infirmities, I have always tried to offer constructive suggestions for improvement; and I would hardly want to spend so much effort working in the domain of clinical biostatistics if I did not respect both the clinical bio- and the -statistics portions.
To do the kind of thinking and writing that have produced these essays, I have had many sources of support for which I want to express thanks. The Veterans Administration, at its hospital in West Haven, provided research aid for many years while I was Chief of the Eastern Research Support Center and, later, of the Cooperative Studies Program Support Center. For my activities at the Yale University School of Medicine, the National Center for Health Services Research and Development supplied grants for several projects from which many of these essays emerged as by-products. During a highly productive period from 1971-1973, as a visiting professor, I received professional hospitality and illuminating stimulation from the Department of Clinical Epidemiology and Biostatistics at McMaster University Medical Center in Hamilton, Ontario, Canada. For the past few years, the essays have been composed in my work as Director of the Yale Clinical Scholar Program, which is sponsored by the Robert Wood Johnson Foundation.

In addition to this institutional aid, I have been greatly helped by human talents and contributions. Before submitting the prepared essays for publication, I have relied on thoughtful appraisal and stringent evaluation from critics who are clinicians, epidemiologists, statisticians, or computer experts. In acknowledging my gratitude for their valuable help, I also herewith absolve them of any responsibility for the contents. They are Linda Marean Feinstein, Michael Gent, Charles A. Goldsmith, Moreson H. Kaplan, Donald Mainland, Walter A. Ramshaw, David L. Sackett, Helen L. Smits, Walter O. Spitzer, and Carolyn K. Wells.
I am also especially grateful to Dr. Walter Modell, Editor of Clinical Pharmacology and Therapeutics, for his constant encouragement and for the editorial freedom he has provided. For excellent performance in the tasks of typing the difficult combinations of prose and mathematical symbols, I thank Elizabeth Tartagni, Carrol Ludington, and Pamela Rowe.

Finally, I want to thank my wife, Linda, and our children, Miriam and Daniel, who have gently tolerated the many hours in which I was absent or secluded while working on these essays, and who have filled the nonwriting hours with warmth, affection, and joy.

Alvan R. Feinstein
New Haven, 1977
CONTENTS

1 Introduction and rationale

SECTION ONE
THE ARCHITECTURE OF COHORT RESEARCH
2 Statistics versus science in the design of experiments
3 Components of the research objective
4 Intake, maintenance, and identification
5 Subsequent implementation of the objective
6 Sources of 'transition bias'
7 Sources of 'chronology bias'
8 Credulous idolatry and randomized allocation
9 Consequences of 'compliance bias'

SECTION TWO
OTHER ARCHITECTURAL PROBLEMS
10 Statistical malpractice - and the responsibility of a consultant
11 Random sampling and medical reality
12 The rancid sample, the tilted target, and the medical poll-bearer
13 Ambiguity and abuse in the twelve different concepts of 'control'
14 The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research
15 On the sensitivity, specificity, and discrimination of diagnostic tests

SECTION THREE
PROBLEMS IN MEASUREMENT
16 On exorcizing the ghost of Gauss and the curse of Kelvin
17 The derangements of the 'range of normal'
18 How do we measure 'safety' and 'efficacy'?
19 The difficulties of pharmaceutical surveillance

SECTION FOUR
MATHEMATICAL MYSTIQUES AND STATISTICAL STRATEGIES
20 Permutation tests and 'statistical significance'
21 The direction of relationships, hypotheses, and probabilities
22 Sample size and the other side of 'statistical significance'
23 Problems in the summary and display of statistical data

SECTION FIVE
THE ANALYSIS OF MULTIPLE VARIABLES
24 On homogeneity, taxonomy, and nosography
25 A primer of multivariate analysis
26 The purposes of prognostic stratification
27 The process of prognostic stratification
28 Evaluation of a prognostic stratification
29 Additional tactics in prognostic stratification

INDEXES
Index of Authors
Index of Methodologic Topics
Index of Clinical and Other Practical Examples
CHAPTER 1
Introduction and rationale

The "Clinical biostatistics" series began when I was invited to succeed Dr. Donald Mainland, who had retired from writing a bimonthly "column" on statistics for Clinical Pharmacology and Therapeutics. The first essay in the series contains my tribute to Dr. Mainland's many previous contributions to biostatistics and also describes the background philosophy with which the new series would be approached. The text was as follows.
—ARF
This chapter originally appeared as "Clinical biostatistics. I. A new name and some other changes of the guard." In Clin. Pharmacol. Ther. 11:135, 1970.

Donald Mainland can be succeeded but not replaced. His training, timing, and temperament have made him a unique figure in the domain of medical statistics, and a tough act to follow.

In training, he was graduated in medicine with honors in 1925 at Edinburgh, where he was later awarded the Doctor of Science degree for his research in embryology and histology. After finishing medical school, he taught anatomy at Edinburgh for several years and then went to Canada. He worked at Manitoba from 1927 to 1930, when he left to become Professor and Chairman of the Department of Anatomy at Dalhousie University. His first publication in 1927, dealing with an uncommon abnormality in a muscle,¹³ was a harbinger of his subsequent concern with frequency distributions in biology. Within the next two years, he was evaluating the accuracy of techniques for estimating irregular anatomic areas.¹⁴ Later on, during various embryologic investigations, he contemplated methods for assessing the size and volume of cellular structures.¹⁹ Over the next few years, he applied his quantitative interests to measuring the forces of muscles²⁰,²¹; and then, in 1934, with a paper on "Chance and the blood count,"²² the quantitative anatomist started his metamorphosis into medical statistician. By 1936, after some additional research on blood cells and blood counts, he had begun to write on "Problems of chance in clinical work."²³ In 1938, he produced his first book on quantitative medicine,²⁴ and twelve years later, after continuous productivity in both biologic research and medical statistics, he accepted New York University's invitation to become Professor of Medical Statistics. From that position, with persistent intellectual growth and enormous experience, he has continued to provide practical enlightenment to his students, consultees, colleagues, and readers.

In timing, Dr. Mainland became interested in biologic statistics during an era when the analytic techniques were in primitive stages of conception and dissemination. He knew many of the early heroes in the contemporary statistical pantheon, and he became a pioneer physician in developing the modern relationship between statistics and medicine. After the first edition of his classic book, Elementary Medical Statistics,²⁵ in 1952, he continued to produce a powerful array of creative, didactic, expository, and polemic publications on the use of statistics in medicine. With his textbook, now in its second edition,²⁶ and his many other writings, he has probably contributed as much as any single person to the statistical sensibility of clinical investigators in North America today.
In temperament, he has managed to preserve the extraordinary virtue of common sense, despite his constant exposure to the abstract concepts, arcane models, and intellectual folderol that lurk in the statistician's world. Part of this virtue is attributable to Mainland's firm rooting in the realities of medical biology. He has not merely preached about biostatistical research; he has practiced it. During the development of his statistical interests, he constantly maintained his activities in biologic investigation (contributing, among other items, a textbook on anatomy²⁷) and he currently continues an active role in several large-scale clinical research projects, chiefly in rheumatoid arthritis.

But the greater part of Mainland's virtue is probably attributable to the man himself. Now near the age of retirement, he remains young in mind, in spirit, and in outlook. What other "older man," venerated and respected as he nears completion of his major work, is ready to recognize that "repetition of this theme during two or three decades, by others as well as myself, has had very little effect"³¹; to confess that he is "technically unsophisticated"³²; to solicit disagreement and rebuttals to all of his comments; and to be receptive to new approaches for old problems? How many established "authorities" are brave enough to appraise their previous work with comments like these: "Grading all four items together today, I would award a C, or perhaps a C+, but nothing higher,"²⁸ or "I sometimes wonder how many more instances of stupidity I might dig up from the days when I was hypnotized by statistical techniques applied to pooled data."³⁰

My own first encounter with Mainland came in about 1960, when I discovered his publications entitled "Notes from a Laboratory of Medical Statistics," a group of documents still cherished by the recipients who were lucky enough to learn about the "Notes," and to satisfy Mainland's hardy standards for the mailing list. ("There is a limit of 3,000 to the number of 'Notes' that can be issued. . . . We are sorry that we can no longer replace 'Notes' that have disappeared after they have been received. . . . Agencies that require formal invoicing are also too much trouble to deal with.") These "Notes," which Mainland issued periodically whenever he found time to do so, were the ancestors of the more recent "Statistical ward rounds" in this Journal, and the "Notes on Biometry in Medical Research," which have appeared under the sponsorship of the Veterans Administration.

I still remember the enchantment of discovering those early "Notes." For several years previously, my own clinical research had brought me increasingly in contact with statistical procedures, and my manuscripts were being frequently sent to statisticians for review. Since many of the reviewers' comments were either clinically absurd or statistically incomprehensible, I had begun, in self-defense, to read textbooks on statistics. Like Mainland's, my education in statistics is largely self-acquired; but unlike most physicians, I was not intimidated by the arithmetic, since I had done graduate work in pure mathematics before entering medical school. What I found in the textbooks was sometimes enlightening, but more often appalling.
From my previous activities in pure mathematics and in biologic science, I had become accustomed to a rigorous type of either logical or empirical documentation for any assertion. In pure mathematics, such an assertion was called a theorem, and the rigorous documentation was a sequence of logically cohesive statements called a proof. In biologic science, the assertion was called a hypothesis, and the rigorous documentation was a collection of empirical data called observed evidence. But most of the statistical textbooks seemed to contain neither a logical nor an empirical documentation for the assertions. The texts were often like cookbooks, containing a series of instructive recipes on how to tabulate data and perform certain "tests of significance." These instructions were seldom accompanied by a proof of their validity, by any references to where a proof might be found, or by any empirical data to demonstrate that the procedures would remain valid when their prerequisite conditions were violated.

From time to time, I would explore the literature of mathematical statistics, looking for either the rational logic or the scientific evidence to support what appeared in the "cookbooks," but I was seldom successful. Even with a mathematical background, I could not understand many of the esoteric formulations; and my biologic background made me wary of the unrealistic assumptions that underlay many of the mathematical arguments.

I did not know at the time that some of these mathematical defects were so commonplace as to arouse public lament by distinguished statisticians. Said Harold Hotelling¹¹ in 1960:

The custom of omitting proofs, which would not be tolerated in pure mathematics beyond a very limited extent, is common in the teaching of statistics, and is excused on the grounds that the students do not know enough mathematics to understand the proofs. Perhaps a better reason is that in some cases the teachers, and the authors of the textbooks, do not understand the proofs. In some instances no proofs exist, and in some instances no genuine proofs can exist, because the methods taught are demonstrably wrong.

Aware of some of the many problems that pervade work in clinical medicine, I had expected to find that the cerebral grass would be greener in the statistician's yard. To my dismay, I found many weeds being cultivated and labeled as flowers.

Apart from my dissatisfactions with the absence of proofs for didactic assertions, I was disturbed by the lack of real attention to the intellectual consequences of the biologic component of "biostatistics." Here were men of high professional and intellectual competence. How could they so blithely ignore the effects of their erroneous assumptions that most clinical data came from "random samples" with "normal distributions" and "continuous variables"? How could they discuss the design of clinical experiments by extrapolating from a brewery vat or an agricultural field to a human population? How could they give so much emphasis to procedures for purely statistical analysis, while showing so little rigorous concern for such basic issues in scientific logic as specifying the fundamental question, determining whether the research would answer that question, choosing an appropriate control group, checking the reliability of the data, establishing reproducible criteria for subjective evaluations, and ascertaining whether the investigated population was both homogeneous enough for everyone to be "lumped" together and selected in a manner that justified the idea of "randomness"?

Wandering among statistical doctrines that often seemed neither mathematically validated, biologically cogent, nor intellectually challenged, I came upon Mainland's "Notes." The man seemed to know that biostatistics ought to pertain to biology, and he seemed to know about biology. He sounded like someone who had learned about research not by agglomerating theories of probability, or massaging data whose origins he had never observed, but by actually feeling a tissue, handling an animal, calibrating an instrument, looking through a microscope, or talking to a patient. An effective stimulant for the intellectual torpor of the textbooks, Mainland's "Notes" made biostatistics vivid, vital, and exciting. He brought into open view many of the critical issues that lay hidden beneath glib traditional preconceptions; he helped demonstrate that many statistical models were inappropriate and misleading for biology; and he provided a medium in which biologic scientists (insecure and anxious in their heretical suspicions about conventional statistical dogmas) could take comfort from seeing that other scientists and statisticians shared the same heresies.
Here, at last, was a statistician who could talk sensibly about clinical research. (He could also sometimes talk too long, but verbosity is an accepted occupational hazard of biostatisticians. Mainland was readily forgiven for occasional ventures into prolix prose, and his successor in these columns hopes that future readers will be equally tolerant.) Mainland was the first medical statistician I had encountered who acted as though "bio" were an integral part of "biostatistics," instead of a prefix attached casually to "statistics" for the sake of an occasional teaching exercise or a book intended for graduate students in biologic domains. Since that time, I have met a few other statisticians who have truly become biostatisticians, but Mainland remains a pioneer in the profession, both in migrating in the unusual direction from biology to statistics, and in exemplifying the modern fusion of biology with statistics.

Clinical investigators today owe him an inestimable debt of gratitude for the contributions he has made to our domain by preserving thoughtful realism in our statistical outlook. Many people believe that his book, Elementary Medical Statistics, could benefit from tighter organization and greater succinctness, but it is still the only such publication that gives at least as much attention to the medical issues as to the statistical maneuvers of medical statistics. As clinical investigators become increasingly involved in biostatistics, as we begin to appreciate its scope, accept its challenges, educate our statistical co-workers in its problems, and contribute creative solutions to those problems, Donald Mainland will remain one of our honored "founding fathers." He is a physician who helped establish the basic concept on which we must now build: the concept that biostatistics can best be developed neither from abstract theory in statistics nor from imprecise anecdotage in biology, but from a coordinated integration of perceptive observation and thinking in both.

In succeeding Dr. Mainland as master of ceremonies for these columns, I hope to preserve his basic outlook and philosophic standards, although I shall undoubtedly introduce some deviations of my own, because our specialized interests and training have been so different. His prestatistical domain was anatomy; mine has been clinical medicine. His basic work for almost two decades has been centered in a department of medical statistics; mine has been (and remains) centered in a department of internal medicine. He became intimately familiar with the basic mathematical precepts of statistical tactics; my acquaintance with many aspects of these precepts is tenuous, and I shall regularly ask my statistical colleagues for help when the discussions get into issues with which I am relatively unfamiliar. Since Dr. Mainland's packs of cards and barrels of discs have not been transmitted as a legacy of this job, I shall probably use a computer for many of the exercises in random number selection that he would have consigned to his trusty manual companions. I shall probably also call upon the computer for certain new activities that it now makes possible in modern biostatistics.
One of the main challenges will be to keep the column at least as interesting, informative, and provocative as Mainland made it. Connoisseurs of the Mainland style will recall that he often goads his readers deliberately, hoping to elicit further discussion. (Example: "I hope that I have trodden on some official corns hard enough . . . to initiate a foot-to-brain-to-hand reflex that will . . . produce a defense of their methodology."²⁹) My own tactics in provocation may be somewhat different and occasionally inadvertent, but I hope to preserve the principle that all of us need vigilant prodding to avoid or destroy complacency. The pace of science and technology has become too rapid for anyone to maintain prolonged intellectual comfort about established axioms, concepts, or other beliefs that have not been regularly subjected to intensive scrutiny and skeptical reappraisal.

To help augment the role of this column as a medium of vigorous communicative exchange and intellectual growth, I plan to invite various guests, either as proponents of their views or in rebuttal of mine, to become the "columnist" from time to time. The columns will be titled in a numbered sequence for my own papers, but a different designation will be used to accommodate other authors. I also hope that readers will frequently write to express their agreement or dissent about anything that appears here, and I would plan to have the letters (with the author's name omitted, if so desired) become a source of lively discourse in future columns. The problems with which we struggle in medical statistical work are too numerous and too important to be resolved without an abundance of argument. I hope that the arguments from all the people who contribute to these proceedings will be responsible, thoughtful, clearly written, and prepared in an atmosphere of light rather than heat, but arguments nonetheless.

Readers are invited not only to express opinions about what has appeared, but also to make suggestions about topics for future discussion. My ideas about choice of topics will come from several sources: (1) personal adventures during my own research activities; (2) review and occasional revival of ideas expressed previously by Dr. Mainland and by other people who have written about biostatistics; (3) stimulation from the projects encountered in my work at the Veterans Administration Research Support Centers, which are currently asked to help prevent or remedy biostatistical maladies in hundreds of biomedical research tasks each year; and (4) comments from readers. Those first three sources of input have already provided the topics planned for discussion in the next few columns, but the slots are open thereafter, and suggestions will be happily received.

One immediately obvious change is in the title of this column. Many leaders are "done in" by their successors, and Dr. Mainland should not have to worry about being blamed for my mistakes, misconceptions, or mischief. To give him that freedom of responsibility, and also to allow him to use "Statistical ward rounds" as a title for a possible book, the name of these essays has been changed to "Clinical biostatistics." I know that many readers will prefer the older title; the new one seems more formal, somber, and sesquipedalian, but it was the least of the available evils in nomenclature, and I hope that these new "rounds" will retain the appealing informality and free-wheeling intellectual fun of their ancestors.

There are more profound reasons, however, for a change that brings clinical into juxtaposition with biostatistics. In quantitative nomenclature, the statistics part of biostatistics occupies ten letters and the bio only three; the addition of eight clinical letters to the total phrase may help restore nominal as well as conceptual balance. More importantly, however, the domain of biostatistics is currently beset with many intellectual maladies that I believe can be remedied only if clinical biologists begin to make active contributions to the domain. These maladies, which arise not in the contents of statistical thinking but in the way statistical concepts are applied to other disciplines, have recently become subjects of public comment by leading statisticians:
John W. Tukey (38): Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made more precise.

J. G. Skellam (37): Valuable information which affects common-sense judgments tends to be ignored when formal statistical tools are employed along conventional lines. . . . Surprisingly little attention is normally given to what is often a much more serious source of error and deception, the defects of the model itself. . . . There is an important difference of emphasis between the application of mathematics to biology, and the mathematization of biology, and it is the latter which needs the most encouragement, for it is here that the real difficulties lie. . . . I am somewhat disturbed by the thought that the exalted status of mathematics might possibly exercise an unintentional brand of tyranny over other ways of thinking.

William Feller (5): I do not criticize statistical theory as such or the proper uses of statistics. . . . The trouble is that these methods are often used thoughtlessly and routinely by researchers for purposes for which they were not intended. . . . In biologic experimental work, for instance, a common abuse is to use a statistical test to try to "prove" a hypothesis. . . . The scandal is that the "significant" results are published as though they had meaning. . . . All too frequently statisticians impose all kinds of nonsensical conditions on the poor biologist or psychologist, conditions which, although they produce unequivocal statistical results, actually hinder him in his research.

Other discontents have recently been expressed by statisticians about the methods used to prepare consultants for the role they have been playing for more than 40 years, ever since R. A. Fisher's (6) epochal book made biologists begin the frequent search for statistical advice. Among the comments have been the following:

M. Zelen (39): I attribute this (undesirable) attitude to the way that statistics is usually taught: largely as a mathematical discipline of great intrinsic interest imparted to talented students who unfortunately have rarely had proper training in natural science or first hand experience of scientific research. . . . The design of experiments as taught in most schools, seems so far removed from reality, that a heavy dose may be too toxic with regard to future applications. . . . It is absolutely vital that the future biometrician spend part of his training as a biometrician-in-residence at an institution or laboratory where active scientific work is being conducted.

C. P. Cox (1): If statistics is to retain the identity of a discipline . . . the inter-connections of statistics and research in its "user" disciplines must be continually developed. . . . Besides training in statistics, an aspirant statistical consultant should receive complementary and systematized, as distinct from casually acquired, training in the disciplines in which he is expected to consult. . . . All the scientists concerned may be advantageously encouraged to scrutinize and clarify their ideas on scientific method and to challenge purely statistical inferences whenever these are unconvincing.

Complaints about the status quo have also come from computer experts trying to implement some of the existing mathematical approaches and models. Said R. W. Hamming (10): I have been repeatedly shocked to find out how often I thought I knew what I was talking about; but that in the acid test of describing explicitly to a machine what was going on I was revealed to have been both ignorant and extremely superficial. It is this many-times-repeated experience that has led me to assert that mathematics has often chosen to ignore the careful examination and exposition of the methods it uses. . . . A teacher of biochemistry does not find it intolerable to say, "I don't know." Nor does a physicist. . . . Why should not statisticians do the same?

The direction of communication
What is surprising about all these critical comments is that they have come from statisticians rather than from clinicians or other recipients of the statistical consultations. In an era in which patients have been increasingly vocal in complaints about the services received from their clinical consultants in problems of medicine, clinicians have been notably silent in commenting on the "quality of care" in the interchange that occurs when the "patient" is a clinician coming to a "doctor" who is a statistical consultant in problems of research.

Clinicians have had many of these consultative encounters. Courses on statistics are now offered in the curricula of most medical schools. Editors of many medical, psychologic, and other journals will routinely request that suitable manuscripts be passed by a statistical censor. Proposals for large-scale clinical trials are not only often designed by statisticians, but also must be approved by statisticians before the project is funded.

During these many interplays of statistics and medicine, however, the path of consultative enlightenment has remained unidirectional. Thus, although the manuscripts and contents of medical journals are regularly subjected to critical statistical appraisal, almost no evidence of exposure to clinical reviewers can be found in the many medically oriented papers that regularly appear in such periodicals as Biometrics, Biometrika, and the Journal of the American Statistical Association. Many critiques have been published on the unsuitable or incorrect statistical methods used for papers that have appeared in the medical literature (24 8 33 3i), but I am not aware of any comparable critiques of the inappropriate or sometimes bizarre medical assumptions contained in papers that appear in the statistical literature. Innumerable books have been written on the general topic of elementary statistics for clinicians, but no one has written a clinical primer for statisticians.

It is time that clinicians began to widen this narrow path of communication and to inform our colleagues of the statistically pertinent knowledge that was learned during those many years in hospital wards and clinical practice. Although the statistician has spent many years in graduate school getting his Ph.D. degree, and can tell us a great deal about what he discovered during that time and afterward, the clinician has spent many years getting not only his M.D. degree, but also such additional postdoctoral "degrees" as F.A.C.P. or F.A.C.S. If clinicians need to know about the mystic statistic, statisticians might benefit from discovering the clinical pinnacle. We all have much to teach each other.
The composition of "statistics"

One of the first main steps in this process of mutual education is to recognize that "statistics" is a composite domain, containing at least two distinctly different intellectual activities: (1) the acquisition, logical organization, and numerical presentation of data, and (2) the analysis of the data to arrive at decisions about degrees of variation, interrelation, and difference. The first type of activity is often called descriptive statistics; it produces the collections of data that appear in baseball batting averages, in financial charts, in the birth rates and death rates of "vital statistics," and in the many graphs, tables, and other numerical expressions of biomedical projects ranging from molecular explorations to therapeutic surveys. The second type of activity, sometimes cited as "inferential statistics," will here be called inductive statistics; it is responsible for such calculations as correlation coefficients and linear and multivariate regression, and it provides the t tests, chi square techniques, P values, confidence intervals, analysis of variance, and other procedures used to test "statistical significance."

*To avoid ambiguity, let me define a clinician as a member of one of the healing professions (such as medicine, osteopathy, and clinical psychology) who takes direct responsibility for the care of living patients, or who has spent substantial amounts of postgraduate time (more than an internship) in developing his skillful knowledge of such activities. The clinician may be in private practice, academic research, or administrative work, but his distinguishing characteristic is a background of observational and therapeutic experience in dealing with sick people. Although an M.D. degree is sometimes regarded as the hallmark of a clinician, many M.D.'s (such as anatomists, biochemists, epidemiologists, microbiologists, pathologists, "clinical" pathologists, pharmacologists, and physiologists) may have neither the training nor the functional responsibilities of clinicians. This definition is intended only to clarify what I am talking about, and has no pejorative connotations in any direction.

Although these two activities are regularly intermingled in biostatistics, there are drastic differences in the content of each activity, in the prerequisite background, and in the performance. In content, descriptive statistics deals with observations of actual substances and phenomena, classified with concepts derived from previous observations, and prepared for the sake of direct information, comparison, and application. In this real world, the descriptive statistician organizes his tabulations, histograms, averages, rates, and other numerical statements that summarize the observed facts. Inductive statistics, in contrast, is based on idealized assumptions about linearity of relationships and continuity of variables; on theoretical concepts of probability; and often on the abstract idea of taking random samples an indefinite number of times from a vast uninspected population whose distribution is deemed "normal," but not demonstrated to be, and whose true "parameters" are estimated but never known. In this imaginary world, the inductive statistician develops models that are used in the world of reality to express the association of events and to quantify the contributions of chance.

In prerequisite educational background, descriptive statistics requires no particular scholastic training in "statistics" and is regularly performed by intelligent people whose main statistical skill is only the ability to keep careful records. For example, automobile drivers engage in descriptive statistics whenever they check their gasoline mileage. For a complex scientific domain such as clinical biology, the procedures of descriptive statistics have become an integral part of the domain and are often learned concomitantly, without conscious effort, as one learns the contents of the domain. Inductive statistics, on the other hand, requires intensive deliberate study; and for most people who take courses in "statistics," the main motive is to learn the various analytic maneuvers and tests. Most elementary courses in statistics contain a brief homage to the medians, means, standard deviations, frequency polygons, and other numerical tactics of descriptive statistics, but the instructor thereafter usually concentrates on the theories of probability, inferential strategies, and various analytic procedures that are considered the major domain of the trained statistician. Thus, most 12-year-old boys in the United States could state the descriptive statistical methods for determining baseball batting averages, but few boys (or grown men) would know how to decide statistically that one player's average is significantly higher than another's.

The antipodal differences between these two types of activity (the apparent ease of learning and understanding the real world data of descriptive statistics, in contrast to the theoretical training and mathematical efforts necessary to comprehend the methods of inductive statistics) can be held responsible both for the consultative role of statisticians in modern biologic science, and also for the many intellectual failings of contemporary biostatistics. Familiar with methods of producing descriptive statistics in biology, the clinical investigator tends to disregard their importance and take them for granted. When he seeks statistical advice, he usually wants management of the analytic procedures that he does not comprehend. Unfamiliar with the basic logic and data of the descriptive medical statistics, unchallenged mathematically by their contents, and somewhat awed by the associated mystique, the statistician often accepts the clinician's request, performs the tests, and hopes the clinician will be sensible enough to ignore results that are inappropriate.

During this collaboration, both men engage in the self-delusions that create the major fallacies of biostatistics. The clinician, forgetting the importance of his own contribution to the logic and data of the research, becomes mesmerized by what he does not understand: the statistical analyses. He assumes that the statistical computations will somehow validate the activities, rectifying errors in observation and correcting distorted logic. The statistician, believing that problems in the basic logic and data have already been resolved or are unresolvable, becomes oblivious to what he does not understand: the clinical background of the descriptive statistics. He accepts the data as presented, and he concentrates on the way he will fit them into his array of analytic statistical maneuvers. The ensuing results may satisfy the clinician, the statistician, the editor, the granting agency, and the reader, but what really emerges is often an elaborately analyzed "statistically significant" collection of bad logic and bad data whose scientific deficiencies are not merely neglected but actually embellished and convoluted amid the mass of numbers and statistical tests.

Let us consider a nonmedical illustration. Suppose the administrative advisor of a major league baseball team must decide what salaries should be offered to the team next year. He searches for a way of evaluating the worth of the players and chooses the batting average as a useful index. An enlightened man, he knows that chance may enter into a batter's success, and so he performs the statistical analyses of differences among players. He finds that Player A's average was higher than Player B's, with a P value of < 0.001, and that Player B's average was higher than Player C's, but with a P value of only < 0.2. The first of these statistical differences is "highly significant"; the second is not. What should the advisor conclude about the relative value of these three players to the team?

The answer, of course, is that he should draw no conclusions. The statistical analysis was silly, because batting averages alone are not an appropriate test of a player's worth, no matter what the P values show. To make a really sound decision would require attention to many more basic items of descriptive data that are statistically omitted from the batting averages. How much does each player contribute to the general spirit and morale of the team? Are they "colorful" players who will lure spectators to the box office? Is B a relief pitcher who has saved many important games, whereas A is an outfielder? These and many other questions would immediately be contemplated by any connoisseur of baseball who is asked to evaluate players. The P values produced by the statistical analysis would be dismissed as unimportant, and the main judgment would depend on many other features of the players and of their batting activities.

Suppose, however, that our administrative advisor is not a connoisseur of baseball and, in fact, does not understand the game at all. His reputation as a consultative wizard in athletics may have been based on his previous success in a sport like horse racing, where the horse's performance can always be quantitatively evaluated from such numerical indexes as time and finishing position in races. Upon asking a connoisseur of baseball for "instructive" (i.e., numerical) help about evaluating players, the advisor may be greatly disappointed to receive such nonquantitative phrases as "spirit," "colorful," and "saving games." Knowing that the analysis of dimensional information has worked so well in his previous sporting activities, the advisor decides to ignore the "soft" verbal descriptions and to base his evaluations on the "hard" numerical data of the batting averages. The owner and manager of the team may be somewhat queasy about the idea, but they go along with it and say nothing because they do not want to seem ignorant about P values, and besides, they are reluctant to doubt the word of an eminent consultant who has been widely acclaimed for his success in applying statistical analysis to athletics.

The situation just described would probably never happen in baseball. The game is too well known to too many people, and even if the owner and manager were foolish enough to entrust their evaluation of players to this type of statistical analysis, they would probably soon rescind the action because of the roars of outrage, laughter, and protest that would ensue from sportswriters and baseball fans. But the activities of biomedical research, alas, are not so well known, and counterparts of the foolishness just described constantly occur, although less obviously, in the activities of modern biostatistics.
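The advisor's calculation in the baseball illustration can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the hit and at-bat totals are invented, the function names are hypothetical, and the pooled two-proportion z test (normal approximation) is one plausible choice of test, since the text does not say which test the advisor used. The first function is the descriptive step; the second is the inductive step.

```python
import math


def batting_average(hits, at_bats):
    """Descriptive statistic: the proportion of at-bats that produced a hit."""
    return hits / at_bats


def two_proportion_p_value(hits_1, at_bats_1, hits_2, at_bats_2):
    """Two-sided P value for the difference between two proportions.

    Assumption: pooled normal-approximation z test, one common choice among several.
    """
    p1 = hits_1 / at_bats_1
    p2 = hits_2 / at_bats_2
    pooled = (hits_1 + hits_2) / (at_bats_1 + at_bats_2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / at_bats_1 + 1 / at_bats_2))
    z = (p1 - p2) / standard_error
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical season totals as (hits, at-bats); these numbers are invented.
players = {"A": (210, 550), "B": (150, 550), "C": (130, 550)}

for name, (hits, at_bats) in players.items():
    print(f"Player {name}: average {batting_average(hits, at_bats):.3f}")

p_a_vs_b = two_proportion_p_value(*players["A"], *players["B"])
p_b_vs_c = two_proportion_p_value(*players["B"], *players["C"])
print(f"A vs. B: P = {p_a_vs_b:.4f}")
print(f"B vs. C: P = {p_b_vs_c:.4f}")
```

With these invented totals the first P value falls below 0.001 and the second falls between 0.05 and 0.2, matching the pattern in the illustration. Nothing in the computation knows whether a player is a relief pitcher, a crowd favorite, or a source of team morale, which is the point of the example: the inductive step can be carried out flawlessly on descriptive data that do not measure a player's worth.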
Violations of scientific principles

The situation just cited contained violations of three elementary principles of scientific analysis: (1) the neglect of heterogeneity of a population during the process of comparing its members; (2) the use of a single property as the main index for assessing multiple manifestations; and (3) the application of a statistical test to prove a "significance" that may be meaningless or unimportant. These and many other departures from sound scientific reasoning constantly occur during biostatistical work because an infatuation with inductive statistics has overwhelmed a careful appraisal of descriptive statistics. Examples of these violations are so abundant in biomedical statistics that the singling out of any specific situation would be unfair to the many outstanding candidates for selection. The illustrations here will be confined to problems that often occur in the evaluation of therapeutic trials.

1. Neglected heterogeneity. In many statistically designed trials of therapy for major chronic diseases, all the patients with that disease are regularly "lumped" together for the allocation of treatment. When the results are later reported for the total group of patients, the clinician has no way of knowing whether the compared therapeutic agents had the same effects in the good prognostic risks as in the bad, or whether patients with different degrees of clinical severity responded differently. Because heterogeneous patients have been statistically managed as homogeneous, the results of an elaborate, expensive trial may have little or no value for future clinical application.

2. Univariate responses. Since the performance of statistical tests requires that a specific index be chosen for evaluation, the response to treatment in a clinical trial is regularly expressed in terms of a single property (or "variate"), such as survival time or white blood count. When treatment is evaluated in the real world by a clinician, of course, the assessment of response is based on a multitude of properties, including the patient's pain or incapacitation and the treatment's side effects and inconvenience. These clinical properties, however, are often unattractive to a statistician because they represent subjective appraisals expressed in qualitative verbal phrases. Even if these "soft" clinical data were to be accepted for analysis, however, the importance of each property could not easily be "weighted" for combination into a single index. If the properties were not combined, and were analyzed separately as individual indexes of therapeutic response, the statistical results would contain an array of many tabulations, tests, and P values, instead of a single neat answer. The reader would have to review the reports of the diverse responses, and might have to make his own decision about which variates are important, based on clinical and biologic values, rather than statistical tests. To avoid this type of messy imprecision, the statistician opts for the neatness of analyzing a single univariate response; the investigator agrees; and another clinical trial produces simplistic results that have little pertinence to the complex world of clinical reality.

3. Spurious significance. The biostatistical malpractice committed with tests of "significance" has become, as noted earlier (5), a scandal, and the phrase "statistical significance" has become such a malignant mental pathogen that major efforts to excise it will be undertaken here in a future discussion. Let it suffice for the moment to note that a test of statistical significance tells nothing about the quality of thought, planning, or execution in the work; nothing about the biologic or clinical meaning of the difference in numbers; and nothing about whatever has allegedly caused the difference. Too often, however, the inappropriate word significance has served to intoxicate a research worker with the belief that his preconceptions have been confirmed, to divert an editor or reviewer from carefully contemplating the logic and data, and to delude a reader into believing that the value, importance, and meaning of the research have somehow been authenticated because the results are "statistically significant." In the planning of clinical trials, the number of patients needed for "statistical significance" has currently become a major focus in the thoughts devoted to "design," and the calculation of this number is sometimes the acme of the biostatistician's contribution. To manipulate the levels of "significance," of course, the biostatistician must often neglect the heterogeneity of the patients and must base his calculations on univariate appraisals of response, but the disregard of science and of clinical applicability seems unimportant as long as the number that emerges from the calculations holds the glittering promise of "statistical significance."

Sources of difficulty

No single simple cause can be blamed for the lamentable state of affairs that produces the cited violations and many other biostatistical infractions of scientific principles of research. The statistician is not really responsible for the difficulties. He is doing the best he can with what he knows, but he often makes the false assumption that what he does not know about the subtleties of clinical biology will be amply managed by his clinical colleagues. The clinical investigator is not really responsible, either. He has been told that his complex clinical problems will be solved by statistical help, so he gets it, but he often makes the false assumption that inductive statistical sagacity eliminates the need for interpretive clinical wisdom. The consequence of this folie à deux is a mutual belief that the main targets of clinicostatistical research are the rigid digits of statistical analysis, rather than the valid facts of biologic science. Instead of giving basic scientific attention to the logic and data of the research, the collaborators become obsessed with the quantitative lure of the statistics.

How can the situation be improved? The problems have obviously not been solved by statisticians' many efforts to educate clinicians. As a result of these and other efforts, clinical investigators now regularly use such statistically approved procedures as control groups, random allocations, and double-blind techniques. But the "controls" are often chosen inadequately; the random allocations are often meaningless; and the double-blind techniques often yield amaurotic results from which no one can discern what has been accomplished. Perhaps the educational efforts of the past few decades have been too unidirectional. Perhaps clinicians, instead of being purely passive recipients of statistical consultation, should now become active contributors in a process of intellectual exchange that enlightens both the statistical and clinical participants.

What contribution can clinicians make to help convert statistical art into biostatistical science? Since the greatest knowledge at the clinician's disposal is his familiarity with the data and patterns of events that describe the biologic realities of nature, the clinician can teach the artful inductive statistician about the science of descriptive statistics. Most statisticians and many mathematicians properly regard themselves as creative artists (9), since they seek their challenges not in nature but in abstract systems or "spaces" conceived as acts of artistic imagination. The theories created in this statistical art are not a part of science unless they relate to nature, with concepts and tactics that either emerge from nature, or that are compatible with the events of nature. As discussed earlier, the statistician's customary education has given him little or no direct observational experience with which to develop his perception of these natural realities. His postgraduate activities may also not expand his scientific vision, because he may persist in playing the Procrustean role for which he was trained, a role of choosing and altering the data to fit his tests, instead of choosing or altering his tests to fit the data.

Anyone who doubts that last remark should review the taxonomic arrangement of topics in textbooks on statistics. If an investigator has a problem that requires testing the difference between two collections of nominal, ordinal, or metric data, he will be unable, with rare exception (7, 36), to find a statistical textbook that arranges its topics according to the types of data. The topics are almost all arranged according to the available statistical tests, not according to the characteristics of the data. Thus, if the investigator already knows what test to choose for a particular problem, he can find the test. (An analogous type of backward logic occurs in the organization of many textbooks on "physical diagnosis" and other aspects of diagnostic medicine. Students often find the textbooks frustrating because they are organized according to "disease," although a student begins his diagnostic search by observing symptoms and signs. To read about the diagnostic significance of the symptoms and signs, the student is forced to begin by knowing what diagnoses to look under.)

Since the statistician has seldom been trained or provoked to enter the natural world of observation, he has really had no way of becoming familiar with the complex data and intricate logic that describe that world. He has had to rely on clinical biologists to make the observations, to arrange their logic, and to explain their scientific significance. And we have often not done so. If the statistician has built castles in the air, he is not to blame; clinicians have not only failed to put the castles on a sound foundation, but have moved into the illusive castles and extolled the architecture.

Plans for the future

In future columns of "Clinical biostatistics," I hope we can provide some solid ground on which clinicians and statisticians can join to build a better structure for the future. If an inductively oriented statistician (or a perceptively realistic clinician) believes that descriptive statistics is an unsatisfactory title for this activity, let us call it something else. One of the most distinguished contemporary statisticians, John W. Tukey, has proposed an excellent new term, data analysis. Tukey (38) has also proposed some excellent rules and plans for the new meeting ground:

We should seek out wholly new questions to be answered. . . . We need to tackle old problems in more realistic frameworks. . . . We should seek out unfamiliar summaries of observational material, and establish their useful properties. . . . It can help, throughout this process, to admit that our first concern is with "data analysis". . . . To the extent that pieces of mathematical statistics fail to contribute, or are not intended to contribute, even by a long and tortuous chain, to the practice of data analysis, they must be judged as pieces of pure mathematics, and criticized according to its purest standards. Individual parts of mathematical statistics must look for their justification toward either data analysis or pure mathematics. Work which obeys neither master . . . cannot fail to be transient, to be doomed to sink out of sight. And we must be careful that, in its sinking, it does not take with it work of continuing value. . . . Finally, we need to give up the vain hope that data analysis can be founded upon a logico-deductive system like Euclidean plane geometry . . . and to face up to the fact that data analysis is intrinsically an empirical science. . . . It will still be true that there will be aspects of data analysis well called technology, but there will also be the hallmarks of stimulating science: intellectual adventure, demanding calls upon insight, and a need to find out "how things really are" by investigation and the confrontation of insights with experience. . . . The future of data analysis . . . (depends on) our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachment.

At our next meeting, two months from now, the topic for discussion will be the design of experiments. In this domain of experimental design, which has long been regarded as the province of inductive statistics, clinical biologists can make major contributions to the future of scientific data analysis.

References

1. Cox, C. P.: Some observations on the teaching of statistical consulting, Biometrics 24:789-801, 1968.
2. Douglas, A. S.: Anticoagulant therapy, Philadelphia, 1962, F. A. Davis Company.
3. Feinstein, A. R.: Standards, stethoscopes, steroids, and statistics. The problem of evaluating treatment in acute rheumatic fever, Pediatrics 27:819-828, 1961.
4. Feinstein, A. R., and Spitz, H.: The epidemiology of cancer therapy. I. Clinical problems of statistical surveys, Arch. Intern. Med. 123:171-186, 1969.
5. Feller, W.: Are life scientists overawed by statistics? Scientific Research 4:24-29, 1969.
6. Fisher, R. A.: Statistical methods for research workers, Edinburgh, 1925, Oliver & Boyd, Ltd. (Ed. 13, 1963.)
7. Freeman, L. C.: Elementary applied statistics, New York, 1965, John Wiley & Sons, Inc.
8. Gifford, R. H., and Feinstein, A. R.: A critique of methodology in studies of anticoagulant therapy for acute myocardial infarction, New Eng. J. Med. 280:351-357, 1969.
9. Halmos, P. R.: Mathematics as a creative art, Amer. Scientist 56:375-389, 1968.
10. Hamming, R. W.: Numerical analysis vs. mathematics, Science 148:473-475, 1965.
11. Hotelling, H.: The teaching of statistics, chap. 3, in Olkin, I., editor: Contributions to probability and statistics (Essays in honor of Harold Hotelling), Stanford, Calif., 1960, Stanford University Press.
12. Mahon, W. A., and Daniel, E. E.: A method for the assessment of reports of drug trials, Canad. Med. Ass. J. 90:565-569, 1964.
13. Mainland, D.: An uncommon abnormality of the flexor digitorum sublimis muscle, J. Anat. 62:86-89, 1927.
14. Mainland, D.: The technique of estimating small irregular areas in biological research, with notes on the tests of accuracy, J. Anat. 63:345-351, 1929.
15. Mainland, D.: A study of the sizes of nuclei in ovarian stroma, Anat. Rec. 48:323-340, 1931.
16. Mainland, D.: The measurement of ferret pronuclei, Trans. Roy. Soc. Canad., 3rd series, 25: Section V, 9 pages, 1931.
17. Mainland, D.: A quantitative study of the polar body of the ferret, with a note on the second polar spindle, Amer. J. Anat. 47:195-240, 1931.
18. Mainland, D.: The volumes of ferret ova, with special reference to the methods of determination, Anat. Rec. 49:103-120, 1931.
19. Mainland, D.: The sizes of ferret pronuclei, Anat. Rec. 50:53-83, 1931.
20. Mainland, D.: The forces exerted by individual human muscles, with special reference to errors in muscle measurement, Trans. Roy. Soc. Canad., Section V, pp. 265-276, 1933.
21. Mainland, D.: Forces exerted on the human mandible by the muscles of occlusion, J. Dent. Res. 14:107-124, 1934.
22. Mainland, D.: Chance and the blood count, Canad. Med. Ass. J. 31:656-658, 1934. (Edit.)
23. Mainland, D.: Problems of chance in clinical work, Brit. Med. J. 2:221-224, 1936.
24. Mainland, D.: The treatment of clinical and laboratory data: An introduction to statistical ideas and methods for medical and dental workers, Edinburgh and London, 1938, Oliver & Boyd, Ltd.
25. Mainland, D.: Elementary medical statistics: The principles of quantitative medicine, Philadelphia, 1952, W. B. Saunders Company.
26. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
27. Mainland, D.: Anatomy as a basis for medical and dental education, New York, 1945, Paul B. Hoeber, Inc., Medical Book Division of Harper & Row, Publishers.
28. Mainland, D.: Notes from a laboratory of medical statistics. Note 20, pp. 8-9, October 24, 1961.
29. Mainland, D.: Notes from a laboratory of medical statistics. Note 104, p. 3, March 16, 1965.
30. Mainland, D.: Notes from a laboratory of medical statistics. Note 116, p. 4, May 13, 1965.
31. Mainland, D.: Notes on Biometry in Medical Research. Veterans Administration Monograph 10-1. Suppl. 7, p. 1, June, 1969.
32. Mainland, D.: Statistical Ward Rounds-16, Clin. Pharmacol. Ther. 10:576-586, 1969.
33. Saiger, G. L.: Errors of medical studies, J. A. M. A. 173:678-681, 1960.
34. Schor, S.: Statistical reviewing program for medical manuscripts, Amer. Statis. 21:28-31, 1967.
35. Schor, S., and Karten, I.: Statistical evaluation of medical journal manuscripts, J. A. M. A. 195:1123-1128, 1966.
36. Siegel, S.: Nonparametric statistics for the behavioral sciences, New York, 1956, McGraw-Hill Book Company, Inc.
37. Skellam, J. G.: Models, inference, and strategy, Biometrics 25:457-475, 1969.
38. Tukey, J. W.: The future of data analysis, Ann. Math. Statist. 33:1-67, 1962.
39. Zelen, M.: The education of biometricians, Amer. Statis. 23:14-15, 1969.
SECTION ONE
THE ARCHITECTURE OF COHORT RESEARCH

Although most statistical interpretations of research depend on the idea of an experiment, most medical research is not experimental. The people assembled for the research are almost never chosen randomly from the groups they allegedly represent; the "causal" agents under comparison are seldom contrasted concurrently; and the agents are not assigned according to a suitable prearranged plan. Because most clinical and epidemiologic investigations are conducted as surveys, not experiments, major problems of bias or distortion can occur when the compared groups are assembled and when their results are analyzed. Even when the research is experimental, however, such problems can still appear because of flaws in chronologic aspects of design or because of vicissitudes in the action of fate during random choices or random assignments.
Although diverse mathematical models have been developed for the statistical analysis of research data, no corresponding models have been available for the scientific architecture of the research itself. The first few essays in this section are concerned with the establishment of an appropriate scientific model for showing the research structure of a cause-effect relationship. An entity in some defined baseline condition or initial state is exposed to a principal maneuver, which is the causal agent or effector. The outcome is then observed in the subsequent state. For contrast, another entity, in a similar initial state, is exposed to a comparative (or "control") maneuver.
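The model just described can be written down as a small data structure. The following Python sketch is only an illustration: the class name `Observation`, the function `outcome_rate`, and the six invented observations are assumptions introduced here, not data or notation from the essays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One entity followed forward in time through a single maneuver."""
    initial_state: str      # baseline condition before the maneuver
    maneuver: str           # "principal" or "comparative" (the "control")
    subsequent_state: str   # outcome observed after the maneuver


def outcome_rate(cohort, maneuver, outcome):
    """Proportion of entities exposed to `maneuver` that reached `outcome`."""
    group = [o for o in cohort if o.maneuver == maneuver]
    if not group:
        raise ValueError(f"no entities received maneuver {maneuver!r}")
    return sum(o.subsequent_state == outcome for o in group) / len(group)


# Invented cohort: every entity starts in a similar initial state.
cohort = [
    Observation("diseased", "principal", "improved"),
    Observation("diseased", "principal", "improved"),
    Observation("diseased", "principal", "not improved"),
    Observation("diseased", "comparative", "improved"),
    Observation("diseased", "comparative", "not improved"),
    Observation("diseased", "comparative", "not improved"),
]

principal = outcome_rate(cohort, "principal", "improved")
comparative = outcome_rate(cohort, "comparative", "improved")
print(f"principal: {principal:.2f}, comparative: {comparative:.2f}")
```

The comparison of the two rates is meaningful only if the initial states of the two groups are truly similar, which is the architectural requirement that the rest of this section examines.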
The model is quite simple and direct, since it takes the exact form of the normal "anatomy" and "physiology" of an experiment. The model is directly applicable to any form of cohort research, whether etiologic or therapeutic, survey or experiment, in which the observed groups are followed forward in time (or "longitudinally"), being investigatively pursued in the customary scientific direction from imposition of a "cause" to occurrence of an "effect."

If the groups consist of people, however, diverse features of human and medical life can produce biases that create "pathology" in the research structure. As people traverse the pathway from anonymity at home to immortality as biostatistical units, the biases may distort the comparison of the maneuvers by altering the baseline equality of the compared groups, the performance of the maneuvers, the detection of the outcome events, or the chronologic duration of persistent observation. The act of allocating the compared maneuvers by a randomization process is often helpful, but it provides no guarantee that these difficulties will be avoided. In fact, randomized allocation can sometimes increase the problems by lulling the investigator into a state of false serenity.
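The point that randomization gives no guarantee of baseline equality can be checked with a short calculation. The Python sketch below is an illustration under assumed numbers (20 patients, half of them with a poor prognosis, allocated at random into two arms of 10); the function name `imbalance_probability` and all of the figures are hypothetical and do not come from the essays. It uses the hypergeometric distribution to find how often chance alone splits the poor-prognosis patients unevenly.

```python
from math import comb


def imbalance_probability(n_patients, n_poor, arm_size, min_gap):
    """Chance that simple random allocation leaves the two arms differing
    by at least `min_gap` in their number of poor-prognosis patients."""
    total = comb(n_patients, arm_size)        # all possible allocations
    n_good = n_patients - n_poor
    lowest = max(0, arm_size - n_good)        # fewest poor-prognosis patients possible in arm A
    highest = min(n_poor, arm_size)           # most poor-prognosis patients possible in arm A
    prob = 0.0
    for k in range(lowest, highest + 1):      # k = poor-prognosis patients in arm A
        gap = abs(k - (n_poor - k))           # difference between arm A and arm B
        if gap >= min_gap:
            prob += comb(n_poor, k) * comb(n_good, arm_size - k) / total
    return prob


# Assumed example: 20 patients, 10 with poor prognosis, two arms of 10.
p = imbalance_probability(n_patients=20, n_poor=10, arm_size=10, min_gap=4)
print(round(p, 3))
```

Under these assumptions, a split of 7 versus 3 or worse arises in somewhat less than one fifth of all allocations, so an investigator who relies on randomization alone, without checking the prognostic composition of the arms, can easily be comparing unequal groups.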
All of the cited hazards of cohort research can occur when the research format is altered, so that the groups are pursued in a "cross-sectional" or "retrospective" rather than forward direction. The changes in direction, however, also introduce some separate additional problems that will be discussed in Section Two.
CHAPTER
2
Statistics versus science in the
design of experiments
This chapter originally appeared, with the same name, as "Clinical biostatistics II" in Clin. Pharmacol. Ther. 11:282, 1970.

The design of experiments has been a favorite concern of statisticians ever since R. A. Fisher, 10 after emphasizing the importance of statistics in research, published a second book in 1935, entitled The Design of Experiments, 17 and stated that "a clear grasp of simple and standardized statistical procedures will ... go far to elucidate the principles of experimentation." Fisher's book, now in its eighth edition, 18 has been followed by many other statistical texts and papers on this topic. With minimal effort, I was able to find 16 current statistical books, 1-7, 15, 24, 26, 28, 30, 31, 35-37 written in English, with "experimental design" or some congener in their titles. In a recent bibliographic review 23 of the topic, the authors cited about 800 items published since 1957.

After inspecting the names of these 800 publications and the tables of contents in the textbooks, a clinical investigator might wonder about the kind of knowledge necessary to plan scientific experiments dealing with sick people and human populations. The topics in the statistical texts are generally devoted to randomized blocks, Latin squares, Graeco-Latin squares, Youden squares, lattice arrangements, partial confounding, analysis of variance, analysis of covariance, and tests of "significance." The papers in the bibliographic review are classified in such groups as factorial designs, response surface designs, designs for nonlinear models, and serial designs. The books and papers are thus concerned mainly with statistical tactics in analyzing data, and little or no attention is given to such clinical biologic problems as defining the goal of the experiment, choosing the experimental material, and validating the results.

Amid the intensely statistical discussions contained in this literature on experimental design, there occasionally appears a modest scientific warning, such as "uniformity is the only requisite between the objects whose response is to be contrasted," 20 but the warning is not accompanied by a description of scientific methods for assessing "uniformity" or appraising "response." One writer even makes the experimenter's role wholly subservient to the statistician's by stating that "the purpose of an experiment is to produce a sample of observations which will furnish estimates of the parameters of the population together with measures of the uncertainty of these estimates." 34

Many years of exposure to these purely mathematical ideas about experimental work may have obscured the realization that for most scientists the purpose of experiments is to get answers to questions. A scientist's main concern is not the idealized elegance of a mathematical design, but a realistic plan for asking an important question in a way that will yield a reliable answer. The object is to get scientific validity, not just statistical "significance"; and meaningful answers, not just magisterial numbers.
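The gap between statistical "significance" and a meaningful answer can be shown with a little arithmetic. The Python sketch below is an illustration only: the function name `two_sided_p`, the assumed difference of 0.5 units between two group means, the assumed standard deviation of 10, and the two sample sizes are all hypothetical values chosen here, and the calculation uses a simple normal (z) approximation rather than any procedure discussed in the text.

```python
from math import erfc, sqrt


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided P value, by normal approximation, for the difference between
    two group means that share a known standard deviation `sd`."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    return erfc(z / sqrt(2.0))   # P(|Z| >= z) for a standard normal Z


# The same trivial difference, examined with two assumed sample sizes.
p_large = two_sided_p(mean_diff=0.5, sd=10.0, n_per_group=10_000)
p_small = two_sided_p(mean_diff=0.5, sd=10.0, n_per_group=50)
print(f"n = 10,000 per group: P = {p_large:.4f}")
print(f"n = 50 per group:     P = {p_small:.2f}")
```

With the larger groups the P value falls far below the customary 0.05 boundary, yet the difference is still only one twentieth of a standard deviation. The number has changed; the scientific meaning of the result has not.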
The designs discussed in statistical literature provide splendid examples of mathematical art in the imaginary world for which they were created. They may even be scientifically satisfactory for the agricultural fields or chemical vats in which the models have often been applied. But they do not focus on the basic problems of design in most activities of clinical biostatistics. The statistical concepts do not provide methods for attaining the documentation, precision, validity, and reproducibility that characterize scientific research; and the statistical designers have seldom had either the training or the experience needed to discern subtle scientific distinctions about human populations. Because a group of people has neither the homogeneous uniformity nor the simple responses of a field or vat, investigators cannot properly design and comprehend experiments on people merely by contemplating mathematical principles or by extrapolating statistical theories that may have worked for nonhuman material. Such scientific issues as the choice of hypothesis, appropriateness of sampling, suitability of control, selection of maneuver, stratification of prognosis, and criteria for response are critical principles of design in clinical biostatistics, but these principles are generally overlooked, neglected, or glossed over in most statistical discussions of "design."

When Sir Ronald Fisher inaugurated the campaign for statistical attention to the planning of experiments, clinical biologists needed to be persuaded that the anecdotal doctrines of the past were no longer acceptable, and that medical science required the confirmation of quantitative experiment. After more than three decades, the campaign has generally been quite successful. Clinical investigators now use control groups, randomization, and quantification in designs of biomedical research, and the medical literature contains an abundance of statistical surveys and experimental trials of therapy. Unfortunately, however, many of these projects have been designed according to abstract strategies of statistical principles, rather than realistic methods of clinical science. The control groups are often chosen improperly, and are not truly comparable; the randomization is usually performed promiscuously, with no concern for prognostic heterogeneity; and the quantification is often spurious, because the wrong variables were measured.

Fisher's complaint about the clinical practices of 35 years ago was that "The liberation of the human intellect must remain incomplete so long as it is free only to work out the consequences of a prescribed body of dogmatic data, and is denied the access to unsuspected truths, which only direct observation can give." 19 The pendulum has now swung far enough so that the same complaint is once again cogent, but its target today would be the dogmas of current statistical theories.

As direct observers of experimental phenomena in people, clinical investigators owe statistical colleagues an access to a methodologic description of important biologic principles that are unsuspected, ignored, or minimized in the current infatuation with statistical designs. My object in this paper is to outline the diverse structures that are considered or created in the experiments of clinical research, and to indicate some of the reasons why these structures cannot be adequately designed with current statistical models. Using these structures as a new basis for the architecture of clinical research, subsequent papers in this series will be devoted to operational principles and other details of scientific methods for the constructions.

The structure of clinical experiments

The
basic structure of any experiment consists of a temporal sequence in which a preparation is exposed to a maneuver, and undergoes a response. The preparation is described according to its initial state, and the response is determined by noting the subsequent state either alone or in comparison to the initial state. For the clinical biologic experiments that will be considered here, the "preparation" is a person,* who is healthy or diseased in either the initial state, or the subsequent state, or both. The maneuver performed in the experiment can be chosen by nature, by the investigator, or by the person who acts as the experimental preparation.

*In many activities of contemporary "clinical research," the basic material is an animal, a substance derived from a person or animal, or an inanimate system. The discussion here is limited to the types of clinical research in which the "material" under surveillance is a person, or group of persons.

In this sequence of

    Initial State  --- Maneuver --->  Subsequent State

the crucial scientific issues in designing and analyzing the experiment depend on why the experiment was done, what maneuver was used, who chose the maneuver, and what was the state of the preparation before and after the maneuver.

The experiments of nature. Nature performs at least three different experiments that are of clinical interest. The structure of these experiments is shown in Table I. As an ontogenetic activity, nature allows a healthy person to change as he grows older. As a pathogenetic activity, nature makes a healthy person diseased, or creates a diseased newborn. As a pathogressive activity, nature takes a diseased person through the clinical course of the disease. In most of these events, nature chooses the experimental maneuver, but in other activities, the choice is made by the affected person. Such choices occur, for example, when someone drinks polluted water, or when two people with sickle cell trait decide to marry and produce children.

Table I. The experiments of nature

Initial state    Maneuver                      Subsequent state              Type of experiment
Healthy          Normal growth                 Healthy                       Ontogenetic
Healthy          Development of disease        Diseased                      Pathogenetic
Diseased         Clinical course of disease    Healthy, diseased, or dead    Pathogressive

These experiments of nature receive scant attention in statistical concepts of design, probably because the investigator does not choose the maneuver, and hence engages in no "experimental design." Nevertheless, these experimental activities of nature are constantly studied with statistical methods. Tabulations of these natural experiments are the basis for birth rates, death rates, geographic distributions of disease, and other data of "vital statistics"; and for all of the statistical epidemiologic studies of causes of disease. 10 Such tabulations are also used to determine the "range of normal" for various types of clinical and laboratory data in the conditions created by nature, and to establish the diagnostic concepts for identifying diverse diseases. Moreover, statistical accounts of the clinical course of disease are the basis for the extraordinary activities, described in the next section, in which clinicians impose therapeutic intervention on the events begun by nature.

For investigating these natural phenomena, the researcher's main challenge is not to create an experimental design, but to devise a plan for discerning the design created by nature. He engages in planned observation rather than experimentation, and his activity is usually given the "passive" name of survey, rather than the "active" name of trial or experiment.

The experiments of man. In the observational activities just described, the investigator wants to determine what nature has done, and he accepts nature's "maneuver" as the basic force that connects the initial and subsequent state of each person who forms a statistical unit in the research. In what is usually regarded as an "experiment," however, the main maneuver is
chosen by man. For the activities of clinical investigation, the purpose of this maneuver can be either to change the course of nature, to explain the way nature works, or to provide better methods for collecting and interpreting the investigative data.

Interventional experiments. Clinical therapy is a unique type of experimental activity because it contains the events of two simultaneous experiments, one imposed on the other: an act of man intervening in an act of nature. The purpose of ordinary clinical therapy is to change the course of nature by preventing what nature may do, or altering what nature has already done. According to the patient's initial state and the target chosen to be altered or prevented, therapeutic activities can be classified as remedial, contrapathic, or contratrophic, 9 as shown in Table II. In remedial treatment, such as the relief of pain, the clinician tries to modify or remove a symptom, lesion, or other target that already exists in a diseased patient. In contrapathic treatment, such as immunization against poliomyelitis, the clinician tries to prevent a healthy person from becoming diseased. In contratrophic treatment, such as rigorous regulation of diet and blood sugar for diabetes mellitus, the object is to prevent adverse progress of an established disease.

Table II. The interventional therapeutic experiments of man

Initial state    Maneuver                          Planned subsequent state    Type of experiment
Healthy          Prevention of disease             Healthy                     Contrapathic
Diseased         Alteration of disease             Improved or cured           Remedial
Diseased         Prevention of adverse progress    Not worse                   Contratrophic

Although each act of ordinary clinical therapy contains the temporal sequence of an experiment, the procedure is not usually regarded as an "experiment" because the maneuver for each patient is chosen in an arbitrary ad hoc manner by the attending clinician. No fixed protocol is established for the allocation of treatment within a group of individual patients, and the comparative "controls" depend on results obtained in similar situations of the past. When the results of such individual adventures in ordinary treatment are collected and analyzed, the research is called a therapeutic survey. Unlike the activities studied in an observational survey, routine treatment involves a maneuver of man imposed on a maneuver of nature, and the purpose of a therapeutic survey is to discern what man's intervention has done to the course of nature. When this same experimental sequence and purpose are carried out with a prearranged plan for choosing comparative groups and for allocating treatment to each patient, the activity is regarded as truly experimental and is called a therapeutic trial.

Explanatory experiments. In the therapeutic activities just described, the maneuver was planned to create a permanent change in the patient's condition. An existing initial state was to be remedied, or an expected subsequent state was to be prevented. In many other clinical experiments, however, the motive is not to change the course of nature, but to explain the way in which natural phenomena are created or altered. In such experiments, the maneuver is used as a stimulus for transient reactions that are analyzed as the "response," but the patient returns to the initial state after the experiment is completed.

Probative experiments. The response to an experimental maneuver is often used for identifying the patient's capacities in his initial state, or for differentiating his condition from the initial state encountered in other patients. Thus, the electrocardiographic changes after a burst of exercise will sometimes distinguish healthy people from those who have coronary artery disease; the response of serum and urine in a
glucose tolerance test is often used for the diagnosis of diabetes mellitus; and various types of climatic and other physical stimuli may be employed to determine the range of physiologic response in normal people. These procedures are an active experimental counterpart of the more passive observational activities, described earlier, that are used to determine the range of normal and diagnostic boundaries for various conditions of health and disease. Although the procedures can be called experiments, because the investigator chooses the maneuver and uses control groups, each experiment really is conducted as a "probe" to identify what was created during the antecedent "experimental" activities of nature. A design is prepared for these clinical experiments, but the basic aim of the design is to discern what happened in the original "design" prepared by nature.

Maneuveral experiments. In this type of experiment, the goal is to determine whether a particular maneuver can elicit a specified response, or to assess the effects of different degrees of the maneuver. An example of such an experiment is the attempt to induce disease by inhalation or injection of microbial substances. More characteristic examples of these experiments are the procedures performed in "phase II" therapeutic investigations for appraising the mode of action and optimal dosage of pharmaceutical agents. Unlike the probative type of explanatory experiments, which are intended to elucidate a stimulus and response created by nature, maneuveral experiments are intended to clarify the operation of an agent of man.

Conjunctive experiments. Certain experiments are prepared as a conjunction of one or more of the two types of experimental activities just described. For example, as the first part of the total experiment, a healthy volunteer may have malaria deliberately induced so that the second part of the experiment can be used for testing the antimalarial therapeutic properties of a new pharmaceutical agent. A combined probative and maneuveral experiment is conducted when healthy and diseased patients are exposed to varying doses of a pharmaceutical agent to determine whether the patients with different clinical states respond differently, and whether the responses depend on the doses employed in the maneuver.

Methodologic experiments. The last main type of clinical experiment is intended neither to change the course of nature nor to explain it, but rather to investigate the methods used for the other types of activities. Examples of such methodologic explorations would be a study of observer variability in the interpretation of roentgenograms, or a comparison of a patient's history obtained by a computer "interview" versus that obtained by a physician. In such procedures, the "material" investigated in the "experiment" is the human or mechanized observational apparatus; the "maneuver" is the exposure of this "apparatus" to the film or patient; and the "response" is the roentgenographic interpretation or history that emerges from the exposure. These methodologic explorations are seldom regarded as truly experimental because they are based on "passive" observations of the phenomena used in obtaining and interpreting data, and no "active" attempt is made to probe or treat the state of the patients used in the "maneuver." Nevertheless, investigations of this type are critical prerequisites for the scientific validity of the data obtained in any other form of clinical research.

The inadequacies of statistical design in scientific architecture

Most of these structures in clinical research cannot be adequately planned with the concepts described in current statistical writings about experimental design. Randomized blocks, Latin squares, lattice arrangements, and other statistical stratagems can often be used successfully in research whose scientific architecture has already been well designed. But the statistical models are not pertinent for many of the types of clinical investigation just cited, and, when pertinent, are too superficial for the fundamental demands of scientific rigor. The statistical tactics of "design" depend on the basic
assumptions that the research is being performed as an experiment, that the experimental material and its responses can be reproducibly identified, that "random samples" can be readily obtained, and that "random allocations" will provide satisfactory solutions to problems in "control," but all of these assumptions are either too naive or too erroneous for scientific design in clinical investigation.

The concept of an experiment. Statistical principles of experimental design are not pertinent for the many observational surveys, therapeutic surveys, and methodologic explorations that provide the fundamental data of clinical science. Since these surveys and explorations are neither designed nor conducted as "experiments," their intellectual construction is not considered in statistical descriptions of "experimental design."

Furthermore, for the therapeutic trials and probative experiments that can truly be regarded as experiments, statistical tactics in design are scientifically superficial because they are based on a model that is oversimplified. These clinical investigations contain the profound complexity of a simultaneous dual experiment, in which a design of man is imposed on a "design" of nature for the purpose of explaining or changing nature's activities, but the statistical models are planned to manage only one major experimental activity, not two.

Reproducible identification of material. Statistical models also begin with the assumption that the experimental material can be reproducibly identified, but this elementary necessity of science cannot be taken for granted in clinical epidemiologic research, and its achievement is one of the most difficult challenges in planning the research.

In the diverse epidemiologic surveys that depend on occurrence rates of disease at different times and places, a widespread contemporary fallacy is the belief that statistical tabulations of a disease represent that disease. What these tabulations represent, of course, is the rate of diagnosis of that disease, rather than its actual occurrence. Because "disease" is a wholly intellectual concept of nosology, and because both the nosography and diagnostic technology of disease are constantly changing, statistical enumerations of rates of "disease" cannot be scientifically satisfactory unless accompanied by a satisfactory assessment of the diagnostic procedures extant at the time and geographic locale for each item of the reported statistics. 10

Another common fallacy in epidemiologic research based on "vital statistics" is the belief that the occurrence rate of a particular disease becomes scientifically validated if the available evidence has been reviewed to confirm the diagnosis whenever that disease was recorded on a death certificate. This type of confirmation can only eliminate "false positive" diagnoses; it does not detect the "false negative" situations in which the disease occurred without being diagnostically reported.

These errors are so ubiquitous and their scientific effects are so virulent that they cast major doubt on the validity of any of the massive epidemiologic tabulations dealing with changing rates of incidence, prevalence, and mortality for diverse diseases, and with statistical conclusions about causes of disease. The main scientific challenge in the design of this type of research is not in using the statistics, but in discerning changes created in "disease" by changes in the standards, dissemination, and application of nosologic and technologic principles of diagnosis. 10

In surveys and trials of therapy, the clinical problems of diagnosis are different from the epidemiologic difficulties just cited. In treatment, the main diagnostic sources of variability are not the diverse techniques used to identify "disease" in different eras and geographic locales, but the diverse ways in which a particular set of techniques is applied by different physicians. Consequently, a scientific necessity in any investigation of therapy is a clear, precise statement of the criteria used for diagnosis of the disease under treatment. Despite all the recent attention given to
statistical methods of therapeutic design, such criteria are frequently absent from the clinical literature. For example, adequate diagnostic criteria were omitted in 24 of 32 prominent clinicostatistical studies of anticoagulant therapy for acute myocardial infarction 22; the pretherapeutic criteria of "operability" were described inconsistently and nonreproducibly in 26 prominent clinicostatistical reports 14 of surgery for carcinoma of the lung.

The concept of "random sampling." One of the most pernicious scientific delusions now prevalent in the world of medical research is the idea that concepts of "random sampling" can be readily applied to clinical populations. This idea is completely vitiated by the use of patients as the "material" of clinical investigation, because a patient (unlike an agricultural field, chemical vat, or the material of any other type of experimentation) chooses the investigator, rather than vice versa. Before a person can become a statistical unit in a study of disease, he must traverse a long and intricate series of directional transitions that lead him from his medical anonymity at home to his statistical inclusion in a collection of data. After nature has created the disease, the diseased person must be provoked, by symptoms or by other events, to see a doctor and to become a patient. According to the iatrotropic 8, 10 and other stimuli, the doctor may or may not suspect the existence of that disease in the patient. According to the intensity of the doctor's suspicions and his available diagnostic procedures and criteria, he may or may not be able to diagnose that disease. If the disease is suspected or diagnosed, the patient may then be referred to various other consultant doctors, who in turn may select the hospital to which he is referred for further "work-up." According to the consultants, the hospital, the diagnostic facilities of the hospital, a variety of socioeconomic features, the presence of a suitable investigator at that hospital, and the patient's willingness to participate in both the initial plans of his doctors and the subsequent plans of the investigator, the patient may then appear as a unit of data in a statistical series.

None of these activities can possibly be regarded as "random." Each of them is strongly motivated (or biased) by the many decisions made at each transition that moved the patient from one part of the spectrum 8, 10 of the disease into another. Unless these decisions are carefully identified and classified, the statistical collection of patients with that disease will be scientifically meaningless because the results cannot be extrapolated. The patients represent no one except themselves; they are a "sample" of a larger population that cannot be specified because of the many alterations created by such determinant features as iatrotropic stimuli, fashions in medical "work-ups," diagnostic criteria, pretherapeutic criteria, the effects of coexisting diseases, the patients' acceptance of the doctors' proposals, 13 and the chronologic "date-marks" used for tabulating statistics.

Nevertheless, clinicians and statisticians seldom classify or tabulate these determinant features, and persist in analyzing the statistical data with concepts based on "random sampling." In most enumerated surveys or probative experiments designed to demonstrate the characteristics of hospitalized patients with a particular disease, these determinant features are rarely cited. Upon this scientifically defective description of the collection of patients, there is then imposed the statistical absurdity of calculating "standard errors" and "95 per cent confidence intervals" to estimate the "true parameters" of the mythical "base population" of which this polygenous collection of the disease is assumed to be a "random sample." Worse yet, a "control" group of people who do not have the disease is
generally selected, by even more defective methods, from the rest of the patients in the hospital, and the "parameters" are then estimated for the mythical "base population" of the "controls."

The problem of choosing suitable control groups for this type of "retrospective" or "cross-sectional" study of etiology and pathogenesis of disease will be considered in a later paper of this series. The main point to be noted here is that if hospitalized patients with a particular disease are "random" representatives of nothing, hospitalized "controls" represent even less. Nevertheless, statistical "tests of significance" are constantly performed upon the differences in the estimated populational "parameters" of the fallacious and bizarre mathematical imagery of these "random samples." The statistical problems of inappropriate "parametric estimations" can be avoided with nonparametric tests, quantile methods, and other suitable techniques,2,27,33 but these statistical alternatives do not affect the basic scientific error of using the poorly identified "disease" and "control" groups for extrapolations to a more general population.

The concept of "random allocation"

A different facet of the random problem occurs in allocating treatment for a therapeutic trial. In the types of research described in the previous section, the investigator wants to compare a group of diseased patients with other people who do not have the disease, and one of his main scientific problems is to ascertain that the compared groups are properly representative of the external population of diseased or nondiseased people. In a trial of treatment, however, the investigator deals only with people who have the disease. His choice of a "control" group can depend on his own maneuver in assigning therapeutic agents, rather than on nature's maneuver in creating disease. In this situation, one of the investigator's main scientific concerns is that the "control" group and the "treated" groups be selected without bias.

Statistical techniques of "random allocation" seem to fulfill this requirement because they offer the opportunity to assign treatment randomly (as well as to estimate the "parameters" of the treated populations). Unfortunately, however, in order for the therapeutic maneuver to be the main variable in the "design," the statistical concepts require that the initial experimental material be "homogeneous." This requirement cannot be achieved in many clinical trials, particularly in chronic disease, in which the patients are extremely heterogeneous in their prognosis for whatever target is under treatment. By conducting a pathogressive "maneuver" in the outcome of disease concomitant with the clinician's maneuver in therapy, nature creates the complex dual experiment that invalidates simplistic statistical designs based on a single maneuver alone.

Unless the prognostic differences of the patients are appropriately identified and classified, the treatment will be allocated indiscriminately to "good risk" and "bad risk" cases whose basic differences are left unrecognized. This type of randomization may remove promiscuous bias in the assignment of treatment, but it also removes clinical sense in the evaluation of results. The results will be clinically meaningless because a clinician will not know how to apply them in the future; he cannot determine whether "good risk" and "poor risk" patients responded the same way to each therapeutic agent. The results may even be clinically misleading because, as discussed elsewhere,11 agent A and agent B may have exactly the reverse therapeutic effects in "good risk" and "bad risk" patients, but the differences will be obscured when all the results are added up in a grand statistical conglomeration that ignores prognostic differences.
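The point about reversed effects being hidden by pooling can be made concrete with a small numerical sketch. The counts below are invented purely for illustration (they come from no study), and the names `trial`, `pooled_rates`, and `stratum_rates` are hypothetical. The sketch only shows how two agents with opposite effects in "good risk" and "bad risk" patients can look identical once the strata are added together.

```python
from fractions import Fraction

# Hypothetical counts, invented for illustration only:
# stratum -> agent -> (patients who improved, patients treated)
trial = {
    "good risk": {"A": (45, 50), "B": (35, 50)},
    "bad risk": {"A": (5, 50), "B": (15, 50)},
}


def stratum_rates(data):
    """Success rate of each agent within each prognostic stratum."""
    return {
        stratum: {agent: Fraction(s, n) for agent, (s, n) in agents.items()}
        for stratum, agents in data.items()
    }


def pooled_rates(data):
    """Success rate of each agent after the strata are added together."""
    totals = {}
    for agents in data.values():
        for agent, (s, n) in agents.items():
            prev_s, prev_n = totals.get(agent, (0, 0))
            totals[agent] = (prev_s + s, prev_n + n)
    return {agent: Fraction(s, n) for agent, (s, n) in totals.items()}


by_stratum = stratum_rates(trial)
pooled = pooled_rates(trial)

for stratum, rates in by_stratum.items():
    print(stratum, {agent: float(rate) for agent, rate in rates.items()})
print("pooled", {agent: float(rate) for agent, rate in pooled.items()})
```

With these assumed counts, agent A does better in the good-risk stratum and agent B does better in the bad-risk stratum, yet the pooled rates are equal, so a single overall comparison would report no difference between the agents.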
Aware of prognostic heterogeneity in patients, some statisticians attempt to stratify the treated population into "comparable" groups. The trouble with most of these stratifications is that the "comparable" groups are usually selected according to the demographic features of age, race, and sex, instead of the clinical and paraclinical phenomena that are harbingers of prognosis. The problems of a suitably correlated prognostic stratification have been described elsewhere11 and are beyond the scope of the discussion here; they currently constitute one of the major obstacles that impede scientific and statistical progress in the design of therapeutic clinical trials.

Reproducible identification of response and other pertinent data

The last item to be contemplated here is a common type of statistical advice that actually hinders the clinical investigator's quest for science. Because of the concepts that underlie most statistical theories, a statistician generally likes to deal with continuous rather than categorical variables, and because of his desire for precise information, he generally prefers "hard" rather than "soft" data. A continuous variable—such as age, height, and serum cholesterol—can be expressed numerically as a dimension on an established scale that permits "continuous" gradations. A categorical variable—such as sex, occupation, ability to work, and severity of chest pain—is expressed either as a titular name or an ordinal rank in a category that does not have continuous values on a numerical scale. "Hard" data consist of objective facts like age, sex, serum cholesterol, and death; whereas "soft" data contain subjective statements and judgments such as severity of pain and functional status.

To avoid subjective information, a statistician giving advice about the design of clinical surveys and trials will often concentrate on "hard" data, and particularly on data expressed in continuous variables. This choice of data seems desirable because it avoids the scientific problems of using subjective information that is unstandardized and possibly unreliable, and it avoids the statistical problems of developing suitable methods for analyzing categorical rather than continuous data. Unfortunately, however, this choice also avoids the clinical and scientific necessity for getting meaningful results. For the sake of this apparent scientific and statistical convenience, the data used to assess post-therapeutic response may be "hard" and continuous, but inappropriate. For example, the treatment of disabling angina pectoris is often expressed not in terms of what happened to the patient's chest pain or incapacitation, but in terms of his serum cholesterol or electrocardiographic waves; the "palliation" of cancer is usually reported according to survival time, white blood count, and roentgenographic abnormalities instead of the true palliation of the patient's discomfort, distress, and functional quality of life.

The persistent focus on these inappropriate choices of crucial variables serves to enhance and perpetuate scientific deficiencies in clinical procedures. Clinicians could improve the quality of clinical science by preserving their attention to the important information contained in "soft data," establishing rigorous criteria for "dissecting" categorical expressions into component judgmental parts,8 and performing suitable studies to remove or minimize observer variability in collecting the data needed for those decisions. Instead, however, clinicians and statisticians often accept the reasoning of "clinical judgment" as a mystique that is beyond the reach of analytic science, and ignore crucial judgmental information in favor of "hard" data that may be scientifically and statistically satisfying but clinically useless.

The architecture of clinical biostatistics

Not all statisticians have become obsessed with a purely statistical approach to experimental design, and thoughtful biostatisticians have pointed out its follies25 and dangers.38 When experienced biostatisticians analyze the plan of a research project, they generally concentrate on judging the basic scientific principles that must be managed before any statistical theories of design can be applied. In a recent paper describing "misleading medical research," Stanley Schor32 emphasized the need for
careful statistical judgment in at least five basic prerequisites to scientific research: Did the researcher choose the right people to question or experiment on? Did the researcher choose a statistical unit that made his problem solvable? Did the researcher use a control group and choose and use it properly? Are the groups compared truly comparable? Did the researcher guard against a probable bias in the people he was testing?

Each of these five issues as well as many of the other scientific principles cited here and elsewhere1 are constantly mismanaged in clinical investigation, because a distinct methodologic discipline has not been established for this type of research. A clinical or epidemiologic investigator who wants advice about suitable ways to choose control groups, define statistical units, avoid bias, establish criteria, arrange comparability, and perform many of the other prerequisites to scientific research cannot find published specifications for the activities. The statistical literature contains many precise instructions for tests that quantify the operations of chance, but almost no instructions for the judgment needed to make the decisions of science.

Clinical biostatistics has thus been obscured in the mystiques of two types of nondescript "judgment"—an artful clinical "judgment" whose rational characteristics are omitted from clinical publications dealing with medical science, and a scientific-statistical "judgment" whose operational details are omitted from statistical publications dealing with artful design. An able clinician constantly uses clinical judgment in his investigative decisions but does not describe its components; and an able statistician engages in an analogous occultation of the statistical judgment used to plan research.

"Mathematics," said Prof. P. G. H. Gell,21 "has now matured into a sacred cow which as often as not gets in the way of scientific traffic." The current collection of abstract statistical models for the design of experiments has not been and cannot become satisfactory for the realistic scientific architecture of clinical biostatistics. The difficulty has been well summarized by a mathematician29:

A frequent cause of incompatibility between the model and reality is the neglect or confusion among influential variables or the imposition of unrealistic bounds on variables. . . . A proof within a mathematical model proves nothing in biology. The choice or design of a valid mathematical model is dictated by what is needed to describe the real situation. . . . This requires a well-structured imagery, based on deep knowledge and understanding of the real situation, and it is precisely here that the major influence of workers in biology must be felt. My choice of the term imagery was not accidental. It was chosen to connote the imaginative, interpretive, even poetic outlook which biologists must provide.

A methodologic research discipline in clinical biostatistics will require a "well-structured imagery" that represents the fusion of realistic science and poetic art. Since this fusion is better suggested by the term architecture than by the existing name design, the surveys, trials, experiments, and explorations discussed in this paper can be regarded as an outline of the structures produced in the architecture of clinical research. The next few papers in this series will be devoted to operational details and other principles of this biostatistical architecture.

References

1. Baird, D. C.: Experimentation: An introduction to measurement theory and experiment design, Englewood Cliffs, N. J., 1962, Prentice-Hall, Inc.
2. Bradley, J. V.: Distribution-free statistical tests, Englewood Cliffs, N. J., 1968, Prentice-Hall, Inc.
3. Chapin, F. S.: Experimental designs in sociological research, revised ed., New York, 1955, Harper & Row, Publishers.
4. Cochran, W. G., and Cox, G. M.: Experimental designs, New York, 1950, John Wiley & Sons, Inc.
5. Cox, D. R.: Planning of experiments, New York, 1958, John Wiley & Sons, Inc.
6. Edwards, A. L.: Experimental design in psychological research, ed. 3, New York, 1968, Holt, Rinehart & Winston, Inc.
7. Federer, W. T.: Experimental design, New York, 1955, The Macmillan Company.
8. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.
9. Feinstein, A. R.: Clinical epidemiology: I. The populational experiments of nature and of man in human illness, Ann. Intern. Med. 69:807-820, 1968.
10. Feinstein, A. R.: Clinical epidemiology: II. The identification rates of disease, Ann. Intern. Med. 69:1037-1061, 1968.
11. Feinstein, A. R.: Clinical epidemiology: III. The clinical design of statistics in therapy, Ann. Intern. Med. 69:1287-1312, 1968.
12. Feinstein, A. R., Koss, N., and Austin, J. H. M.: The changing emphasis in clinical research. I. Topics under investigation. An analysis of the submitted abstracts and selected programs at the annual "Atlantic City Meetings" during 1953-1965, Ann. Intern. Med. 66:396-419, 1967.
13. Feinstein, A. R., Pritchett, J. A., and Schimpff, C. R.: The epidemiology of cancer therapy. II. The clinical course: Data, decisions, and temporal demarcations, Arch. Intern. Med. 123:323-344, 1969.
14. Feinstein, A. R., and Spitz, H.: The epidemiology of cancer therapy. I. Clinical problems of statistical surveys, Arch. Intern. Med. 123:171-186, 1969.
15. Finney, D. J.: Experimental design and its statistical basis, Chicago, 1955, The University of Chicago Press.
16. Fisher, R. A.: Statistical methods for research workers, Edinburgh, 1925, Oliver & Boyd, Ltd.
17. Fisher, R. A.: The design of experiments, Edinburgh, 1935, Oliver & Boyd, Ltd.
18. Fisher, R. A.: The design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd.
19. Fisher, R. A.: The design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd., p. 9.
20. Fisher, R. A.: The design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd., p. 33.
21. Gell, P. G. H.: quoted in Research and imagination, Lancet 1:273, 1969.
22. Gifford, R. H., and Feinstein, A. R.: A critique of methodology in studies of anticoagulant therapy for acute myocardial infarction, New Eng. J. Med. 280:351-357, 1969.
23. Herzberg, A. M., and Cox, D. R.: Recent work on the design of experiments: A bibliography and a review, J. Royal Statist. Soc. (Series A) 132:29-67, 1969.
24. Hicks, C. R.: Fundamental concepts in the design of experiments, New York, 1964, Holt, Rinehart & Winston, Inc.
25. Hogben, L.: Statistical theory: The relationship of probability, credibility and error, London, 1957, George Allen & Unwin.
26. Kempthorne, O.: The design and analysis of experiments, New York, 1952, John Wiley & Sons, Inc.
27. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
28. Mendenhall, W.: Introduction to linear models and the design and analysis of experiments, Belmont, Calif., 1968, Wadsworth Publishing Company, Inc.
29. Nooney, G. C.: Mathematical models, reality, and results, J. Theor. Biol. 9:239-252, 1965.
30. Peng, K. C.: The design and analysis of scientific experiments, Reading, Mass., 1967, Addison-Wesley International Division.
31. Quenouille, M. H.: The design and analysis of experiment, London, 1953, Charles Griffin & Company, Ltd.
32. Schor, S. S.: How to evaluate medical research reports, Hosp. Physician 5:95-109, 1969.
33. Siegel, S.: Nonparametric statistics for the behavioral sciences, New York, 1956, McGraw-Hill Book Company, Inc.
34. Snedecor, G. W.: The statistical part of the scientific method, Ann. N. Y. Acad. Sci. 52:792-799, 1950.
35. Vajda, S.: The mathematics of experimental design: Incomplete block designs and Latin squares, London, 1967, Charles Griffin & Company, Ltd.
36. Winer, B. J.: Statistical principles in experimental design, New York, 1962, McGraw-Hill Book Company, Inc.
37. Wortham, A. W., and Smith, T. E.: Practical statistics in experimental design, Dallas, 1960, Dallas Publishing House.
38. Zelen, M.: The education of biometricians, Amer. Statis. 23:14-15, 1969.
CHAPTER 3

Components of the research objective

Because the scientific fundamentals of clinical research are omitted in most statistical discussions of the "design of experiments," a new scientific architecture must be developed as a methodologic discipline for planning clinical investigations. These investigations can be constructed in several different ways, according to the events, problems, and challenges contained in the clinical experiments of nature and of man. In the previous paper of this series, I outlined those diverse investigative structures, which include the following: surveys of nature's activities in preserving health, or in creating and evolving human disease; surveys and experimental trials of man's therapeutic attempts to intervene in the course of nature; explanatory experiments that probe a maneuver of nature or delineate a maneuver of man; and exploratory efforts to assess and improve the methodology used in the foregoing research activities.

This paper and its two successors in this series are concerned with the "architectural" operations used in constructing those projects. The last of the research structures just cited—the methodologic exploration—is arranged quite differently from the other activities and will not be considered further in this discussion, which is confined to the construction of clinical surveys and experiments.

The operational concepts that are described here are not original. They have been expressed, in whole or part, in various treatises on scientific methods of research,1,3,7,[...],18,20 and although given scant attention in conventional statistical texts, the principles have received extensive discussion in the first half of Donald Mainland's Elementary Medical Statistics. My own contribution is to arrange these principles in a different way, based on the sequential events of an experiment. This sequence occurs as

Initial State --- Maneuver ---> Subsequent State

and its investigative architecture is designed during ten operations that are introduced or combined at various stages in the contemplation of the sequence. The first operation is outlined in this paper; the other nine operations will be outlined in the next two papers of this series; and further details of the operational structures and processes will be discussed subsequently.

This chapter originally appeared as "Clinical biostatistics—III. The architecture of clinical research." In Clin. Pharmacol. Ther. 11:432, 1970.
At the onset of design

The first operational principle in planning a research project is to stipulate the objective of the investigation. Although this principle seems obvious, or perhaps because it is so obvious, it is often ineffectually managed in proposals or reports of clinical research. In meeting with a biostatistical consultant, an investigator may expansively describe the background for his proposed research, the reasons he wants to do it, and its presumptive importance, but he often does not specify the components or the logic of what he wants to do. The specifications are also often omitted when a completed project is reported in the literature. The investigator's failure to make a clear, precise statement of the objective of the research is detrimental both to the biostatistician who helps design a project, and to the reader-reviewer who later decides whether the work is worth doing or who appraises its completed results, because all of the detailed planning and analysis of the research depend on the original stipulation of the objective.

An adequate description of the initial state, the [...] and logic for the maneuver, and the subsequent state of the observed population are prerequisite to both the scientific methods and the artistic taste that enter into the architecture of a research project. The scientific methods are used to achieve the investigator's objective, and the work will inevitably become misdirected or incomplete if the objective is not suitably specified. The artistic taste is needed for evaluating the general importance and worth of the research, and for making the many intermediary decisions during the choice of methods and procedures.

This discussion is concerned mainly with the logical issues in scientific method, because the artistic issues cannot be readily defined or described. The crucial roles of "artistic" judgment will be noted when they appear, but the performance of the judgments involves complexities that are beyond the scope of this discussion. Among such judgments are the "common sense" used for intermediate decisions during the architectural construction, and the basic original decision that the research project is worth doing. If we agree that the work should be done, we can then proceed to its plans.

In a clinical investigation, the objective is usually stated in the form of a question. "What happens," the investigator may say, "when something X is done to something Y?" In this original statement of the research question, the investigator almost always indicates the principal maneuver under surveillance, and he usually also mentions or implies the initial state of the population. Regardless of how well the question has been stated, however, it almost never contains all the details needed to implement the objective of a research project. Consequently, the first main principle of biostatistical architecture is to expand the investigator's original statement into adequate specifications of the initial state and subsequent state of the population, and to establish the necessary distinctions of the principal and subsidiary maneuvers that may be employed.

A. Initial state. The initial state of a population requires two different specifications—a diagnostic account of the existing conditions of health or disease, and a prognostic anticipation of the likelihood for achieving the subsequent state. For the illustrations that follow, let us consider a mythical new drug, Excellitol.

1. Diagnostic demarcation. Suppose an investigator asks the question, "Is Excellitol good for angina pectoris?" From the wording of this question, we do not know the initial state of the population. We cannot tell whether the purpose of the therapeutic maneuver is remedial, to relieve angina pectoris whenever the pain occurs; or contrapathic, to prevent angina in people who have never had it; or contratrophic, to prevent an adverse course of coronary artery disease in patients who have had angina.
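As a minimal sketch of what "expanding the original question into adequate specifications" might look like in practice, the fragment below records a research objective as its three parts (initial state, maneuver, subsequent state) together with the purpose of the maneuver, and lists whatever has been left unstated. The class and function names (`Purpose`, `ResearchObjective`, `missing_specifications`) and the example wordings are hypothetical, introduced here only for illustration; they are not taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Purpose(Enum):
    """The three purposes distinguished for a therapeutic maneuver."""
    REMEDIAL = "relieve the condition whenever it occurs"
    CONTRAPATHIC = "prevent the condition in people who have never had it"
    CONTRATROPHIC = "prevent an adverse course in people who already have it"


@dataclass
class ResearchObjective:
    """Hypothetical record of a stated research question."""
    maneuver: str
    initial_state: Optional[str] = None     # who is to be studied
    purpose: Optional[Purpose] = None       # what the maneuver is meant to do
    subsequent_state: Optional[str] = None  # what is to be observed afterward


def missing_specifications(objective: ResearchObjective) -> List[str]:
    """Return the parts of the objective that the question leaves unstated."""
    missing = []
    if objective.initial_state is None:
        missing.append("initial state")
    if objective.purpose is None:
        missing.append("purpose of the maneuver")
    if objective.subsequent_state is None:
        missing.append("subsequent state")
    return missing


# "Is Excellitol good for angina pectoris?" names only the maneuver.
vague = ResearchObjective(maneuver="Excellitol")

# One of several possible ways to make the same question specific.
specified = ResearchObjective(
    maneuver="Excellitol",
    initial_state="patients who have had angina pectoris",
    purpose=Purpose.REMEDIAL,
    subsequent_state="relief of chest pain when an attack occurs",
)

print(missing_specifications(vague))
print(missing_specifications(specified))
```

The vague question yields all three missing items, whereas the specified version yields none; a contrapathic or contratrophic reading of the same question would call for a different initial state and a different subsequent state.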
Another unsatisfactory wording of a research question would be: "Does Excellitol retard the growth of normal children?" Aside from the later problems of defining what is meant by "retard" and "growth," this question does not indicate what kind of normality is sought in the initial state of the children. Are they to be "normal" in general physical health, or must they also be free of any mental, psychic, or biochemical abnormalities?

As the first step in architectural specifications, therefore, the initial state must be diagnostically demarcated according to the types of healthy or diseased people that are to be assembled for the investigation.

2. Prognostic stratification. In the type of problem just cited, the state of the initial population was diagnostically imprecise. A more subtle and more common difficulty occurs when an investigator asks the question, "Does Excellitol lower the fatality rate in patients with acute myocardial infarction?" In this situation, the diagnostic state of the population is specified and the purpose of the therapeutic maneuver is clearly noted as contratrophic, since it is intended to prevent death in diseased patients. Nevertheless, the question is imprecise because it does not specify the types of patients to be considered within the broad clinical spectrum of acute myocardial infarction.

Because of nature's underlying "maneuvers" in producing and evolving the course of a myocardial infarction, the disease has a diverse clinical spectrum of patients with major differences in their prognostic anticipations. The fatality rate, although quite low in "good-risk" patients who have had no symptoms other than transitory chest pain, is quite high in "bad-risk" patients with shock, pulmonary edema, and/or significant arrhythmias.15,17 Although both types of patients share the diagnosis "acute myocardial infarction," their anticipated outcomes are so disparate that the patients are not comparable. They represent two different "experiments" instigated by nature, and they must be separated into two different groups within which therapy can be assessed.3,4

The stratification performed to divide these prognostically disparate groups is thus used not to create different "factorial blocks" in the same experiment, but to establish the initial populations for different experiments, within each of which the same therapeutic maneuvers may then be imposed. Because of these distinctions, groups of patients with contrasting prognoses should not be statistically mingled in an analysis of variance as though they had "interactive variables." The "good-risk" population forms one experiment; the "bad-risk" forms another; and their statistical appraisals must be planned accordingly.

B. Subsequent state. The subsequent state of the population serves as the target of the experimental maneuver. In a survey of the ontogenetic, pathogenetic, or pathogressive activities of nature, this target represents nature's production of various states of health or disease. In a survey or trial of man's therapeutic activities, the target represents the remedial or prophylactic goal to be achieved by the treatment. In an explanatory experiment, the target is selected as a particular response to be evoked by the chosen maneuver. To plan the architecture of a clinical research project, the targets in the subsequent state must be identified, differentiated, and prognostically correlated.

1. Identification of targets. Since a person can change or react in innumerable ways while under observation, and since an investigator could not assemble detailed information about all of these possible responses, the targets of the research must be identified to denote the basic observations that are to be performed and analyzed. In all of the examples that follow, the primary target of the research maneuver is inadequately cited in the expressed question, and the reasons are noted in the associated parenthetical question:

"Is Excellitol hazardous to health?" (What kind of hazard?)
"What is the optimal dose of Excellitol?" (Optimal for what goal?)
"Is Excellitol a good analgesic agent?" (For what type of pain?)
"Is Excellitol good for angina pectoris?" (To prevent angina or to relieve it?)

In addition to a clear identification of the primary target, all of the other possible targets of the maneuver must be specified, so that suitable plans can be made for observation of the subsequent state.

2. Differentiation of targets. Although the maneuver of an experiment has a primary target, many other events can be studied as ancillary targets, and additional phenomena may be noted as incidental targets. For example, although the primary target of Excellitol therapy may be the relief of angina pectoris, the investigator may also want to examine such ancillary targets as the palatability of the medication, the convenience of administration, and the occurrence of symptomatic side effects. Furthermore, although Excellitol may not be expected to affect the white blood count or blood urea nitrogen, these entities may be chosen for observation to ensure that they have not incidentally received an adverse effect from the drug. Thus, in listing all the properties that are to be observed as possible targets in the subsequent state, the investigator must differentiate the primary, ancillary, and incidental role of the targets in the research.

3. Correlation of prognostic strata. One of the main reasons for differentiating the targets of an investigated maneuver is to provide an appropriate correlation for performing the prognostic stratification of the initial state. Since members of the same population will have different prognoses for different targets, the selection and classification of the variables used for a particular prognostic stratification will depend upon the chosen target. Thus, the same group of healthy women might be classified in one way according to their likelihood of becoming pregnant, and in another way according to their risk of developing thrombophlebitis; similarly, in patients with acute myocardial infarction, the prognostic classifications for the duration of chest pain are different from the classifications that demarcate 30-day survival.

Because of these distinctions, a prognostic stratification cannot be designed arbitrarily according to age, race, sex, etc., and must, instead, be planned in direct correlation with the particular target under surveillance.4 If the primary target of a particular maneuver is to prevent event A, the initial population should be stratified according to their risk of achieving event A. If a different event, B, is encountered as a "side effect" of the maneuver, the proper evaluation of the results requires that the population receive a completely separate stratification according to the risk of achieving event B.

C. Maneuver. All of the features just cited refer to a suitable identification and classification of the population before and after the maneuver. The next step in architectural design deals with the logic needed to ensure that the maneuver is suitably applied and evaluated. This logic is based on the maneuver's potency and on the comparison, multiplicity, and concurrence of additional maneuvers.

1. Potency. The potency of a maneuver refers to its capacity to achieve the desired target state. This capacity may depend on the dose or intensity of the maneuver, and on its manner of administration. For example, if we wish to test whether cigarette smoking causes a particular disease, what will be the acceptable quantity of cigarettes for someone to be regarded as a "smoker"? Should the classification of nonsmoker or stopped-smoker be used for a person who has not smoked for decades after having consumed a total of several packages at diverse times during adolescence? Should we distinguish cigarette smokers who "inhale" the smoke from those who do not?

A more subtle problem in potency of the maneuver deals with pharmaceutical
agents—such as digitalis, aspirin, anticoagulants, and insulin—that may require a different dosage in different people to produce the desired effect. If these agents are given at a fixed dosage to all the patients in an investigation of therapy, the clinical activity is improper because some patients will receive too much of the drug and others will receive not enough. On the other hand, if the agents are given in variable dosage, the statistical analysis may seem to be confounded by the necessity to evaluate the many different dosages of the same treatment.

An analogous type of problem arises if we want to compare surgical versus medical therapy of a particular disease, but find that each surgeon may want to perform the operative procedure in a slightly different way, according to his own particular skills and judgment. If [...] are denied the opportunity to employ their own minor variations in performing the operation, the surgical [...] not be carried out in an optimum manner by each of the individual operators.

In the first type of situation (choice of an appropriate "dose" and "inhalation" of cigarettes), there is no scientific solution for the problem. The decision must be made according to whatever seems sensible to the investigator and acceptable to the people who review the results. In the second and third type of situation (dealing with variable doses of drugs or minor modifications of a surgical procedure), the most scientifically appropriate decision is to recognize that we want [...] its individual [...] way of performing the selected maneuvers. The statistical appraisals are not confounded by this decision, because our objective is to test a particular type of maneuver, rather than each of its minor variations. Accordingly, [...] "potency."

2. Comparison. When an investigation is performed for purely descriptive purposes—as for example, in surveys of the growth, development, or treatment of a state of health or disease—no comparative maneuver need be considered. Examples of research containing no comparative maneuvers are the investigations conducted to answer the following questions: "Does blood pressure increase as healthy people grow older?"; "What is the incidence of pulmonary embolism in hospitalized patients?"; and "What have been the results of cardiac transplantation?"

In many other types of clinical research, however, a comparative maneuver is employed as part of the necessary scientific logic of the experiment. We could not demonstrate, for example, that Excellitol is hazardous to health unless we had data [...] maneuver of not taking Excellitol. Nor could we assess the value of Excellitol in relieving angina pectoris unless some other mode of relief was compared. For many investigative
the complete archi"maneuver" involves the choice of comparative maneuver(s) as well as the particular one under prime contherefore,
situations,
tecture
of
the
sideration.
The
scientific
logic
used in
choosing
careful consideration of the relativity, con-
procedure, because this flexibility provides the best
gard the administered spectrum of the drug or operation as a single form of treatment that has been given in optimal
drug
we would
modifications
or re-
optimal
to test the at
dosages
and we would
different
comparative maneuvers is a crucial feature in the design of research, and requires
choose a dosage for the drug and allow
"potency"; therefore, flexible
the
of
for the comparative
tors.
or
results
operative procedures,
the surgeons
operation, the general surgical procedure
may
the
we would combine
stituents,
and environment
of the compari-
son. a.
relativity.
The comparative maneu-
ver must be specifically related to the ques-
be answered in the research. If the question deals with efficacy of the experimental maneuver, the appropriate comparison should contain essentially no maneuver or a placebo, but if the question deals with efficiency, the comparative procedure tion to
should contain a deliberately comparable maneuver. For example, if we want to know whether Excellitol is capable of relieving headache, its efficacy is under question, and it should be compared against a placebo; but if we want to know whether Excellitol works better than existing antiheadache preparations, its efficiency should be compared against an established agent, such as aspirin.

b. Constituents. The internal surroundings of the main maneuver represent its "constituents." For a surgical procedure, the constituents include such features as the associated anesthesia; for a pharmaceutical agent, the constituents include the medium in which the active ingredient is conveyed; for a maneuver such as cigarette smoking, the constituents include the paper in which the tobacco is wrapped.

Since the scientific purpose of comparison is to expose the compared groups to maneuvers that are identical in every way except for the "active ingredient," the constituents of the main maneuver must be considered when its comparative maneuver is planned. For example, a sham surgical procedure is not adequately comparable to the real surgical procedure if the sham is performed with a different anesthetic agent; if Excellitol is dissolved in a special solution before being administered intravenously, the placebo should also be dissolved in that same solution.

c. Environment. The external surroundings of the main maneuver represent its "environment." In a maneuver of clinical therapy, the environment would include: the home, hospital, or other setting in which the treatment is given; the ancillary treatment that may be employed in addition to the main maneuver; the personal efforts necessary for the patient to adhere to the treatment; and the frequency and intensity of the interchange between the doctor and patient. Thus, even if the comparative maneuver has a composition essentially identical to that of the main maneuver, the two maneuvers are not truly comparable unless they are administered in similar environmental surroundings.

For example, one of the main roles of placebo treatment is to ensure that both patient and physician are exposed to the personal meetings, therapeutic ritual, and expectations that would not occur if the comparative maneuver were simply "no treatment" rather than a placebo. An example of inappropriate comparisons that ignored the environmental differences of ancillary therapy occurred in surveys of anticoagulant treatment for myocardial infarction.⁸ The patients who received "no treatment" in the pre-anticoagulant era were usually kept at prolonged bed rest, whereas the patients who received anticoagulants also often received chair rest, early ambulation, and elastic stockings.

A more subtle problem in the analysis of environment occurs when an extraordinary personal effort is necessary for a patient to adhere to the assigned maneuver. Thus, a patient's maintenance of oral medication ordinarily creates no psychic difficulties or discomfort except for his remembering to take the medication, but adherence to an unappetizing or displeasing diet may require a heroic resolution that is seldom found in most people. When such heroic efforts are required to comply with a maneuver, the persons who are willing or able to make the special effort may have other distinct characteristics that will make their state different from the state of people who cannot adhere to the prescribed maneuver, but the different postmaneuver results in the adherers and nonadherers may then be fallaciously ascribed to the maneuver. For example, suppose it is true that life is prolonged by such "healthful" patterns of living as regular amounts of sleep and exercise, and suppose that people who are compulsive enough to adhere to an unappealing diet will also assiduously maintain "healthful" patterns of living. Now suppose we want to test whether continued use of a low-sulfate diet, although extremely distasteful, can lower mortality
rates. The results of the research may show that people who maintain this diet live longer than those who do not adhere to the diet, but what may really have happened is that the compulsively "healthful" people, who would live longer anyhow, were the only persons capable of adhering to the diet over long periods of time. To avoid this problem, a strictly comparable maneuver would involve the use of an equally distasteful diet that is not low in sulfate.

3. Multiplicity. The next issue to be considered is the number of maneuvers that will be contrasted. In most research projects, the investigator wants to compare two maneuvers (a presumably active agent versus a presumably inert one, or a new agent versus an old one), but in certain types of research, more than two maneuvers are examined. Such situations occur when the investigator seeks to find a "best" or "worst" among several agents; when the experiment is performed to explain the action of a particular maneuver; or when the research project is particularly difficult to execute, so that the investigator wants it to answer as many questions as possible.

An example of the search for a best agent would be the comparison of Excellitol, aspirin, buffered aspirin, and placebo in the relief of headache; another such example would be a test of Excellitol given at five different dose levels to determine which dose produces the best results. An example in which multiple maneuvers are employed to explain a mechanism of action would be the comparison of various ingredients of Ringer's solution, given alone or in diverse combinations, to determine which of the ingredients is necessary to produce an effect noted after administration of the intact composite solution. In a large-scale cooperative study of the treatment of a chronic disease, involving many years of observation for hundreds of patients at different hospitals, the investigators might want to test several different diets and drug regimens in order to gain as much information as possible from the efforts needed to carry out the execution of the long, complicated project.

The use of multiple maneuvers creates subtle problems in the scientific logic used for assessing the results of the comparison. If we are interested in only one subsequent target, the compared maneuvers are relatively easy to assess, no matter how many are used. This ease is due to the simplicity of ratings when only a single target is considered. Without any subtle decisions, the effect of each maneuver on the single target can be ranked as highest down to lowest, or "best" through "worst."

When more than one target is under scrutiny, however, the analysis of results becomes complicated by the different ratings that each maneuver can achieve for its individual effect on the different targets. Suppose that the primary target of our maneuver is to relieve headache, but we also intend to assess such ancillary targets as speed of relief, cost of drug, palatability of drug, and adverse side effects. If we are comparing only two drugs (Excellitol versus aspirin), we might find that Excellitol acts somewhat more rapidly and tastes somewhat better than aspirin, but costs ten times as much and has many more adverse side effects. Since neither science nor statistics provides a method for deciding which of these results is most desirable or how they counterbalance one another, we must make the decision as an act of judgment or clinical "common sense." Because only two maneuvers are involved, the judgment is not particularly difficult. We can readily compare and "weigh" the effects of the two maneuvers on each of the multiple targets, and, in this case, we might conclude that aspirin is the better agent because the increased costs and side effects of Excellitol outweigh its slight advantages in other features.

On the other hand, if the research involved a simultaneous comparison of five different antiheadache preparations rather than two, we would have much greater difficulty in making a similar judgment because each agent might rank differently for each of the four targets under consideration.
In trying to make decisions that involve multiple concomitant "weighings," the standard judgmental procedures we use for "common sense" are confronted by a complexity seldom encountered in our ordinary acts of thinking, where we usually restrict our simultaneous comparisons to two alternatives rather than a multitude. For this reason, multiple maneuvers can be used most advantageously only when a single target is to be assessed. In large-scale clinical surveys and trials of therapy, where many different effects and side effects must be evaluated in the target state, the use of multiple maneuvers may often be unavoidable, but will greatly increase the difficulty and complexity of the subsequent analysis of data.

Another problem in multiple maneuvers arises because the comparisons may be either equivalent or additive. Equivalent maneuvers are expected to act in essentially the same way or to produce the same type of effect, and can be used as replacements for one another. Thus, in the previous example of a four-way test of Excellitol, aspirin, buffered aspirin, and placebo in the relief of headache, each of the four maneuvers might be regarded as equivalent, and the "winner" of the test would then be preferred instead of the others.

On the other hand, additive maneuvers contain agents that are expected to act differently or to produce different results, so that the maneuvers may be combined. Thus, if we want to compare Excellitol versus whisky in the relief of headache, we might assume that these two agents act differently, and we might also want to assess the effects of a combination of Excellitol and whisky. When additive maneuvers are employed, the choice of appropriate comparisons is made difficult by the need to establish a suitable comparison for each of the different maneuvers, alone or in combination. For example, in the Excellitol-whisky test that was just described, each of the following maneuvers would require consideration if all possible comparisons were to be checked:

    Excellitol alone
    Whisky alone
    Excellitol placebo only
    Whisky placebo only
    Excellitol and whisky
    Excellitol and whisky placebo
    Excellitol placebo and whisky
    Excellitol placebo and whisky placebo
    No treatment

Since the above array contains too many maneuvers for the practical realities of clinical research, the number of comparisons must be sharply reduced, and, since no rules of science or statistics are available for the reducing procedure, another act of logic is needed to choose the maneuvers that are most cogent for the specific question of interest to the investigators. If the objective is to determine which of the two agents or both is the most effective, the comparison in the above example might be reduced to:

    Excellitol and whisky
    Excellitol and whisky placebo
    Excellitol placebo and whisky

4. Concurrency. The last issue to be considered here is the concurrency of the compared maneuvers. In most experimental situations, the maneuvers are tested concurrently in people who receive the compared maneuvers during essentially the same period of time. In certain circumstances, however, the maneuvers may be administered serially. The serial situations occur in "crossover" therapeutic trials, where the same person is successively exposed to different maneuvers, or in "conditional" experiments where maneuver B will follow maneuver A provided that maneuver A has elicited a certain result. An example of a "conditional" situation is the use of preoperative radiotherapy for certain cancers: the surgery is performed afterward only if the patient has remained "operable" at the conclusion of the radiotherapy. The most common type of clinical research involving nonconcurrent maneuvers is in the
"dystemporal" surveys of treatment that compare the results of a new agent with the results obtained many years earlier in a group of patients treated with older methods. For the scientific validity of comparison, all of these nonconcurrent maneuvers create hazards that arise not from the maneuvers, but from differences in the initial population or in the ancillary maneuvers employed at the different times when the compared maneuvers were actually administered.

a. Serial maneuvers. A "crossover" trial of therapy is based on the assumption that the patient has the same initial state each time he is exposed to the next maneuver. This assumption is seldom valid in clinical reality because of "carry-over" effects created in the target by the agents used in the previous maneuver, or because the patient's clinical state may have been changed by the preceding treatment or by the evolving course of his disease. For this reason, the various Latin, Graeco-Latin, or other "squares" that are so popular in statistical discussions of "experimental design" are frequently not applicable in clinical research. A "crossover" trial can be valid only when a long enough "wash-out" period elapses between the successive therapeutic agents, and when the patient returns to the same initial clinical state after each treatment is completed. Examples of possibly suitable populations for crossover trials are essentially healthy people who are persistently asymptomatic, or who have a predictively recurrent symptom, such as dysmenorrhea.

b. Conditional maneuvers. In studies of conditional maneuvers, major bias may be created by the demand that the patient attain a particular state before the second maneuver can be given. Because of this bias, the combination of two maneuvers may be given mainly to patients whose prognostic anticipations are significantly better or worse than the patients who receive only one of the two. For example, in surveys of the treatment of lung cancer, the maneuver of surgery alone may be compared against the two maneuvers of surgery followed by radiotherapy. This comparison is obviously not valid because radiotherapy is usually ordered after surgery only when the operation disclosed intrathoracic metastases. The patients without metastases are seldom given postoperative irradiation, but also have better prognoses than those who receive it. A converse example is provided by trials of preoperative irradiation for lung cancer. The patients who can go through the time required for the radiotherapy, while still keeping their tumors anatomically operable, may have slower growing cancers, and hence better prognoses, than the patients with diverse rates of growth who receive surgery immediately. A more suitable comparison for preoperative radiotherapy might be a group of patients who remain "operable" and receive surgery after a time delay equal to that consumed by radiotherapy.

c. Dystemporal maneuvers. When a treatment used during one period of time is compared, in a therapeutic survey, with another treatment used at a later era, the dystemporal comparison may create different types of bias, unless the two populations exposed to the two maneuvers are carefully checked for prognostic stratification and environmental effects. Because the passage of time may bring improvements both in diagnosis and in ancillary aspects of therapy, these temporal improvements, rather than the differences in the main treatment, may be responsible for the better results found in the group treated at the later date. In comparison to the earlier population, the later group may contain milder cases of the disease (with better prognoses) and may also be exposed to better methods of ancillary treatments that were disregarded in the environmental assessment. Examples of this type of dystemporal comparison were cited earlier during the discussion of therapeutic surveys of anticoagulants in myocardial infarction.⁸
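
The hazard of a dystemporal comparison can be shown with a small calculation. The sketch below is an illustration only: the counts, the two prognostic strata ("mild" and "severe"), and the function names `crude_survival` and `stratified_survival` are all hypothetical and are not taken from any survey discussed here. The counts are arranged so that the newer treatment has no effect at all within either stratum, while the later population simply contains more mild cases.

```python
# All counts are hypothetical, chosen only to illustrate the argument.
# Each cohort maps a prognostic stratum to (number of patients, number of survivors).
earlier_era = {"mild": (100, 90), "severe": (300, 150)}  # older treatment
later_era = {"mild": (300, 270), "severe": (100, 50)}    # newer treatment


def crude_survival(cohort):
    """Survival rate for the whole cohort, ignoring prognostic strata."""
    patients = sum(n for n, _ in cohort.values())
    survivors = sum(s for _, s in cohort.values())
    return survivors / patients


def stratified_survival(cohort):
    """Survival rate within each prognostic stratum."""
    return {stratum: s / n for stratum, (n, s) in cohort.items()}


crude_earlier = crude_survival(earlier_era)
crude_later = crude_survival(later_era)
strata_earlier = stratified_survival(earlier_era)
strata_later = stratified_survival(later_era)

print(f"crude survival: earlier {crude_earlier:.2f}, later {crude_later:.2f}")
for stratum in strata_earlier:
    print(
        f"{stratum}: earlier {strata_earlier[stratum]:.2f}, "
        f"later {strata_later[stratum]:.2f}"
    )
```

With these assumed counts the crude survival rate is higher in the later era, yet the rates within each stratum are identical in the two eras. The apparent improvement comes entirely from the changed mixture of mild and severe cases, which is the kind of difference that a check of prognostic stratification is meant to expose.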
•
•
In his introduction to the Design of Experiments, Sir Ronald Fisher⁶ makes the following remarks:

    This type of criticism is usually made by what I might call a heavyweight authority (who has) prolonged experience, or at least the long possession of a scientific reputation. . . . Technical details are seldom in evidence. . . . The authoritative assertion, "His controls are totally inadequate" must have temporarily discredited many a promising line of work. . . . Such an authoritarian method of judgment must surely continue, human nature being what it is, so long as theoretical notions of the principles of experimental design are lacking.

Fisher's scorn for imprecise standards of scientific criticism was unfortunately not followed by the development of technical details about the logical structure of scientific research. Although stating that "statistical procedure and experimental design are only two different aspects of the same whole," Fisher created a magnificent set of strategies for mathematical analysis of experiments, but neither he nor his statistical successors established a rigorous set of principles to be used for analyzing the logic of the design.

The procedures just described for delineating the objective of a clinical research project require a thorough knowledge of clinical science rather than theoretical statistics, and the cited distinctions require much more attention than any of the subsequent operations in planning the architecture of a clinical investigation. The initial state of the population must be diagnostically demarcated and prognostically stratified; the targets of the subsequent state must be identified, differentiated, and correlated with the prognostic stratification; and the maneuver(s) must be properly chosen for potency, comparison, multiplicity, and concurrency. After these principles of scientific logic and precision have been fulfilled in specifying the objective of the research, the rest of the design can continue.

The next two papers of this series will be concerned with the nine remaining operational principles that are used to complete the basic architecture of a clinical investigation.

References

1. Beveridge, W. I. B.: The art of scientific investigation, New York, 1950, Random House, Inc.
2. Cox, K. R.: Planning clinical experiments, Springfield, Ill., 1968, Charles C Thomas, Publisher.
3. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.
4. Feinstein, A. R.: Clinical epidemiology. III. The clinical design of statistics in therapy, Ann. Intern. Med. 69:1287-1312, 1968.
5. Feinstein, A. R.: Clinical biostatistics. II. Statistics versus science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.
6. Fisher, R. A.: Design of experiments, ed. 8, Edinburgh, 1966, Oliver & Boyd, Ltd.
7. Freedman, P.: The principles of scientific research, London, 1949, MacDonald & Co. (Publishers), Ltd.
8. Gifford, R. H., and Feinstein, A. R.: A critique of methodology in studies of anticoagulant therapy for acute myocardial infarction, New Eng. J. Med. 280:351-357, 1969.
9. Hamilton, M.: Lectures on the methodology of clinical research, Edinburgh, 1961, E & S Livingstone, Ltd.
10. Handy, R.: Methodology of the behavioral sciences, Springfield, Ill., 1964, Charles C Thomas, Publisher.
11. Jevons, W. S.: The principles of science, London, 1900, The Macmillan Company.
12. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
13. Northrop, F. S. C.: The logic of the sciences and the humanities, New York, 1959, Meridian Books, Inc.
14. Plutchik, R.: Foundations of experimental research, New York, 1968, Harper & Row, Publishers.
15. Schnur, S.: Mortality and other studies questioning evidence for and value of routine anticoagulant therapy in acute myocardial infarction, Circulation 7:855-868, 1953.
16. Stone, K.: Evidence in science, Bristol, 1966, John Wright & Sons, Ltd.
17. Tillman, C.: Acute myocardial infarction: Ten-year study of consecutive cases managed and evaluated by same physician, Arch. Intern. Med. 111:77-82, 1963.
18. Wilson, E. B.: An introduction to scientific research, New York, 1952, McGraw-Hill Book Company, Inc.
19. Witts, L. J., editor: Medical surveys and clinical trials, London, 1964, Oxford University Press.
20. Wolf, A.: Essentials of scientific method, London, 1925, The Macmillan Company.
CHAPTER 4

Intake, maintenance, and identification

This chapter originally appeared as "Clinical biostatistics. IV. The architecture of clinical research (continued)." In Clin. Pharmacol. Ther. 11:432, 1970.

In the biostatistical architecture of clinical research, the first operational principle is to specify the components and choose the logic of the objective of the research. As discussed in the previous paper of this series,¹¹ the components consist of a sequence of initial state, maneuver, and subsequent state. The logic consists of suitable scientific judgment in the decisions made to demarcate the diagnostic and prognostic conditions of the initial state of the population; to identify, differentiate, and prognostically correlate the diverse targets of the subsequent state; and to choose maneuvers that are satisfactory in potency, comparison, multiplicity, and concurrency.

After the objective of a research project has been specified, its implementation is planned in nine subsequent architectural operations whose principles are outlined here and in the next paper of this series.

[Figure: diagram of INITIAL STATE, MANEUVER, and SUBSEQUENT STATE, with "Intake" marked.]

2. Intake

Before a group of people can be studied in a clinical investigation, each person must undergo a series of transfers that bring him from his home to his position as a unit in the research. The diverse decisions that determine the transfers can greatly alter the population available to form the "initial state." Consequently, the second operation of biostatistical architecture is to delineate the role of these decisions in providing the populational "intake" for the research. Because the people under investigation make so many of the crucial choices, the word intake seems more appropriate for this procedure than the conventional term, sampling, which does not connote the important effects of self-selection in the populational "samples" studied in clinical research.

The judgments that govern these transfers are the product of decisions made by the people who are studied, by various sources of referral, and by the investigators or their professional colleagues.

A. Personal decisions. The individual members of a population under investigation must make several types of personal decision before becoming units in the research.

1. Iatrotropic stimuli. The iatrotropic stimulus has been defined as the reason
that a patient seeks medical attention.⁵,⁶ This attention must be solicited as the first step in the process that ultimately brings the patient into a research project conducted at a medical setting. The stimulus can arise from symptoms of a disease or can come from such features as anxiety over death of a friend, a pre-employment physical examination, or the incentives created by public health campaigns urging various types of "checkups." Because of differences in iatrotropic stimuli, a group of patients with the "same" disease may be found at different stages of the disease, and may therefore have different prognostic anticipations.⁵

In certain types of research projects, the motivation for the subjects' entry is not the desire to visit a doctor for specific medical attention. Instead, the subjects are solicited directly by the investigator via a mailed questionnaire or a public call for volunteers. The populational response elicited by a mailed questionnaire will be greatly affected by the recipient's attitudes toward the topic, the kinds of questions that are asked, and the difficulty or inconvenience encountered in answering the questions.³ The composition of a group of "volunteers" may strongly depend on the monetary or other incentives offered by the investigator or on the psychic state of the potential "volunteers."¹⁷

2. Iatric attractions. After a person decides to become a patient, his choice of a doctor or hospital will be affected by his reactions to such features as reputation, geographic location, technologic facilities, cost, and ethnic or religious considerations. A particular doctor or hospital may therefore receive a collection of patients with the same "disease" seen by other doctors and hospitals, but the prognostic severity and other aspects of the population may be strikingly different.

3. Acceptance of proposals. After reaching medical attention, the patient may accept or reject the various referrals, diagnostic "work-ups," or therapeutic procedures that are offered to him. If he rejects one of these offers, he may receive an alternative proposal, and he must then decide about accepting the alternative.

None of these personal decisions is made randomly, and all of their effects must be carefully considered when the results of a research project are analyzed. The effects may greatly alter the initial state of the population either prognostically, confounding the assessment of the maneuvers, or diagnostically, impeding attempts to extrapolate the results or to employ statistical concepts based on "random sampling."

B. Referral decisions. The investigator is seldom the first physician to see the patients whom he studies, and the population available for research at a medical setting will have been altered by many antecedent decisions made by the investigator's medical colleagues.

1. Interiatric referrals. Just as a patient can have diverse reasons for his original choice of a medical setting, doctors or hospitals can have varied patterns of practice that lead to transfer of the patient from one doctor to another, from one hospital to another, or from one service to another within the same hospital. In addition to these interiatric referrals, another aspect of iatric decision depends on the willingness of an attending physician to "release" his patient for entry into a therapeutic trial in which treatment for a particular disease is to be assigned by randomization. If the attending physicians at an individual medical setting are unenthusiastic about one or more of the therapeutic agents in the trial, the physicians may not allow certain types of patients to be subjected to the randomization. The collection of patients "referred" for the therapeutic trial may thus become a distorted version of the general spectrum of the disease seen in regular clinical practice.

2. Retrieval of medical records. When a survey is performed by reviewing the records of patients at a medical center, the investigator must beware of bias that can enter during the retrieval of the
records from their storage locations.¹⁰ For example, an investigator may not receive all the records he requests from the medical record librarian because some other investigator may have sequestered some of the records for a research project of his own; and the absence of those records may create a significant bias in the population available for the first investigator's scrutiny.

Another kind of bias may arise if the investigator uses the medical record library as his only source of diagnostic retrieval. The diagnoses that are made in specialized out-patient settings (such as the emergency room, the radiotherapy and chemotherapy clinics, or even the general medical and surgical clinics) may not be recorded in the diagnostic files maintained in the library; and, in many hospitals, information about out-patients is not included among the in-patient records.¹⁰ By confining his population to the in-patient group available from the files of a medical record library, the investigator may thus limit his spectrum of a disease to the more severe instances in which the patients required hospitalization, ignoring the many other patients who were treated in ambulatory settings by doctors in other parts of the hospital or in the community. In both of the instances just cited (sequestration of records and limitations of source), the investigator may receive a distorted picture of the patients associated with the therapeutic maneuvers in a survey of treatment for a particular disease.

C. Eligibility decisions. After the basic populational intake has been brought to the site of an investigation, a series of eligibility decisions will reduce the population to the group of people who will constitute the initial state. These decisions (which may be made by the investigator, by his colleagues, or by both) involve the choice of diagnostic criteria, co-morbidity criteria, and pretherapeutic criteria⁸,¹⁴ that determine a patient's eligibility for a particular agent in the activity of ordinary clinical management or of research.

1. Diagnostic criteria. A population may be greatly altered by the criteria used for diagnosis of "health" or "disease." For example, if protracted chest pain is demanded as a criterion for "acute myocardial infarction," the population will exclude patients who had only milder forms of thoracic pain or more severely ill patients whose infarction was manifested only by sudden onset of arrhythmias or congestive heart failure.

Outside of a medical setting, the population may have reached the investigator directly by "volunteering" for a project or by answering a questionnaire. Nevertheless, the investigator must still decide which questionnaires or volunteers will be used for the research. He may reject questionnaires that have been smudged or filled out improperly, or he may exclude volunteers who seem unattractive in appearance or in personality. He must then be concerned about the way that these selections have altered the population that becomes the group of research subjects.

2. Co-morbidity criteria. The presence of major associated diseases may create problems in establishing diagnosis for a particular disease¹⁴ and in choosing or excluding patients for therapy. If too many co-morbid patients are excluded from an investigation, the resultant "pure" population with the main disease may not represent the true state of that disease in clinical reality. In a study of therapy, co-morbid diseases may have major prognostic effects that create unrecognized sources of bias in the post-therapeutic results.

3.
Pretherapeutic
criteria.
co-morbidity,
other
peutic criteria
may be used
kinds
In addition to of
prethera-
to include or
exclude patients from an investigation of therapy.
These
criteria
may depend on
such features as age, economic and geographic status, and severity or extent of the main disease under treatment.
For example, in order to be regarded as having "operable cancer," a patient must have sintable diagnostic evidence of the
Intake, maintenance,
he must be free of associated comorbid ailments that might impair his tolerance of the operative procedure or his subsequent survival; and the cancer must be appropriately localized in an anatomic disease;
position that permits resection. Thus, the
two groups
of patients with can vary greatly acthe criteria chosen to define the
collection of
cancer"
"operable
cording to
concepts of "suitable" diagnostic evidence,
by
its
and
41
identification
selection through replies to a mailed
questionnaire rather than through personal interviews.
The neglect of populational bias caused by these transfer decisions is one of the main sources of nonreproducible data and defective scientific logic in contemporary biostatistics. 3.
Maintenance
"impairing" co-morbid ailments, and "appropriate" anatomic localization. Moreover,
may be
a patient with "inoperable" cancer
because he refused a proposed operative resection, or because the physician decided that the tumor was unresectable, or because death occurred before any of these decisions could be made. "inoperable"
Since
these
nostic
kinds
three
of
"inoperable"
have major differences
patients
anticipation, 4
in
prog-
cannot be
they
re-
garded as comparable in their initial state. An additional problem in pretherapeutic criteria is
—particularly
in long-term trials
SUBSEQUENT
MANEUVER,»
-INITIAL
© STATE
STATE
J3J
Maintenance
The next
step
in
planning a research
about the methods be used to maintain the population long enough for the subsequent state to occur and to be observed. When the project
is
to
decide
that will
the investigator's frequent decision to
objective of a research project is specified, the investigator creates the concepts that
admission to "compliant" people particularly likely to maintain
determine whether the project is worth doing; when the maintenance of the popu-
restrict
who appear the
assigned maneuver and to
continue
under investigation. The absence of the "noncompliant" candidates, however, may create a serious bias in the spectrum of the observed disease.
As a
result of these diverse decisions
patients
and doctors about
ceptance of proposals, gibility,
and other
by
iatrotropy, ac-
iatric referrals, eli-
mechanisms, apparently have
transfer
two groups of people who the same "initial state" may differ in many characteristics that can greatly affect the outcome in the subsequent state. Major differences in people with an apparently similar initial state
medical setting
is
may
thus arise
urban or
if
the
rural, large or
small, American or British. Within the same academic hospital center, great diversities
may
exist in populations
chosen
from private or ward services, surgical or medical services, and "university" or "VA" services. A population that is not at a medical setting
may be
significantly altered
lation
is
the
contemplated,
investigator
determines whether the project
and can be carried
out.
The
is
feasible
ability
to
maintain the population is particularly important in research dealing with the cause, course, or treatment of chronic disease, since a long time may elapse between the initial
and subsequent
states.
No
matter
how well the rest of the research project the is planned, the plans may be wasted if population cannot be adequately preserved and examined.
Many different issues in maintenance* must be considered: duration and frequency of examinations; opportunity for detection of target; adherence to maneuvers; and preservation of protocol, data, and investigators.

*The complex issues of ethics in clinical research are beyond the scope of this outline.

A. Duration and frequency of examinations. Members of an investigated population must be encouraged to remain in the project for a duration suitable for the conditions and maneuvers under investigation. For example, a much longer duration of observation will usually be required when the maneuver provides treatment for a cancer than for the common cold. The encouragement to a patient's continued participation, particularly in a long-term study, will depend on many aspects of the interchange among patient, doctor, and investigative setting. For example, to achieve persistent attendance of patients at a clinic, the investigator may need to make many special arrangements for such features as efficiency of staff, location and timing of clinics, and transportation of patients.

The frequency of examination will also depend on the circumstances of the investigation. Routine periodic examinations may be needed to obtain evidence about phenomena that would be missed if not sought often enough. For example, if one of the main targets in the subsequent state is the occurrence of a streptococcal infection, the patient may need to be examined at monthly or bimonthly intervals for the performance of suitable tests. On the other hand, monthly observations would not be needed to determine the occurrence of death.

Other important reasons for periodic examinations are the regulation of medications (when appropriate), the preservation of the patient's interest and morale, and the maintenance of communication to learn about incidental events and changes in the patient's clinical or geographic status.

In experiments conducted by double-blind techniques (in which neither the patient nor the examining physician knows the identity of the maneuver), a "fail-safe" procedure must be established for the potential occurrence of disastrous clinical events. The procedure should include suitable arrangements for learning the identity of the maneuver in emergency situations, for discontinuing or modifying maneuvers that have led to major adversities, and for instituting appropriate types of ancillary or replacement therapy.

B. Opportunity for detection of target. If the "target" in the subsequent state is the development of an event that was not present in the initial state, the compared populations should have equal opportunities for that target to be detected when it occurs. The opportunities will be equal if all the people exposed to the different maneuvers are also exposed to the same thoroughness and frequency of subsequent examinations, but inequalities will commonly be produced by various characteristics of the initial state, the maneuver, or the target. The subsequent comparison of data may then be fallacious because different conditions of observation made the target more likely to be detected in one group of people than in another.

One obvious source of such disparities is the population's access to medical examination. People who can see doctors easily and conveniently may appear to have higher rates of minor or even major diseases than people for whom medical supervision is less prevalent. Once a treatment has been suspected of causing adverse side effects, the occurrence rate of those effects may increase spuriously because they are sought with increased attention in the people receiving the treatment. The patients themselves may become fearful, so that they solicit medical attention more often and for lesser reasons than usual, or the doctors may become particularly diligent about detecting the side effects. For example, the high rate of phlebitis found in nurses (with or without "the pill") may merely reflect their propensity both to note the condition and to seek medical attention for it.

A second cause of difficulty arises when the maneuver itself requires techniques of administration or observation that differ from the procedures used in maintaining another maneuver. For example, patients receiving long-term anticoagulants must receive periodic examinations for the dosage of anticoagulants to be regulated,
and at these examinations, the physician also has the opportunity to detect minor intercurrent illnesses. The subsequent rate of such illnesses may then be falsely higher in the anticoagulant group than in a "placebo" or untreated group who did not receive equally frequent examinations.

Another source of disparity is the population's access to medical technology. If certain technologic procedures (such as electrocardiography, roentgenography, and laboratory tests) are consistently more available for one population than for another, the two populations will have different rates of identification for diseases whose diagnoses require these technologic procedures for confirmation or exclusion. For example, facilities for prompt, inexpensive identification of streptococcal infections have recently been made available to practitioners by the health departments of several states. The rates of streptococcal infection in those states have subsequently soared. Similarly, a major "epidemic" of lupus erythematosus has occurred in the past two decades after the introduction of effective laboratory tests for identifying the disease.

The last source of disparity to be noted here is the performance of routine "screening" examinations in the two populations. The rates of disease in a certain region of the body will inevitably differ for two populations if one population has that region examined only when appropriate symptoms appear, whereas the other population receives routine periodic examinations regardless of symptoms.

The neglect of these problems has been a major source of intellectual blunders in contemporary biostatistics, and has greatly impaired the validity of many epidemiologic surveys of the occurrence rates and therapy of disease. Berkson2 has demonstrated the statistical fallacies of studying the rates of concomitant diseases in hospital populations, but egregious statistical fallacies are just as likely to occur for single diseases in non-hospital populations with unequal opportunities for detection of the target. Some of the fallacies can be illustrated as follows:

1. If women receiving contraceptive pills are examined more closely or more often than women who use mechanical types of contraception, the "pill" group may have a rate of subsequent thrombophlebitis or cervical cancer that is spuriously higher than the rate found in the "controls," whose medical state was not checked with equally careful attention.

2. The executive personnel in an industrial population may be found to have a higher rate of coronary artery disease than the laborers, but the difference may not be due to any characteristics inherent in the executives. They may have been subjected to routine cardiac "screening" examinations that enabled all of their symptomatic and asymptomatic coronary disease to be diagnosed before death, whereas much of asymptomatic coronary disease in the "unscreened" laborers may not have been detected during life.

3. In comparison to people who have not had chronic constipation, people with this symptom are more likely both to have used suppositories and to have received barium enema examinations. If an epidemiologic statistician discovers that suppository users have a higher rate of colonic polyps than non-suppository users, he may then regard the suppositories as a probable cause for the polyps, with no attention given to the unequal exposure of the two populations to the barium enemas needed for detecting the polyps.

4. A different variant of this problem occurs in epidemiologic research in which the subsequent state is described by entries on a death certificate. If the development of disease X is under survey, investigators will often check the patient's premortem data to eliminate false positive diagnoses of disease X, but the false negative side of the diagnostic issue is almost never explored to see whether disease X might have occurred without being listed on the death certificate.7 In this way, many epidemiologic studies based on death certificate diagnoses of arteriosclerosis or cancer provide inaccurate results that differ significantly from the data found by necropsy or by thorough methods of premortem examination.

5. To illustrate misleading therapeutic results produced by a maneuver's distortion of equal opportunities for target detection, consider the following example. Suppose we want to test the hypothesis that large daily doses of aspirin will be useful prophylaxis for preventing streptococcal infections, and suppose our method of detecting the infections is to perform a throat culture whenever a febrile illness occurs. In this experimental situation, with these techniques of target detection, the "treated" group, which has its fever suppressed by the aspirin, will have a much lower
rate of streptococcal detection than the untreated "controls," and we might then falsely conclude that aspirin is an effective antistreptococcal agent.

6. A final example of "detectional bias" occurs in a situation, quite different from those just cited, in which detection of the target creates an additional clinical hazard for one populational group but not for the other. Suppose we are testing medical therapy versus vascular-implant surgery for coronary artery disease. Both groups of patients receive coronary arteriography to delineate their initial state, but the surgical group later receives repeat arteriograms to test the patency of the coronary vasculature after the operation. The morbidity and mortality rates associated with the second arteriographic procedure for the surgical group create an additional investigative hazard that is not present in the medically treated patients. Either the experimental plan or the subsequent analysis would require suitable modifications to avoid the bias thus imposed, after the main experimental maneuver, on the patients treated surgically.

C. Adherence to maneuvers. When an oral drug, diet, or other activity (such as exercise) is to be maintained by the patient at home, away from the site of the investigation, the patient must be encouraged to adhere to the assigned maneuver. Moreover, regardless of whether the maneuver was assigned by the investigator or chosen by the patient, the assessment of the patient's adherence to the maneuver is a critical feature of the subsequent evaluation, and depends on suitable information about adherence, obtained by means of direct questions to the patient or with more objective procedures. The evaluation of such data will be discussed at a later stage of the architecture.

D. Preservation of protocol, data, and investigators. When a long period elapses between the maneuver and the subsequent state, or when many examinations, patients, and investigators are involved in a research project, the principal investigator must establish satisfactory methods for ascertaining that the research protocol is properly carried out by the diverse participants, for coordinating the collection and analysis of data, and for suitable replacement of collaborating investigators who have discontinued participation in the project or who participate ineffectually. These issues have been thoroughly discussed by Mainland18 and will not be further considered here.

4. Initial identification

[Figure: INITIAL STATE, SUBSEQUENT STATE; stage (4), Initial Identification]

Identification is a core problem in scientific research. Unless the entities under investigation have been suitably identified, the work cannot be reproduced; the data cannot be assessed; the results cannot be validated. In clinical research, the problems of identification are much greater than in any other form of investigation, because the clinician must cope with many human attributes that are not encountered in animals, animate fragments, or inanimate substances.5 These human attributes are not only different from the isolated variables studied in nonclinical research; they are also much more abundant and they must be assessed before, during, and after the experimental maneuver. Despite these distinctions in identification, most traditional discussions of "experimental design" contain little or no attention to the problems of choosing, observing, and classifying the basic evidence.

The identification stage of research architecture is intended to provide implementation for the ideas expressed in the objective of the project. All of the general concepts that described the objective of the research, and the intake and maintenance of the population, must now be expressed operationally in terms of the observed evidence. For example, the diagnostic and prognostic requirements of the population may have been stated in such categorical phrases as "healthy," "anemic," "angina pectoris," or "good risk," but none of these phrases indicates the actual items
of evidence that will be examined and interpreted.

A. Types of evidence. Each property (or "variable") to be used in an investigation depends on the observation of certain basic evidence. Thus, the presence of anemia may be assessed from measurements of blood hemoglobin or hematocrit; the presence of angina pectoris may be noted from history taking; and anatomic metastases of a cancer may be sought with clinical examination, roentgenography, endoscopy, cytology, biopsy, surgical exploration, or necropsy. The characteristics of the basic evidence in an investigation must be thoughtfully considered, because the quality of the raw data can not only affect the elemental facts but also distort the variables chosen to represent the elemental facts.

The raw data obtained in a clinical investigation can be verbal or numerical. Thus, a particular person can be described verbally as a red-haired American laborer with chest pain, and numerically as 68 inches tall, and the father of 4 children, with a serum cholesterol of 260 mg. per cent. Regardless of whether the basic evidence is verbal or numerical, the term "hard" is often applied to data (for such variables as age, sex, weight, serum cholesterol, and death) whose observation and interpretation require few or no subjective judgments. The term "soft" is often applied to data (such as statements about the existence of angina pectoris, severity of dyspnea, or ability to work) that require many subjective activities in observation and interpretation. Since hard data are obviously more reliable than soft data, the clinico-statistical collaborators who design a research project will usually prefer, whenever possible, to work with hard data as the main source of evidence.

B. Types of variables. The raw data of an investigation can be preserved intact or converted into the values for four different types of variables. A metric variable is expressed in dimensionally ranked values that have measurably equal intervals between adjacent ranks. Examples of such values are 0, 1, 2, 3 . . . for number of children, and . . . 67, 68, 69, 70 . . . for inches of height.

The other three types of variables are expressed with categorical rather than dimensional values. An ordinal variable has semiquantitative values that can be ranked in a graded order, but the intervals between any two adjacent ranks are not measurably equal. Examples of such values are 0, 1+, 2+, 3+, and 4+ for briskness of reflexes, and none, mild, moderate, and severe for severity of dyspnea. A nominal variable has values that cannot be ranked in a graded order. Examples of such values are red for color of hair, American for nationality, and laborer for occupation. For an existential variable, the scale of values consists of present and absent, or yes and no. Examples of existential variables are presence of chest pain and survival for at least 6 months. The values for existential variables can sometimes be semiordered in a scale such as definitely absent, probably absent, uncertain, probably present, and definitely present.

The procedures used for converting raw data into values for these different types of variables have been described elsewhere.12

C. Selection of indexes. From the diverse data collected in an investigation, certain variables or combinations of variables may be used as indexes to delineate properties particularly important in the research. These are the properties used for any of the graphs, tabulations, or other decisions in the "admission" of people to the project, or in the analysis of results. Although the investigators may have assembled information about a great many variables, the index variables are the ones that actually become used in including or excluding people from the project and in the appraisal of the data. Thus, data for such topics as name, address, occupation, height, and serum calcium value might have been obtained during the research, but might never be used
thereafter in the subsequent analyses, whereas information about such topics as existence of disease X, severity of cardiac decompensation, prognostic risk, and serum cholesterol value might become index variables. As the "key" data from which critical investigative conclusions will be drawn, the index variables require careful selection.

1. Form of expression: Dimensional versus categorical. Despite the mathematical appeal of dimensional data, many variables may become scientifically or clinically more meaningful when their numerical values are converted into categorical expressions. For example, although a person's hematocrit can be measured quantitatively, the result is often more meaningful when stated in one of the ordinal categories: anemic, normal, or polycythemic. Similarly, such measurements as temperature, serum cholesterol, and streptococcal antibody titer are often best classified not in their original dimensions but in the converted ordinal categories of high, normal, and low.

The conversion of dimensions to categories may be particularly valuable for assessing the importance or meaning of a change in a variable. For example, suppose "statistical significance" has been noted for the comparison of an average decrement of 5 in one group of patients, with an average of 2 in the other group. Regardless of the "statistical significance," this difference would not be meaningful clinically if the changes were "falls" of 5 mm. per hour and 2 mm. per hour from an initial Westergren sedimentation rate that averaged 180 mm. per hour in both groups. In this situation, the common sense judgment would depend on the categorical decision that both 5 mm. per hour and 2 mm. per hour were too small a change for the transition to be regarded as a significant fall. In many analogous situations, categorical distinctions are invaluable for making decisions about whether a "statistically significant" difference is clinically meaningful.

2. Chronologic components: Unitemporal versus multitemporal. The value used for an index variable at a particular single state in time may be chosen from one or more temporal values. A unitemporal index is based on the patient's state at a single point in time, whereas a multitemporal index involves consideration of more than one temporal state.

For example, before administering an agent intended to lower blood pressure, we might obtain a series of "base-line" or "control" values by measuring the patient's blood pressure daily for two weeks. When we later analyzed the patient's response to the antihypertensive agent, we would use a unitemporal index for the initial state of blood pressure if the value depended only on a single reading just before the therapeutic agent was given; and the index would be multitemporal if the single value for initial state depended on a mean, median, or mode of the two-week series of "base-line" readings. Furthermore, if the values in the base-line series were initially high and then fell to a "plateau" just before the onset of therapy, we would have to decide whether to choose the initial state index from all of the base-line readings, or only from those contained in the "plateau." The way this choice is made might have major effects on the magnitude of the blood pressure available for "lowering" by the therapeutic agent.

Another type of multitemporal index consists of a sum of units rather than an average of measurements. Thus, as an index of response to pharmaceutical treatment of angina pectoris, we might count the total number of episodes of angina or the number of nitroglycerine tablets consumed during the period of treatment. If the index consists of a discrete event, rather than a measurement, the investigator may need to distinguish repetitive or persistent events from those that are single or sporadic. For example, in their initial state before treatment of primary lung cancer, two patients may both have
had hemoptysis that first occurred three months previously, but one patient may have had a single episode of hemoptysis, with no repetition, whereas the other may have had recurrent daily episodes. Similarly, a patient who had a severe bout of chest pain that lasted for one day, two weeks previously, is different from a patient whose chest pain has persisted unchanged for two weeks, but the difference would not be stipulated if the index variable were expressed merely as existence of chest pain during present illness.

3. Number of constituents: Elemental versus composite. An elemental index is based on a single variable in the investigation, whereas a composite index contains data from two or more variables. A particular index may be elemental or composite according to the types of variables used for its creation. For example, severity of dyspnea would be an elemental index if based on a single "global" assessment of dyspnea, and composite if it includes specific contributions from such variables as the metric respiratory rate, the existential presence of orthopnea, and the ordinal amount of exertion needed to produce respiratory distress.

For reproducibility of results, an index that contains many constituents should be prepared in a specifically composite manner, so that each constituent can be identified. If an index with many complex ingredients is derived in an elemental manner, as a global act of "judgment," the procedure cannot be reproducible, because the ingredients will not be stipulated. Thus, as a single variable, existence of thromboembolic phenomenon can be rated as present or absent, but the ingredients of the rating will depend on the presence or absence of such constituent variables as pain in leg, circumference of calf, hemoptysis, and roentgenographic abnormalities. If thromboembolic phenomenon were to be used as an index variable, its existence would require specific definition with diagnostic criteria established for the appraisal of the constituent evidence.

4. Form of aggregation: Boolean clusters versus additive scores. The constituents of a composite index can be aggregated in at least two different ways. A "Boolean cluster" consists of a group of individual categories that are present or absent together in various combinations. For example, the condition of a patient with acute myocardial infarction could be called good, if he has neither shock nor pulmonary edema; fair, if he has one of these complications but not both; and poor, if he has both. An "additive score" is prepared by assigning an arbitrary score (or "weight") to each constituent variable; and the value of the composite index is the sum of these scores. For example, dyspnea might be given 20 points, and tachycardia, peripheral edema, or a large liver, might each be given 10 points, so that a patient who has all four of these manifestations would receive a score of 50 points.

The two methods of aggregation are as different as logic and arithmetic. In the "Boolean cluster" procedure, the categories of each constituent variable are either present or absent, and the values of the index itself are chosen as arbitrary names for simultaneous combinations of categories. In the "additive score" procedure, each category is assigned an arbitrary dimensional value, and the index is the sum of those values. Both the Boolean and the additive techniques may sometimes be combined in a single index. For example, to fulfill the modified Jones criteria1 for diagnosis of rheumatic fever, a patient is required to have a combination of "major" and "minor" features. The combination is satisfied by any two features of the "major" group together with one of the "minor" features, or vice versa.

5.
Degree of artifice: Natural versus contrived. The collection of variables in a composite index is "natural" if the variables have a reasonably homogeneous frame of physiologic or clinical reference, and are ordinarily joined in clinical reasoning. Thus, in an earlier example, the index for thromboembolic phenomenon was based on a
be assessed
natural combination of variables, such as
that
symptoms and physical signs in the legs and chest. On the other hand, if we established a "thromboembolism index" that
placed by a second variable with which the main one appears to be correlated, for example, the hematocrit measurement
included such features as age, height, weight, serum sodium, serum potassium, and prothrombin time, as well as the features already cited, the new index would be contrived, because its heterogeneous components do not have a common physiologic frame of reference. They are not combined in ordinary clinical reasoning, and their conjunction in the index was purely arbitrary.
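
The two forms of aggregation described for composite indexes (Boolean clusters and additive scores) can be sketched in a few lines of Python. This is only an illustration built from the examples in the text: the good/fair/poor rating for acute myocardial infarction, and the 20-point and 10-point weights for dyspnea, tachycardia, peripheral edema, and a large liver. The function names, the dictionary keys, and the choice to treat an unrecorded finding as absent are assumptions made here for the sketch, not part of the original essay.

```python
from typing import Mapping

# Hypothetical weights taken from the worked example in the text.
WEIGHTS = {
    "dyspnea": 20,
    "tachycardia": 10,
    "peripheral_edema": 10,
    "large_liver": 10,
}


def mi_condition(shock: bool, pulmonary_edema: bool) -> str:
    """Boolean cluster: each category is present or absent, and the index
    value is an arbitrary name for a combination of categories."""
    if shock and pulmonary_edema:
        return "poor"   # both complications
    if shock or pulmonary_edema:
        return "fair"   # one complication but not both
    return "good"       # neither complication


def additive_score(findings: Mapping[str, bool],
                   weights: Mapping[str, int] = WEIGHTS) -> int:
    """Additive score: sum the arbitrary weights of the findings present.
    A finding missing from `findings` is treated as absent (an assumption)."""
    return sum(w for name, w in weights.items() if findings.get(name, False))


if __name__ == "__main__":
    print(mi_condition(shock=False, pulmonary_edema=True))
    print(additive_score({name: True for name in WEIGHTS}))
```

In the first function the constituents stay individually identifiable in the rule itself. In the second, two patients with different findings can receive the same total, so the total alone does not show which findings were present.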

Some of the problems of heterogeneously contrived indexes have been discussed by Mainland.1' One of the main problems of such contrivances is that diverse entities may be combined in a manner that makes the individual entities unrecognizable, particularly when the indexes are used to assess responses to an experimental maneuver. Thus, in some of the contrived
indexes used in rheumatoid arthritis, patient
who
remains
a
sedimentation
his
persistently
could
elevated
physically
might replace the hemoglobin value, or thyroid uptake of radioactive iodine might replace the measurement of serum protein-bound iodine. A substantive index the
a
is
various psychologic tests for
congestive heart failure. 6.
Application
variables
many
and moreover, fulfills
after learning
these
criteria,
know whether
somewhat
clinical
>ne in
investigative
of
consideration here, the indexes are used for identifying the initial state of the
and
lation,
they
provide
popu-
data
the
for
such necessities of research as diagnostic criteria,
prognostic strata,
requirements.
At
later
and
eligibility
of the
stages
indexes for a research project
the
may displace the research. He may elect to
for
phenomena
investigator
the
less
arbitrary
and
research. A homologous which the main variable
that are
because
and
they
reliably,
nomena may not
re-
is
that the
objective of
use indexes
scientifically
can
at-
be assessed
but the selected phe-
accurately indicate the
original goals of the research.
For example,
a
heterogeneous manner, are commonly is
stages
In the stage under general
used to identify targets in the subsequent state, and transitions from the initial to the subsequent state. D. Displacement of Objective. Perhaps the greatest intellectual hazard in choosing
other types of contrived indexes,
ii.
single
while
had congestive heart failure alone, chorea alone, or only arthritis.
used index
The
indexes.
hetero-
patient
less
of
complex combinations that indexes can be employed at
different
architecture.
easily
Two
or
are used as
matic fever. 1 The criteria cannot be used, however, for assessing a patient's response
created in a
intelli-
gence or anxiety are examples of substantive indexes. Another example is the use of central venous pressure as an index of
tractive
physician would not
sub-
give
to
cept that does not have a tangible identity.
The
contrived index for the diagnosis of rheu-
patient
created
A
geneous contrived index may sometimes effective for purposes of diagnostic identification, although it may not be applicable to the evaluation of therapy and may obscure the diagnostic constituents. For example, the Jones criteria have provided a highly successful heterogeneous
a
score
incapacitated
be highly
that
or
test
stantive identification to an intellectual con-
search architecture, other indexes will be
his sedimentation rate declines.
to treatment,
deliberately re-
is
who
not be distinguished from a patient
remains
'"'
has a major improvement in
symptoms while
joint
rate
1
to
is
suppose
we wanted
to
study the effect of Excellitol in improving the respiratory
distress
of patients
with
pulmonary disease. One of the variables we would want to identify in chronic
the initial state its
is
severity of dyspnea, but
rating depends on a patient's subjective
perception of dyspnea and on a doctor's subjective classification of degrees of se-
Intake, maintenance,
Because of this double "softness" we might decide to replace severity of dyspnea with "harder" information, such as the patient's timed vital capacity. We have now achieved a more "reliable" measurement, but we have also verity.
the data,
in
•
displaced
phenomenon we wanted
the
Although a good general correlation may exist between severity of dyspnea and timed vital capacity, an individual patient's ad hoc performance on a test to assess.
of vital capacity does not necessarily re-
the respiratory distress he experiences
flect
outside the laboratory in the conditions of his daily life.
1. Reliability versus relevance. Analogous displacements of the objective occur when an investigator, while giving treatment to prevent vascular complications of diabetes mellitus, studies change in glucose tolerance test or mortality rate, instead of clinical evidence of vascular complications; or when the relief of angina pectoris is assessed from serum cholesterol, electrocardiographic evidence, or arteriographic evidence, but not from an evaluation of the clinical severity of the angina. The displacements may increase the reliability of the indexes used in the investigation, but they may also displace the objective so greatly that the results do not answer the main question that was the reason for doing the research. These tactics in index displacement are the source of the "substitution game" that Yerushalmy 20 has so forcefully indicted in modern biostatistics.

The problem of reliability versus relevance of data is a constant cause of inadequate design in clinical investigation, and the problem occurs in choosing variables that define not only the initial state of the population but also (as noted later) the targets of the subsequent state. When the quest for science makes the designers displace critically important data with data that are more reliable but less cogent, the research project may emerge with the right answers to the wrong questions. The solution to this problem is not to continue rejecting crucial soft data, but to "harden" the soft data. When the variables that are necessary to a well-designed investigation are based on soft data, an additional aspect of proper design for the research is the development of better methods of observation and classification to improve the reliability of the essential data.

2. Validation of contrived indexes. In the examples just cited, the indexes chosen for assessment of a particular entity were displaced into another, obviously different entity. The solution to this problem is simple, and requires only that appropriate attention be given to the correct entity. When a contrived index contains a motley mixture of heterogeneous elements, it can be "decomposed" into a series of separate indexes that are individually homogeneous and meaningful. In other circumstances, however, a contrived index that seems plausibly or directly related to the desired entity may be acceptable if its effectiveness has been validated by thoughtful judgment or by actual data.

Examples of such contrivance are the substantive indexes often used in psychologic research. Since no contrived test can provide an exact assessment of intelligence, anxiety, or personality, the investigator must always decide whether the test really measures what he wants it to measure. Is a conventional I.Q. test a "true" measurement of the type of intelligence needed for creativity? Are "anxiety" and "personality" really well measured by some of the tests or "inventories" used for these purposes? In nonpsychologic research, does the substantive index of central venous pressure provide an adequate assessment of congestive heart failure? In all of the situations just cited, there is no statistical method for "validating" the index, and its acceptance or rejection depends on the scientific judgment of the investigator. When the contrived index is created
by homologous substitution, the index can sometimes be validated from actual data showing the correlation between the entity under consideration and its homologous index. The investigator must always beware, however, of a correlation that seems satisfactory in an abstract sense, but that may be misleading for the objective of the research. For example, although there is a generally good correlation between height and weight at different ages in a human population, the repeated measurement of height would be a silly index of progress in a study of dieting to reduce obesity. Similarly, the enumerated consumption of nitroglycerine tablets would be an analogous but unsatisfactory index of the severity of angina pectoris. The total number of tablets used by a patient may generally correlate with the severity of the angina, but does not indicate other important functional aspects of "severity," such as the patient's ability to walk, work, or engage in other acts of physical exertion.

3. Computer distortions. As computers become used increasingly for storage and analysis of data in modern research, an additional new cause of the "displaced-objective" problem is not the goal of scientific reliability, but the convenience of computer compatibility. For processing by computer, the basic evidence must first be converted into "machine-readable language." In designing the formats needed for these conversions, the computer personnel may omit or displace many kinds of crucial information for which a suitable format can not be easily prepared at the beginning of the research project, because an appropriate taxonomy does not exist for the general topics and specific categories of the coding. 12, 13 A suitable format could be prepared later, after enough raw data have been collected for the topics and categories to be discerned and coded, but by that time the investigators may have become infatuated with the analysis of what is already in the computer. They may thus ignore crucial raw evidence that remains uncoded and unanalyzed, or they confine their assessments to the constricted collection of data that could readily be expressed in "machine-readable language."

E. Methods of observation. After all these decisions have been made about the indexes, the variables, and the basic evidence, the investigator can contemplate the acquisition of the basic evidence. If he can establish his own methods of observation for each item of evidence, he must ascertain that the observational procedure is standardized and reliable. The ascertainment may require calibration of equipment, appraisal of observer variability, attempts to remove observer bias by "double-blind" procedures, and other techniques that deal with quality control of the primary data. If the investigator cannot establish his own methods, he must contemplate the flaws that might have been present in the procedures by which the available data were collected.

The methodologic issues of quality control are too numerous and complex for the scope of this outline of biostatistical architecture. They have been discussed in greater detail elsewhere and will be considered again in subsequent papers of this series.

F. Methods of classification. After the primary data have been obtained, they must often receive further classification before they can perform their role as variables and indexes. For example, if a variable is to be specified as anemia present or absent, a description of the method for measuring hematocrit does not indicate "anemia" until the hematocrit values receive classifications that assign ranges of values for anemia and nonanemia. The taxonomic methods of classifying data are also beyond the scope of this discussion, but two aspects of the problems can be briefly cited with regard to imperfect data and intermediate criteria.
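The hematocrit example above amounts to attaching an explicit classification rule to a measured value. A minimal sketch, assuming a single cutoff: the threshold of 36.0 and the category label "unclassified" are hypothetical choices made for illustration, not values taken from the text or from any clinical standard.

```python
from typing import Optional

# Hypothetical cutoff (per cent packed cell volume); illustrative only.
ANEMIA_CUTOFF = 36.0


def classify_anemia(hematocrit: Optional[float],
                    cutoff: float = ANEMIA_CUTOFF) -> str:
    """Assign a measured hematocrit to a stated category.

    A missing measurement is reported as 'unclassified' rather than
    being silently placed in either category."""
    if hematocrit is None:
        return "unclassified"
    return "anemia" if hematocrit < cutoff else "nonanemia"
```

For instance, `classify_anemia(30.0)` returns `"anemia"` and `classify_anemia(None)` returns `"unclassified"`. Whatever cutoff an investigator adopts, stating it explicitly is what allows the classification to be repeated by someone else.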
1. Management of imperfect data. The management of imperfect data requires decisions about information that is missing or that creates ambiguity because of disagreements or contradictions. For example, will angina pectoris be diagnosed as present if the patient is said to have a "chest pain" that is not further described, or if the patient tells one examiner that the symptom is present but denies it in discussion with another examiner? Will ane-
and
statistical architecture,
methods
the
be used The methods
targets.
to
our tunc tin will
the decisions require scientific judgment,
in the
and the main issue
subsequent identification
establish repro-
ducible consistency in the methods used for
making the
decisions.
Intermediate
2.
servational evidence in data of the initial
cri-
may state
terms that enable their specific appli-
For example, criteria for a diagnosis of myocardial infarction may require "abnormal Q waves" in the electrocardiogram but may not indicate what is meant by an "abnormal Q wave." The role of intermediate criteria is to provide cation.
of the specifications for the decisions transfer
that
from
the
to
the
evidence
observational
expression in the primary data
its
categories
various
often unspecified in these for
intermediate
the
scientific
They
results.
methodologic important
are
many
necessary
are
reproducibility
a
data
part
crucial
used
process
soft
research projects,
criteria
and
of
the
of the
"harden" make hard
to
to
data meaningful.
may
Many
Subsequent identification
and
require
will
initial state contains no need examination, but symp-
dosage, their
its
symptoms
that
toms may occur as "side effects" in the subsequent state. Additional examination procedures may be needed for laboratory
and other first
new phenomena
of
tests
B. Additional
and other state
criteria.
The
SUBSEQUENT^ STATE
Subsequent
may
not provide for
target that
is
to
many
be prevented by a ma-
stipulation as part of the subsequent identifications.
When
a particular disease
project
was
©
un-
is
diagnostic-
will often be necessary untoward "side effects" of maneuver from the expected or desired criteria
to separate the
a
effects.
^
In the types of criteria that have just
of
the
research
specified, a series of targets in
subsequent state differentiated. At this the
is
der treatment, special criteria of "diagnostic co-morbidity" 14 may be necessary to decide
been described, an index objective
situations
For example, a
neuver will not be present in the initial state, and its diagnostic criteria will need
Identification
the
diagnostic
criteria established for the initial
that occur subsequently.
distinct
When
that
appear after the maneuver.
the maneuver, or to the development of a co-morbid disease. Another group of
MANEUVER
© STATE ®
initial
arrange-
suitable
ally attributable to the original disease, to
© ^..INITIAL
new
of the vari-
not have been present in the
whether a new manifestation 5.
include
ments for observation and interpretation. For example, if asymptomatic healthy people are being given a new drug to assay
designation,
of
and inference that constitute the major criteria 12 for the variables and indexes used in the investigation. Although appraisal,
however, the
identification,
ables encountered in the subsequent state
teria established for diagnostic, prognostic,
all
initial
kinds of data, criteria, and problems.
or eligibility decisions are often not stated in
of
same methodologic challenges present
A. Additional data.
The major
many
In addition to containing
state-.
utc
still 8'
a
in
35 ' 3G
'
39
'
state
of
controversial
dis-
B. Experiments. If an investigator can choose the maneuvers, as in a therapeutic or explanatory experiment, the main challenge of allocation is to develop a suitable method of assigning patients to the selected maneuvers. The principal decisions involve judgments about premaneuver stratification, numerical equality of populations, size of total population, and allocation procedure.

1. Pre-maneuver stratification. If m different maneuvers are to be studied, the people entered into the experiment must obviously be divided into m groups. Before the m maneuvers are assigned, how- [...] prognostic strata for the main target are seldom clearly known before the experiment begins, and, besides, as discussed 42, 48, 50 many different stratifications might be needed for each of the diverse ancillary targets of the maneuver.

The omission of such a pre-allocation stratification, however, does not absolve an investigator of the need for appropriate prognostic divisions afterward. Unless the patients are suitably divided into groups with different "risks" for each target, the investigator may ignore fundamental differences in the natural events upon which his experiment was imposed. He may mix moribund and asymptomatic patients improperly in evaluating the outcome and in performing analyses of variance or other statistical procedures. 17, 19 These analyses may produce misleading results that remain uncorrected because the investigator has failed to discover that certain maneuvers may have had antipodal effects on different prognostic groups 1, 17 or that the allocation procedures, even though "randomized," may have created major prognostic disproportions among the patients assigned to the compared maneuvers.

Nevertheless, appropriate prognostic stratifications are constantly neglected in the analysis of clinical experiments. The consultant statistician and principal investigator may convince each other that the stratification is either unnecessary, because the randomized allocation "took care of everything," or inappropriate, because of the post hoc timing of the analysis. In other instances, investigators attempting a post hoc stratification may discover that some of the necessary data were not obtained as part of the description of the initial state. For example, many computerized collections of information about the treatment of cancer, diabetes, or other major chronic diseases cannot now be thoroughly analyzed because the coded information about the initial state does not include enough details of symptoms, chronometry, co-morbidity, or other important prognostic features. 22, 23

As discussed later, the process of randomization is an excellent way of allocating treatment in an unbiased experimental manner. But if an appropriate prognostic analysis is not performed either beforehand or afterward, the experimenter may find that he has removed statistical bias with the randomization, but has also removed clinical sense. To avoid making this paper unduly long, the strategy and tactics of post hoc prognostic stratification will be
reserved for discussion at a later date.

2. Number of patients per maneuver. In most experimental plans, equal numbers of patients are assigned to each of the maneuvers under comparison, since equal divisions usually provide the best statistical opportunity for demonstrating a significant numerical difference among the maneuvers. 37 In at least two circumstances, however, the numerical assignments to different maneuvers may be deliberately unequal. In one such circumstance, the investigators intend to test two separate maneuvers whose results will later be combined for comparison against a third maneuver. For example, placebo might be compared against two dosage levels of Excellitol, with the Excellitol results later consolidated if desired. In such a situation, each of the two "combinatorial" maneuvers could be assigned half the number of patients assigned to the third maneuver; and when the results of the two maneuvers are later combined, the numbers will equal those of the third.

An unequal number of patients may be assigned in a different type of circumstance when a maneuver suspected of being distinctly inferior must be tested in a therapeutic trial in order for the suspicion to be proved. In this situation, the suspectedly inferior maneuver is sometimes allocated to a smaller number of people than the other maneuvers.

3. Size of the population. At least three different methods can be used for determining the total size of the population entered into an experiment. In the "calendar" method, a fixed period of time is used for populational intake, and the ultimate size of the population depends on the number of people assembled during that calendrical interval. In the "fixed-size" method, a fixed number of people is chosen for admission, and the intake of population continues until that number is reached. The number can be chosen arbitrarily or from statistical formulas based on the magnitude of difference that the investigator hopes to demonstrate. 14, 46 In the "sequential" method, 2 pairs of patients for the two compared maneuvers continue to be successively entered into the experiment until the results reach boundaries of statistical "significance" or "nonsignificance."

Each of these three methods of choosing populational size has its own advantages and disadvantages. In the "calendar" method, the number of patients admitted during the chosen time interval may turn out to be too small for "statistically significant" results. The project might have to be either concluded without attaining statistical proof or extended over a longer
time interval to get more patients. In contrast to these hazards of the "calendar" method, both the "fixed-size" and "sequential" methods offer the attraction of assuring "statistical significance." Both of the latter two methods, however, also have the logical handicap of being based on a single target of response. Consequently, the calculated numbers that might yield "significant" differences for the single target may not be adequate for analyzing the many other targets that often require attention in a clinical experiment. In addition, the "sequential" procedure is limited to a comparison of two maneuvers, and the target state must be an entity that can be assessed promptly after the maneuvers, to enable decisions about admitting the next pair of patients.

An additional hazard of the "fixed-size" method occurs when the target of the investigation has been displaced, as described previously, 20 into a hard data "endpoint" that has a much lower rate of occurrence than the "endpoint" for the appropriate entity of soft data. In this situation, the sample size estimated to give "statistical significance" may be a number that is massively higher than what would have been required with the "softer" target.

Suppose, for example, that our objective is to test a treatment intended to improve the severity of angina pectoris, and we want the incremental results (or θ value) in the treated group to be at least 30 per cent better than in the "controls." For our "endpoint" we decide to reject the soft data target of "clinical improvement of severity" and instead we choose the hard data of "fatality rate." This decision will greatly alter the number of patients needed for "significant" results in the experiment. If "clinical improvement" can be expected in 70 per cent of the control group, we would have demanded (according to the θ specification of 30 per cent) that 91 per cent of the treated group "improve"; but if the expected fatality rate in the control group is 10 per cent, we shall demand that this rate be reduced to 7 per cent in the treated group. With these specifications, and with α = 0.05 and β = 0.05, we would have needed a total of 86 patients to get "statistical significance" with the "soft data" endpoint, but with the "hard data" endpoint we shall need 2,240 patients in the trial. Even if we became more "liberal" and raised the β level to 0.10 while preserving α at 0.05, the "hard data" endpoint would still require 1,811 patients while the "soft" endpoint would need only 70.* Confronted with the huge amount of extra patients and associated labor required to get numerical "significance" with the "hard" endpoint, the investigator may be so discouraged that he abandons his plans to conduct the experiment.

*These calculations were based on the formulas cited in pp. 221-222 of reference 46. I am indebted to Mrs. Elizabeth C. Wright for checking the calculations.

Despite the problems cited in both logic and numbers, the calculations of "sample size" according to "fixed" or "sequential" methods have become a prerequisite ritual of "statistical design" for contemporary clinical trials. 10, 43 The maneuvers chosen for comparison may be inadequate or illogical; the investigators may not know how to perform a suitable prognostic stratification beforehand and may neglect it afterward; the eligibility criteria may vitiate any numerical anticipations based on previous experience and may preclude admission of the diverse types of patients needed for the trial to be clinically meaningful; and the initial variables or targets of response may be chosen inappropriately and assessed improperly; but the statistician, undaunted by these imprecisions, may marshal his α, β, and θ values and calculate the exact number of patients needed for "statistical significance."
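The sample-size figures quoted in the angina example can be checked with a short calculation. The sketch below is an assumption, not the author's own method: it uses a common normal-approximation formula for comparing two proportions, since the formulas of reference 46 are not reproduced in the text. The function name and the table-rounded normal deviates (1.96, 1.645, 1.282) are likewise illustrative choices. With those inputs the formula happens to reproduce the quoted numbers. It is usually read as giving patients per group, whereas the text speaks of totals, so the sketch takes no position on that convention.

```python
def patients_needed(p_control: float, p_treated: float,
                    z_alpha: float, z_beta: float) -> float:
    """Normal-approximation sample size for comparing two proportions:
    (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2."""
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2


# Table-rounded normal deviates (assumed): two-sided alpha, one-sided beta.
Z_ALPHA_05 = 1.96
Z_BETA_05 = 1.645
Z_BETA_10 = 1.282

# "Soft" endpoint: improvement in 70% of controls versus 91% of treated.
soft_05 = round(patients_needed(0.70, 0.91, Z_ALPHA_05, Z_BETA_05))  # 86
soft_10 = round(patients_needed(0.70, 0.91, Z_ALPHA_05, Z_BETA_10))  # 70

# "Hard" endpoint: fatality in 10% of controls versus 7% of treated.
hard_05 = round(patients_needed(0.10, 0.07, Z_ALPHA_05, Z_BETA_05))  # 2240
hard_10 = round(patients_needed(0.10, 0.07, Z_ALPHA_05, Z_BETA_10))  # 1811
```

The contrast arises almost entirely from the denominator: the absolute difference is 0.21 for the soft endpoint but only 0.03 for the hard one, and the required number grows with the inverse square of that difference.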
Although an estimate of the number of patients and length of time is always desirable for a large-scale investigation, particularly for cooperative studies at multiple institutions, the lamentable aspect of the current "numbers game" is that the intensive planning given to these statistical desiderata is often the main focus of the activities in "statistical design." While the numbers are being meticulously calculated, however, many of the fundamental necessities of scientific logic and data may be ignored.

4. Method of assignment. Regardless of how the numbers of groups and patients are determined, the investigator must choose a method of assigning patients to each maneuver. The main scientific object of this method is to avoid bias in the assignment. A "double-blind" procedure, which is helpful in preventing bias when subsequent examinations are performed, would also help avoid bias when
maneuvers are allocated. In addition, when one maneuver is initially suspected to be distinctly better than another, a double-blind allocation has the further advantage of freeing the investigator from "moral" qualms that may arise if he knows which patients have been assigned to the "inferior" maneuver. For example, in the first large-scale field trials of a poliomyelitis vaccine, 26 a double-blind technique helped avert the ethical quandaries that arose about assignment of patients to the vaccine or to the "control" preparation.

Although a double-blind method of allocation would solve the scientific and ethical problems just cited, it does not solve the practical problem of establishing an order for the assignments. A specific sequence must be chosen for allocating the maneuvers, even though the person who later administers each maneuver may not know its identity. This sequence is best selected by the statistical procedure of randomization. A full description of randomization is beyond the scope of this outline, but its value can be summarized as follows: randomization averts the bias that may occur when double-blind techniques cannot be used; it guards against the bias that may arise when maneuvers are assigned in an alternate manner or in any other "systematic" manner (such as patients' unit numbers); and it provides mathematical legitimacy for the subsequent assessment of results with statistical tests based on "random sampling."

The technique of randomization is one of the major contributions made by inductive statistics to modern clinical research, and the frequent scientific abuse of the procedure, as described here and elsewhere, 17 should not detract from its great value when properly employed. Despite the importance of randomization, however, it does not appear as a cogent principle of research design until this sixth step in the architecture, and then only for research projects that are experiments, rather than surveys. Although many clinicians have been inordinately slow to accept the need for randomization in planning clinical trials and experiments, many statisticians have been overly zealous in promulgating randomization as a panacea for flaws in "experimental design." As an adjunct to well-planned experimental architecture, randomization makes a powerful contribution to modern science; as a substitute for suitable scientific logic, randomization serves to perpetuate defective research, performed and often accepted in the complacent delusion that a satisfactory "statistical design" is alone adequate for valid experimental science.

[Figure: INITIAL STATE, MANEUVER, SUBSEQUENT STATE]

7. Intrusion

After a maneuver has begun, the pathway from initial state to subsequent state may be interrupted in several ways. The maneuver itself may be abandoned by the patient; the maneuver or the observation period may be altered by an untoward event; the necessary subsequent examinations may not be performed; or the patient may stop participating in the project. The various intrusions that can appear between maneuver and subsequent state are considered in this stage of the architecture.

A. Nonadherence to maneuver. A person may not faithfully adhere to a maneuver that must be maintained over a long period of time. Oral or injectable medication, or such maneuvers as cigarette smoking, may be taken erratically or discontinued entirely by patients who nevertheless continue to be observed in the project. In stage three of the research architecture (dealing with maintenance of the population 20), arrangements were made
The
60
architecture of cohort research
to obtain data
about adherence to maneu-
for
diagnostic attribution. For example
its
1 ,
In this seventh stage of the architectural design, the main challenges are
during long-term treatment of a chronic disease, the attending physician (or the
the fidelity of the adherence decide about the way in which the results obtained in poor adherers will be ascribed (or not ascribed) to the' asso-
investigator)
vers.
to
classify
and
to
ciated maneuver.
must decide whether each is due to the
posttberapeutic manifestation
treatment itself, or to features associated with either the evolving main disease or co-morbid ailments. Thus, for appropriate classification of anorexia that
Suppose
patient
a
is
assigned
tn
take one oral
tablet twice daily, on awakening ami at bedtime. How many and what kind of deviations from tliis
prescription
will
constitute
regimen, and
different
decrees
how many
fidelity
to the
fidelity
should he established? Should
of
classes of
we simply
adherence for all patients as either good or not good, or should there he four categories oi adherence, such as exct Ben*, good, fi'ir. and poor? What will he the specific criteria lor classifying deviations in the intra- and interdiumal patterns? Will we he satisfied if the patient takes his two tablets each d.tv. but hoth at one time, or if he ingests one tablet at lunch and the other at suprate the
per? Suppose he forgets one tablet on one day and takes three the next? II he omits the tablets tor three
make
a
days
in the
course of a month, does
it
difference whether the three days occur
consecutively
or
sporadically?
No
statistical
an-
are available for anv of these questions, which require arbitrary judgments individualized for each research project. Another tricky problem occurs in ascribing the results obtained in persons who have not maintained the maneuver faithfullv. Suppose daily doses of Excellitol were given to lower blood cholesterol, and suppose the adherence to the regimen can be classified as good, fair, and poor. At the end of the project, we would not be surprised if the cholesterol were significantly lower in the good adherers than in the poor, but how would we interpret an even lower result in patients with only fair adherence? Suppose the placebo patients with good and fair adherence had sub-
swers
lower values for cholesterol than the adherence Excellitol group? Again, these questions cannot be answered with statistical or scientific theories, and each decision must be made with a logic appropriate to the problem. stantially
poor
B. Untoward events. The patient's ability to maintain an on-going maneuver, such as a diet or medication, may be impaired by the development of an adverse clinical condition not present in the initial state. Beyond any effect on maintenance of the maneuver, each subsequent clinical event or other manifestation must be analyzed for its attribution. For example, if anorexia develops after chemotherapy of a cancer, the investigator must attribute the anorexia to functional effects of the cancer, to a reaction (i.e., a "side effect") produced by the chemotherapeutic agent, to an associated comorbid disease that was present before or after treatment, to a psychic depression evoked by the patient's emotional response to his illness, to combinations of these factors, or to other causes.

The way that these "co-morbidity" decisions are made can profoundly affect statistics about the target of a research project. For example, fatality rates in pathogressive surveys of patients with cancer are usually based on all patients who died, regardless of cause of death. Thus, a patient in whom a cancer had been successfully removed but who later died because of a myocardial infarction or in an automobile accident is often statistically counted in the same way as a patient who died of disseminated cancer. Another example of this problem is the misleading rates of disease created from using mortality data of the Bureau of Vital Statistics, where each patient's death is attributed to a single cause, regardless of how many major diseases were present. 1,;

The results of two recent large-scale investigations also illustrate some of the difficulties that can occur in evaluating deaths and associated treatments. In comparison with the "control" group, patients with prostatic cancer who were treated with estrogens had fewer deaths "due to cancer" but more deaths ascribed to "other causes," so that the total fatality rates in the two groups were essentially the same. 49 Domiciliary patients receiving a low fat diet had fewer cardiovascular deaths than the "control" group but more noncardiovascular deaths, so that the total fatality rates in the two groups were similar. 13

The problems of "attribution" in "diagnostic co-morbidity" have been discussed elsewhere, 21 and are beyond the scope of this outline. The main point to be noted here is that no thoroughly satisfactory statistical procedures have been developed for these problems. Their management currently depends on the sensible use of scientific judgment.
C. Displacement of examinations. The third type of intrusive problem occurs in the patient's or the investigator's adherence to the planned examination procedures. A particular examination that was scheduled for repetition at specified intervals may not have been done; it may have been performed on dates other than the ones planned for it, or it may have been "displaced" by some other test. For example, if serum antibodies are to be tested every two months for detection of intercurrent group A streptococcal infections, should we regard a particular patient as adequately tested if a particular examination was delayed so that a three-month interval elapsed between specimens? If the test after a three-month interval shows that an infection has occurred, in what period does the infection get counted: the previous two-month period, or the next one? What about the situation in which the timing of the test was satisfactory, but a different test was done? Thus, would we regard a patient as adequately examined for group A streptococci if the tests were based on throat cultures rather than serum antibodies? And what about the type of "displacement" in which a particular test produces an "outlyer" result: a value that is extraordinarily higher or lower than the general range of expected values for the test? Should this "outlyer" be rejected from consideration or accepted and included among the other data?

Conventional statistical tactics 37, 40, 50 for dealing with missing data and with "outlyers" are seldom suitable for the circumstances just described. The tactics usually apply to situations in which a group of objects have all received the same test at the same time in a single performance. No provision is made for the logical problem of interpreting repetitive tests and for managing the other difficulties in displacement of data.

If the compared maneuvers have been allocated by randomization, the investigator may take false comfort from the belief that subsequent displacements of data will also occur randomly and that almost any rational method of analyzing them will be satisfactory, since the randomization prevented earlier bias. The flaw in this belief, as described previously, 20 is that the maneuver may in itself create unequal opportunities for subsequent detection of the target state. The displacements of data may thus reflect some of the bias caused, after randomization, by administration of the procedure used in the maneuver.

Suppose oral daily penicillin and monthly injections of long-acting penicillin have been randomly allocated in an experimental trial of their capacity to prevent subsequent streptococcal infections in young patients who have had rheumatic fever, and suppose streptococcal infections will be detected via comparison of antibody titers in bimonthly specimens of sera. If the patients assigned to monthly injections must appear at the research clinic in order to receive the injections, the bimonthly specimens of sera can be obtained as part of the circumstances surrounding the administration of the prophylactic therapy. On the other hand, if the patients taking the oral medication can receive their monthly supply by mail, the routine acquisition of serum specimens will require a special visit to the clinic for an examination that is therapeutically unnecessary. According to the way this inequality of "maintenance" is managed by the investigators, the serologic specimens may be obtained with greater diligence and regularity in the "injection" group than in the "oral" group, or vice versa. A difference in the rate of streptococcal infections in the two groups may thus arise from these problems of intrusion, rather than from the therapeutic maneuvers.

The difficulties just described, which can occur despite randomized allocation of maneuvers in an experimental trial, must be especially guarded against when the research is conducted as a survey. As noted previously, 20 unequal opportunities for detection of the target may cause women taking oral contraceptive pills to have a falsely higher incidence of thrombophlebitis or cervical cancer than women using mechanical devices. Similarly, since people with a chronic cough are more likely to have chest x-rays taken than people without a cough, and since smokers are more likely to have a chronic cough than nonsmokers, some of the higher rate of lung cancer in smokers may be due to their greater opportunity for having lung cancer detected when it occurs.
have a
falsely higher incidence of
D. Displacement of patients. The last intrusive problem to be cited here is the loss of patients during an investigation in which the target state occurs long after the initial state. Suppose the main target of investigation is a discrete event, such as death. When we tabulate the frequency of the target event (in such expressions as "5-year survival rates"), what shall be done about patients known to be alive at an earlier date, but later "lost to follow-up" at the 5-year interval selected for analysis?

The usual statistical approach to this problem is the use of an "actuarial" or "life-table" analytic procedure. ll > 3S In this procedure, a numerator and denominator population are counted at periodic intervals, such as 1 year, from the onset of the maneuver. The denominator for each interval consists of an appropriate count of the people who began that interval; the numerator consists of all people in whom the target event occurred during that interval. A rate of the target event is calculated from the numerator and denominator for each successive interval, and the "final" rate for the most advanced time period is the product of the rates calculated for each antecedent interval. Thus, the five-year survival rate would be obtained by first finding the one-year survival rate in people who have been followed for one year; this rate is then multiplied by the survival rate during the second year in people who were followed for a second year; the product of the first two yearly rates is then multiplied by the survival rate during the third year; and so on, until the fifth-year rates are multiplied by the product of the four preceding rates to yield the final result.

The life-table approach is particularly valuable for analyzing data in a research project in which populational intake occurs serially over an extensive calendar interval. Since not all the patients will enter the study at the same time, they will have different lengths of follow-up at the calendar dates on which the investigator prepares "progress reports." Thus, about four years after the project has begun, the investigator may want to calculate 3-year survival rates, but many patients who entered the project only 18 months ago have not yet had the opportunity to survive for three years. For this kind of "serial intake" problem, the life-table method offers an excellent adjustment for duration of follow-up.

The life-table method is much less satisfactory, however, if the problem is one of "serial drop-out" rather than "serial intake." When "losses to follow-up" occur during a particular interval in the life-table calculations, the denominator for the interval is created as the sum of all the patients who were followed throughout the interval, together with half the number of those who were lost; and the "drop-outs" do not appear in the numerator. For example, suppose that death is the target event and occurs in 5 of 90 people who were followed throughout an interval during which 20 other people were "lost." The denominator would be 100 [ = 90 + 10], and the death rate for the interval would be 5 per cent [ = 5/100].

Regardless of the theoretical statistical support for this "half-life" procedure, a scientist will feel uneasy that unproved assumptions have been made about the fate of the "lost" patients. A more effective scientific approach to this problem would be to obviate any guesswork based on statistical theories or scientific logic, and to establish adequate epidemiologic methods for following the population carefully enough to provide information about the actual state of all members, thus eliminating the need for conjectural assumptions. If the investigators make vigorous efforts to "trace lost persons," the logical hazard of a "serial drop-out" life-table can be managed by avoiding it.

8. Transition

[Figure: INITIAL STATE, MANEUVER, SUBSEQUENT STATE; labeled "Transition"]

After suitable provision has been made for the types of intrusion just described, standards must be established for assessing the transition that has occurred from the initial to the subsequent state of each person in the population and for the transition
in each population group.

A. Types of change. The assessment of change involves subtle issues in chronology. As noted previously, 20 the single value of an index for a particular state in time can depend on unitemporal or multitemporal contributions. (For example, the initial- or subsequent-state value for blood pressure can each be a single reading or the average of a series of readings.) Regardless of the number of temporal contributions included in the index value for a single state, the assessment of a change may involve values for one, two, or more states. According to the number of values that are compared for the delineation, a change can be monadic, polyadic, or dyadic.

1. Monadic changes. In a monadic change, the person's initial state is not really a constituent of the data evaluated for delineating the "transition." There is no actual comparison of a "before" and "after" condition, because the "after" event may not have been present initially, or the subsequent state may be assessed exclusively on the basis of what happened during or after the maneuver. For example, the development of poliomyelitis is the monadic change used as the target to be prevented in contrapathic vaccination of healthy people; death is the monadic change to be prevented as a target of contratrophic treatment for many people with cancer.

Monadic indexes are commonly established as the score of a particular test used to describe a person's condition after the main "maneuver" has already occurred. For example, an I.Q. test may be given to a group of well-nourished and poorly-nourished children, and the investigator, on the basis of the monadic scores, may attempt "retrospectively" to compare the influence of nutrition on intelligence.

Another type of monadic index is used for situations in which the target in the "subsequent state" is an entity that can occur repetitively during the course of an on-going maneuver. In this circumstance, the entire period of time that the maneuver was maintained represents a single monadic "subsequent state." Examples of such indexes are the number of attacks of angina pectoris or the number of nitroglycerin tablets ingested during maintenance of an antianginal drug regimen. In the examples just cited, the indexes were individually multitemporal, because they contained contributions from several points in time; but their sum constituted a monadic description of "change," because the pretreatment state of the patients did not enter into the calculations.

2. Polyadic changes. In contrast to monadic changes, polyadic and dyadic changes represent distinct "transitions" because a value for the initial state is contained in the assessment of the "change." A polyadic change is based on the value of a quantitative variable at three or more different points in time, and the index of change depends on the line or curve that connects those points. For example, if a person initially weighed 130 pounds at age 15, and then weighed 160 pounds at age 18, 190 pounds at age 21, and 220 pounds at age 24, the weight curve has been linear, with an upward slope of 10 pounds per year. A polyadic change is commonly expressed as a "trend," and determined with statistical procedures that find the best-fitting straight line (or non-linear curve) for the collection of different temporal points. When a linear model is used for "fitting" the line, the trend is usually expressed as the "slope" of the line. Quadratic or other models can be used to fit trends that have distinctly curved shapes or that go up and then down, or vice versa.

3. Dyadic changes. Dyadic transitions are constantly used in therapeutic research for evaluating either remedial treatment, intended to make a patient's condition "better," or contratrophic treatment, intended to keep him from becoming "worse." Unlike the monadic and polyadic changes just described, a dyadic change contains a direct comparison of two temporal states: before and after the maneuver. The variable used for determining a dyadic change must be expressed in values that have
graded ranks. The ranks can be measured or counted dimensions (such as 13, 14, 15, . . .) or semiquantitative ordinal ratings (such as high, medium, or low). For dimensional ranks, the transition between the initial and the subsequent state can be expressed as a subtracted increment (or decrement) or as a percentage increase (or reduction). Thus, if the sedimentation rate was 50 mm. per hour before the maneuver and 30 mm. per hour afterward, the change can be expressed as a fall of 20 mm. per hour or as a 40 per cent reduction. For ordinal ranks, the transition is expressed in such comparative categories as higher, same, or lower.

An interesting aspect of dyadic transitions is their capacity for converting qualitative or quantitative data into semiquantitative categories based on concepts of clinical or biologic desirability. Thus, the ordinal rating scale of better, same, or worse for dyadic changes can be applied both to quantitative phenomena, such as a reduction in temperature, or to such qualitative alterations as a disappearance of symptoms, or a change in color of urine.

To manage the problems of dyadic transition, the research architect, who has already established sets of criteria for identifying entities in the initial state and subsequent state, must now establish a special new set of criteria for the changes between the two states. These "transition criteria" will require many scientific judgments beyond those that have already been necessary. One set of judgments deals with the conversion of dimensional data into transitional categories. Thus, an initial Westergren sedimentation rate of 180 mm. per hour and a subsequent value of 175 mm. per hour might each be called markedly elevated in their single states, but would the decrement of 5 mm. per hour in the two states warrant being called a fall? A second set of judgments deals with the magnitude of transition in semiquantitative ordinal categories. For example, if cardiac enlargement is graded as none, slight, moderate, and extreme, what pairings of categories will be used to denote such transitions as much smaller, smaller, unchanged, larger, and much larger?

B. Population indexes. Using one of the procedures just cited for describing a change, the investigator can establish an index of individual transition for each member of a population. Although a change in the selected variable can thereby be expressed for individual people in the group, additional decisions become necessary in choosing a populational index that can summarize the results of the individual indexes in the entire group.

1. Types of populational indexes. A populational index can be derived from the individual indexes in at least two ways. If the individual index is expressed in categories, the proportionate frequency of the categories can be enumerated in the population; if expressed dimensionally, its average value can be calculated. For example, the categorical response to treatment of congestive heart failure in a group of 78 patients can be cited proportionately as excellent in 9 per cent, good in 65 per cent, fair in 14 per cent, and poor in 12 per cent; the survival time in 112 patients with cancer can be averaged as a median value of 8.3 months, or as the mean of 9.1 months. When the individual index has been expressed dimensionally, the populational index can still be expressed categorically. Thus, in the 112 patients just cited, the populational survival could have been categorically stated as a rate of 42 per cent at 6 months. (In the latter expression, the "categories" of survival were alive at 6 months and dead at 6 months.)

When individual transitions have been expressed in a dimension that was measured repetitively, rather than only at the two times of initial state and subsequent state, the populational performance can be calculated with a "trend" equation, using some of the mathematical models described previously for determining trend in an individual person.

2. Problems in denominators. A group index expressed in proportionate categories consists of a ratio: the numerator is the frequency of occurrence of the selected category within the population of people or events enumerated in the denominator. (Thus, in the earlier example, the 9 per cent excellent response rate in congestive heart failure was based on 7 such responses among 78 treated patients.) The denominator of such expressions may sometimes be displaced inappropriately into what Mainland has called "wrong sampling units" or "spurious replication." 7 As an example of the problem, Mainland cites the attempt to measure personal resistance to new caries by counting the number of carious teeth in a group of people, and dividing this number by the total number of teeth that were counted, instead of dividing the number of people with carious teeth by the total number of people.
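Mainland's distinction between the two denominators can be illustrated with a short sketch. The counts below are hypothetical, invented only for illustration, and the function names are not taken from the source. The sketch shows how a single person with many carious teeth gives a per-tooth figure that differs markedly from the per-person figure.

```python
# Hypothetical data (not from the source): one tuple per person,
# giving (teeth counted, carious teeth found).
people = [(28, 0), (30, 0), (26, 12), (32, 0)]


def rate_per_tooth(records):
    """Carious teeth divided by all teeth counted (teeth as the sampling unit)."""
    total_teeth = sum(teeth for teeth, _ in records)
    total_carious = sum(carious for _, carious in records)
    return total_carious / total_teeth


def rate_per_person(records):
    """People with at least one carious tooth divided by all people."""
    affected = sum(1 for _, carious in records if carious > 0)
    return affected / len(records)


per_tooth = rate_per_tooth(people)
per_person = rate_per_person(people)

print(f"rate per tooth:  {per_tooth:.3f}")
print(f"rate per person: {per_person:.3f}")
```

In this invented group, one person out of four has carious teeth, yet the per-tooth ratio is governed by how many teeth that one person happens to contribute, which is the hazard of the "wrong sampling unit."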
The choice of an appropriate denominator is particularly subtle when the numerator consists of k events and when the data available for the denominator consist of m episodes in n people. For example, suppose we know that evidence of new carditis occurred 54 times in 105 recurrent attacks of rheumatic fever in 78 patients. Is the rate of new carditis 54/105 or 54/78? The answer to this question depends on whether we want to know the risk of new carditis in a recurrent attack of rheumatic fever, or in a patient who has a recurrent attack. The risk per attack would be 54/105, but the risk per patient cannot be determined from these data because we would need to know, as the numerator, the number of patients who had had new carditis rather than the 54 recurrences in which new carditis had been observed.

When the numerator consists of an event that can occur repetitively, such as streptococcal infections or episodes of angina pectoris, the denominator is often converted to the total time period of observation for the persons in the population. Thus, the occurrence of 40 streptococcal infections in 100 patients observed for a total of 200 patient-years is often reported as an attack rate of "20 per cent per patient-year" rather than "40 per cent per patient." This type of person-time denominator can be useful for many situations, but it carries the hazard of possible distortion by the way the unit of time is chosen and by major disparities in length of the observation period for individual members of the population. Thus, the 20 per cent per patient-year just cited would have seemed minuscule if expressed as 0.055 per cent per patient-day and gigantic if expressed as 200 per cent per patient-decade. Furthermore, 200 patient-years could be obtained by observing each of the 100 patients for about 2 years, or by observing 9 people for about 20 years and the remaining 91 people for about 2 months each. In the latter circumstance, a denominator based on "200 patient-years" would be misleading.

3. Strategies in transstratification. There are no statistical criteria for deciding whether a populational index is best expressed as a curvilinear trend, a categorical ratio, or a dimensional average. For averages, there are also no criteria for preferring the mean or the median as the choice of expression. All of these decisions must be made judgmentally according to the logic of each situation. An example of the considerations is presented elsewhere, 22 in the expression of survival in cancer, where the median is usually preferable to the mean and where the categorical "rate" may be preferable to both the "average" measurements.

A critical aspect of these indexes is not just their selection but their correlation with the strata of population to which they refer. Suppose a population contains 50 people whose value for a particular index in the initial state is high and 50 people whose value is low. Suppose further, in the transition results, that the value has become higher for 50 people and lower for 50 people. Did these changes occur throughout both groups or did the high results become higher and the low ones lower, or vice versa? Unless the results are transstratified, we might fail to detect the major differences between one treatment that raises high values and lowers low values and another treatment that has just the reverse effects, making high values low and low values high.

9. Induction

[Figure: INITIAL STATE, MANEUVER, SUBSEQUENT STATE; labeled "Induction"]
At this next-to-last stage in the architectural operations, we have finally reached a major activity that requires knowledge of what is generally regarded as "statistics." The data having been expressed for the populations exposed to each maneuver, we can now use inductive statistical procedures for analyzing the results and making inferential decisions.

A. Analytic procedures. In statistical analysis, various tests are employed to determine trends (or correlations) and to compare numerical differences in distinctions among groups or events. The strategy of selecting these tests is beyond the scope of this outline and will depend on such characteristics of the research project as the types of data, the distributions of data, the number of groups, and the number of events. In seeking guides to the choice of analytic statistical procedures, clinical investigators may find certain books or articles 6, 28, 32, 33, 3r - 45 particularly valuable because the characteristics of the research data, rather than the innate distinctions of the statistical tests, are used as a basis for taxonomic arrangement of the text.

Despite the many excellent procedures available for statistical analyses, certain fundamental problems in statistical logic remain unresolved and are generally ignored or glossed over in most textbooks. One of the problems is the validity of applying statistical tests, all based on "random sampling," to populations that are not at all randomly chosen. Since the intake of people in a survey, as discussed previously, 20 is seldom if ever "random," investigators could not apply most statistical tests to survey data if we insisted on "random sampling." To allow statistical tests, therefore, we conveniently ignore our constant violations of the basis for the test.

Another problem is created by the modern inversion of the concept of variance so that it can refer to the measured objects as well as to the system of measurement. The idea of "variance" was originally created in reference to repeated measurements, and each deviation from the mean contributed to the "error variance." 34 In many modern applications, however, variance refers not to mensurational precision in multiple measurements of a single object, but to populational dispersion in single measurements of multiple objects. "Quality control" and "range of normal" can thus become admixed in the same conceptual procedure, so that either the measuring devices or the people who deviate from a mean become associated with different magnitudes of "error variance."

Two other traditional problems in analytic statistical procedures have acquired new solutions in recent years:

1. Parametric violations. Since most of the dimensional data encountered in clinical research do not have a Gaussian (or "normal") distribution, we may not be justified in using t tests, F ratios or other "parametric" tests that depend on Gaussian distributions. The effect of violating the assumptions of "normality," although seldom mentioned in most statistical books aimed at biologists, has recently been thoroughly discussed in texts devoted to "distribution-free" 6 or "nonparametric" 15 statistics.

2. Validity of transformations. Attempts are often made to "normalize" a non-Gaussian distribution by using logarithms, square roots, or other transformations of the original variables. Since investigators generally decide about meaningful importance by comparing the data they have observed, rather than arbitrary transformations of the data, the scientific logic of these changes is uncertain. Are they preferable to using the alternative "nonparametric" or "distribution-free" tests that rank the observed data without altering the basic values? Suppose we have found that the cube root of the arc tangent of serum cholesterol is "significantly" lower in one group of patients than in another? Is a "significance" based on such a peculiar conversion of data really more meaningful than what could be found by ranking the results in a "nonparametric" or "distribution-free" test?
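The contrast between transformed data and ranked data can be made concrete with a minimal sketch. The cholesterol values are hypothetical, and the Mann-Whitney U statistic is used here only as one familiar example of a rank-based procedure; the source does not name a particular test. Because a rank statistic depends only on the ordering of the observations, any order-preserving conversion, even the cube root of the arc tangent, leaves it unchanged, whereas a difference in means depends entirely on the scale chosen.

```python
import math


def mann_whitney_u(a, b):
    """U statistic for group a: number of (x, y) pairs with x > y; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def mean(values):
    return sum(values) / len(values)


def peculiar_transform(value):
    """Cube root of the arc tangent: an order-preserving but arbitrary conversion."""
    return math.atan(value) ** (1.0 / 3.0)


# Hypothetical serum cholesterol values (mg/dl), invented for illustration.
group_a = [182, 195, 210, 240, 310]
group_b = [170, 188, 200, 205, 215]

trans_a = [peculiar_transform(v) for v in group_a]
trans_b = [peculiar_transform(v) for v in group_b]

u_raw = mann_whitney_u(group_a, group_b)
u_transformed = mann_whitney_u(trans_a, trans_b)
diff_raw = mean(group_a) - mean(group_b)
diff_transformed = mean(trans_a) - mean(trans_b)

print("U on observed values:      ", u_raw)
print("U on transformed values:   ", u_transformed)
print("mean difference, observed: ", diff_raw)
print("mean difference, converted:", diff_transformed)
```

The rank statistic is identical on both scales, while the mean difference shrinks from tens of units to a small fraction of a unit, so any "parametric" conclusion is tied to the particular conversion that was applied.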
B. Inferential decision. In
all
of the pro-
had be adopted for choosing the test that would be followed by an inferential decision about "significance." The strategy depends on the following act of contracedures just cited, a
to
statistical strategy
Subsequent implementation of the
We
assume, according to the
changes,
laboratory
no difference among the groups being compared. With this assumption, we determine the probability (or P value), that the observed
venience
of
puntal logic:
"null hypothesis," that there
is
by chance. If this probor below a selected a level,
difference arose ability
at
is
we
such as 0.05 or 0.01,
reject the null
hypothesis and proclaim the observed dif-
objectii <
results,
side
and
cost
treatment,
67
con-
effects,
treatment.
of
The
statistical decision procedure, however, is geared either to regarding all targets as having equal importance or to using only one target
variable.
To
get a single answer about "statistical
abandon
significance," clinical investigators usually
a judgmental evaluation of separate decisions for each target, and, instead, compress multiple targets
into
a single variable. This compression
source of
the
many
of
"contrived
the
is
indexes"
ference to be "significant" at the calcu-
whose
lated value for P. This statistical strategy
ly-
although sanctified bv tradition and worshipped by investiga-
3. Problems of the null hypothesis. Although the two sides of a coin are often used to illustrate concepts of probability, only one side of a
of "hypothesis testing,"
tors
and
free,
however.
editors,
P
5
-
>
M
two-sided biologic issue
value strategies.
P values has been severely
cent vears. 3
44
>
Some
The
criticized
use
in
re-
writers have proposed
P values and "significance" concepts be dropped entirely and replaced bv estimations based on "confidence intervals" 5 25 or "maximum likelihood ratios." 25 3l In the avant garde today, main- statisticians advocate procedures of Bayesian analysis, which replaces conventional quantifications with subjective estimates of numerical that
'
'
probabilities.
31
,J '
>
34
Partisan
9
'
-3
>
are
theorists
still
may
procedures,
statistical
For example, in one reconducted after almost three decades of dissemination and general of
seminar," 12
the
doctrine
that
"hypothesis
statistical
clinical
trials
testing,"
eight
spent 25 pages discussing the charge,
made by another
statistician, 1
of error probabilities to
.
.
.
that "the concept
has no direct relevance
experimentation." 2.
The
restrictions of univariate targets.
Regard-
the strategy used for a statistical deabout "significance," the decision must be based on a univariate target. Although able to
less
of
cision
manage multiple
do not
variables in
the initial state of
gorical
not
suitable
for
the subsequent state.
multiple
clinical
some of target must In
the
"multivariate" techniques, the be a single variable; in others, multiple target variables are accepted but are given
the
same
rela-
"weights" or importance. Thus, a thoughtful clinician, appraising the results of treatment, might tive
to give separate evaluation to
targets
as
enable the
alter the basic intellectual restrictions 1 '
symptomatic
such multiple
response,
functional
data,
with
a
clinical
investigator
new measurements
sional data.
will
'
usually
After
all
of "continuous" dimen-
this
work
in
getting "con-
tinuous" data, however, and after calculating
all
the statistical tests of the data, the investigator
then makes the final decision about his results on the basis of a completely arbitrary pair of dichotomous categories. These categories, which are called "significant" and "nonsignificant," are usually demarcated by a P value of either 0.05 or 0.01, chosen according to the capricious dictates
want
may
be serviceable but
to
go to enormous efforts in mensuration. He will get special machines and elaborate technologic devices to supplant his old categorical statements
or
in
modifications
procedure
4.
analysis" targets
The
of
the patient, statistical techniques of "multivariate
are
errors.
null-hypothesis
Procrustean categories of decision. The method making statistical decisions about "significance" creates one of the most devastating ironies in modern biologic science. To avoid using cate-
of
themselves.
statisticians
a
of
lated procedure for testing "non-null"' hvpotiieses.
cent "biometrics
required
and P value statistics of the "null hypothesis." We would have to modify the basic concepts by using confidence intervals and by applying the ideas of β error introduced by Neyman and Pearson 41 to supplement the unilateral scope
about
be relieved to learn that statisticians are beset with doubts about the basic "security" of the
acceptance
is
used to "prove" a difference but never a similarity. If we wanted to show that two drugs were essentially the same, rather than different, we could not do so directly with the α levels
that arise in the absence of a specifically formu-
ignorance
procedures
statistical
these pro-
all
Clinical investigators, beset with insecurities their
explored in
is
and a "winner" has not yet emerged.
debating the respective merits of posals, 7
were described previous-
inference based on the null hypothesis, which
Uncertainties of
1.
of
trouble
not entirely
is
defects
scientific
of the statistician, the editor, the reviewer, or the granting agency. If the level demanded for "significance" is 0.05 or lower and the P value that emerges is 0.06, the investigator may be ready to discard a well-designed, excellently conducted, thoughtfully analyzed, and scientifically important experiment, because it failed to cross the Procrustean boundary demanded for statistical approbation.
The widespread acceptance of these rigid categories of "statistical significance" is a lamentable demonstration of the credulity with which modern scientists will abandon biologic wisdom in favor of any quantitative ideology that offers the specious allure of a mathematical replacement for sensible thought.
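The dichotomy criticized above can be illustrated with a short sketch. It is not taken from the original essay: the function names and the two-cohort counts (30 of 100 versus 20 of 100) are hypothetical, and the test used is an ordinary pooled two-proportion z test chosen only for simplicity. The sketch shows how a P value is collapsed into one of two labels by a fixed boundary, and how a confidence interval for the observed difference carries information that the label discards.

```python
import math


def two_proportion_test(e_a, n_a, e_b, n_b):
    """Compare two cohort rates e_a/n_a and e_b/n_b.

    Returns (difference, two-sided P value, approximate 95% interval).
    The P value uses the pooled z test; the interval uses the unpooled
    standard error. Both are large-sample approximations.
    """
    r_a, r_b = e_a / n_a, e_b / n_b
    diff = r_a - r_b
    pooled = (e_a + e_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    se_diff = math.sqrt(r_a * (1 - r_a) / n_a + r_b * (1 - r_b) / n_b)
    interval = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    return diff, p_value, interval


def verdict(p_value, alpha=0.05):
    """The 'Procrustean' step: reduce a P value to one of two categories."""
    return "significant" if p_value < alpha else "nonsignificant"


# Hypothetical counts, used only for illustration.
diff, p_value, interval = two_proportion_test(30, 100, 20, 100)
print(f"difference = {diff:.3f}, P = {p_value:.3f}, verdict = {verdict(p_value)}")
print(f"approximate 95% interval: {interval[0]:.3f} to {interval[1]:.3f}")

# Two P values on either side of the boundary receive opposite labels.
print(verdict(0.049), verdict(0.06))
```

The last line makes the point of the passage: P values of 0.049 and 0.06 describe nearly the same evidence, yet the fixed boundary assigns them to opposite categories.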
loose ends
remain.
still
No
attention has
been given to research projects that are performed as methodologic explorations, rather than as clinical surveys and experiments. Such projects include studies of observer variability, establishment of
Extrapolation
10.
MANEUVER ^SUBSEQUENT-.
INITIAL,^
._.
©STATE©
*(Z)
1
!
STATE
(D
>—©
implementation of "double-blind" techniques in circumstances where "blindness" seems either difficult or impossible to attain.
Important methodologic issues
management
or in
clude:
prognostic stratification,
in
tactics
arrangement
chronologic
appraisals of "intrusion,"
The
transstratifieation.
elsewhere. This all
the time lor reviewing
is
the objective- of the research clearly
Was
Specified?
stated
the
objective
im-
plemented by procedures that carried out those specifications? Were the comparative maneuvers properly chosen and allocated?
Was
the population adequately identified
and assessed for bias in intake? Were the data obtained by suitable methods, classified with stipulated criteria, and analyzed
Were
appropriately?
population?
and accounted
W ere 7
What
been but briefly mentioned, and only scant
If
of these
answers,
formed a tific
If
for
the results suitably
else
might have gone
questions receive satis-
we have probably
terms
other
statistical
phrases
that
designated improperly in
name
are:
Aldiough
this
and
of sig-
have already been cited, many have been omitted. Among
nificance
often
are
or concept
normal, control, regression, negative
result,
standard error, and sampling error.
All of these topics are available for con-
sideration in
and may ultimately appear
(
our future sessions here.
References 1.
Anscombe, F.: Sequential medical book review, J. Amer. Statist. Ass.
not,
we may have
•
3.
Armitage,
outline
now been
of
per-
biostatistical
many
a
58:365-
Blackwell
Bakan, D.: The
medical
Scientific
trials,
test of significance in
Psychol.
research,
Bull.
Ox-
Publications.
psycho-
66:423-437,
1966.
Berkson, survival
J.,
and Gage, R. for
rates
P.:
Calculation of
Mayo
cancer,
Clin.
Proc.
25:270-286, 1950. 5.
Boen,
J.
R.:,
P
values
versus
standard errors in reporting data,
208:535-536, 1969. (Letter 6.
concluded,
Sequential
P.:
1960,
logical
4.
•
trials;
383, 1963. 2.
Bradley, tests,
architecture has
inadequacies
intellectual
other
per-
formed yet another exercise in statistical numerology, and the next stage in the work should be not an extrapolation, but an architectural return "to the drawing board." •
or
linguistic
valid, reproducible act of scien-
research.
in-
Although some of the
vocabulary.
tical
ford,
all
many
tellectual "pollutants" contained in statis-
wrong? factory
uncertainties about
inductive statistical procedures have
discussion has been given to the
prognostic distinctions of the
stratified for
strategies of
the various forms of
"intrusion" recognized in the analyses?
"spection,"
of
and
the words sample, error variance,
the previous steps:
Was
many
in logic
have received Such issues in-
of data
minimal attention.
only
Now we have reached the last stage. We have made decisions about the numerical distinctions of the data, and we would like to draw conclusions that will enable the results to be extrapolated and used
cri-
choice and validation of indexes, and
teria,
J.
V.:
Englewood
to
means and A. M. A.
J.
editor.)
Distribution-free Cliffs,
N.
J.,
statistical
1968, Prentice-
Hall, Inc. 7. Bross,
D.
J.:
Applications
of
probability:
Subsequent implementation of the objective
Science
pseudoscience,
vs.
Amer.
J.
lems of statistical surveys, Arch. Intern. Med. 123:171-186, 1969.
Statist.
Ass. 64:51-57, 1969. 8.
W.
Cochran,
Amer.
ies,
9. Cornfield,
Matching
G.:
in analytical stud-
C.
determination
A method number
the
of
&
trial,
S.
and
J.,
Ederer,
method
utilization of the life table
J.
13.
hypothesis
of
testing
clinical
in
W.
and Tomiyasu,
J.,
trial
a
of
A
U.:
28.
29. S.,
con-
high in unsaturated fat in preventing complications of atherosclerosis, Circulation 40:(Suppl. 2) 1-63, clinical
trolled
and Massey, F.
J.,
J.,
Intro-
Jr.:
duction of statistical analysis, ed. 3, New York, 1969, McGraw-Hill Book Co., Inc. R.: Clinical judgment, Balti15. Feinstein, A. more, 1967, The Williams & Wilkins Com-
Tolchinsky,
R.
An
E.:
evaluation
B.,
A.,
J.
of
the
D. S.: The field trial: Some on the indispensable ordeal, Bull. N. Y. Acad. Med. 44:985-993, 1968. Freeman, L. C: Elementary applied statistics, New York, 1965, John Wiley & Sons, Inc. Good, I. J.: A subjective evaluation of Bode's law and an 'objective' test for approximate numerical rationality, J. Amer. Statist. Ass.
30.
Grubbs, F. E.: Sample criteria for testing outlying observations, Ann. Math. Stat. 21:27-58,
31. Hacking,
Logic of
I.:
identification
of
rates
epidemiology.
II.
Ann.
disease,
III,
Elementary statistics, Springfield, C Thomas, Publisher.
H.:
1968, Charles
trials,
34.
A.
clinical
epidemiology.
Clinical
R.:
design
of
statistics
therapy,
in
ments, Clin.
statistical analysis of clinical
39:294-310,
Anaesth.
Statistical
L.:
theory:
The
1967. relation-
Clinical biostatistics.
II.
Sta-
& Unwin.
E., and Dowd, Cancer epidemiology: Methods of study, Baltimore, 1967, The Johns Hopkins Press. 36. MacMahon, B., Pugh, T. F., and Ipsen, J.:
35. Lilienfeld, J.
versus science in the design of experi-
tistics
Hogben,
The J.
don, 1957, George Allen
III.
Ann. Intern. Med. 69:1287-1312, 1968. 18. Feinstein, A. R.:
Brit.
ship of probability, credibility and error, Lon-
Med. 69:1037-1061, 1968. 17. Feinstein,
Cam-
Cambridge University
Press.
32. Heath,
The
Intern.
statistical inference,
bridge, England, 1965,
33. Hill, G. B.:
pany. 16. Feinstein, A. R.: Clinical
Pharmacol. Ther. 11:282-292,
A.
M.,
Pedersen,
E.:
Epidemiologic methods, Boston, 1960,
Little,
Brown & Company.
1970. A.
19. Feinstein,
Clinical
R.:
biostatistics.
III.
37. Mainland,
The architecture of clinical research, Clin. Pharmacol. Ther. 11:432-441, 1970. A.
20. Feinstein,
The
Voight,
1950.
W.
Dixon,
The
F.,
64:23-49, 1969.
diet
1969. 14.
R.
thoughts
trials,
Chron. Dis. 19:857-882, 1966. Pearce, M. L., Hashimoto, S.,
Dayton, Dixon,
Korns,
T.,
M., Hemphill, F. M., Napier,
27. Fredrickson,
sion
role
methods and scien-
Statistical
(Suppl.), 1955.
J.
by Ederer, F., Zelen, M., Shaw, L. W. and Beebe, G. W. ): Biometrics seminar: The
study of Chron. Dis. 20:13-27, 1967.
J.
1954 poliomyelitis vaccine trials summary report, Amer. J. Public Health 45: (Part 2) 1-63
in analyzing
Chron. Dis. 8:699-712, 1958. Cutler, S. J., Greenhouse, S. W., Cornfield, and Schneiderman, M. A. (with discusJ., survival,
12.
and
Maximum
F.:
rheumatic
Boyd, Ltd.
Boisen,
Lancet 2:1357-1358, 1966.
Clinical
E.:
inference, ed. 2, Edinburgh, 1959, Oliver
26. Francis,
of patients to include in a controlled clinical
11. Cutler,
Stern,
prospective epidemiologic
R. A.:
25. Fisher, tific
and Downie, C. C:
J.,
rapid
the
for
and
R.,
105 episodes,
Biometrics 25:617-657, 1969. 10. Clark,
A
fever:
Discussions
H. O., Kempthorne, O., and Rubin, H.:
A.
effects of recurrent attacks of acute
The Bayesian outlook and its by Geisser, S., Hart-
J.:
application. lev,
24. Feinstein,
Public Health 43:684-691, 1953.
J.
69
architecture
of
biostatistics.
research
clinical
Pharmacol.
Clin.
tinued),
Clinical
R.:
Ther.
IV.
fication of
Chron.
R.: The pre-therapeutic classico-morbidity in chronic disease, J.
Dis., 1970.
(
C.
R.:
II.
The
J.
The epidemiology clinical
course:
temporal demarcations, 123:323-344, 1969. 23. Feinstein,
A.
R.,
and
A.,
and Schimpff,
trated
ficiency
40.
Data, decisions, and Intern.
1963,
statistics,
W.
B.
Saunders
M., and Shulman,
L.
E.:
Determi-
by systemic
in
retrospective
studies,
design
Amer.
J.
efJ.
Epidem. 91:111-118, 1970. Mosteller, F., and Tukey, J. W.: Data analysis, including statistics, in Lindzey, G., and Aronson, E., editors:
Med.
illus-
erythematosus,
lupus
Chron. Dis. 1:12-32, 1955. 39. Miettinen, O. S.: Matching and
of cancer therapy.
Arch.
Elementary medical
nation of prognosis in chronic disease,
In press.
22. Feinstein, A. R., Pritchett,
D.:
Philadelphia,
Company.
11:595-
A.
2,
38. Merrell,
(con-
610, 1970. 21. Feinstein,
ed.
gy,
ed.
2,
Handbook
Reading,
of social psycholo-
Mass.,
1968,
Addison-
Wesley, Inc. H.:
Spitz,
demiology of cancer therapy.
I.
The
epi-
Clinical prob-
41.
Neyman, J., and Pearson, E. S.: On the problem of the most efficient tests of statistic-:!
The architecture
70
of cohort research
A
Roy. Soc,
hypotheses, Philos. Trans.
Rheumatic fever
231:
A
289-337, 1933. 42. Pike,
M. C, and Morrow, of
analysis
demiology, 43.
R.
J.
Prev. Soc.
in
clinical
epi-
in children
epidemiologic laxis,
and adolescents: study
of
subse-
streptococcal infections, and
sequelae.
V.
Relationship
of
the
rheumatic fever recurrence rate per streptococ-
Med. 24:42-44,
1970.
cal infection to pre-existing clinical features oi
Remington, R. D.: How mam experimental subjects? Or, one good question deserves an-
5) 58-67, 1964.
other, Circulation 39:431-434, 1969. 44.
quent propln
Statistical
studies
patient-control Brit.
H.:
long-term
Rozeboom,
\V.
\V.:
The
hypothesis significance
fallacy
test,
Nonparametric
S.:
behavioral
sciences,
New
the
null-
Psychol. Bull. 57:
statistics
for
1956,
York,
the
Mc-
Graw-Hill Book Co., Inc. Statistical 46. Snedecor, C, and Cochran, \V. methods, ed, 6. Ames, Iowa, 1967, Iowa :
State University Press. 47. Taranta,
Wood,
A.,
II.
Weinberg,
F.,
Ann.
Intern.
Med. 60:(Suppl.
Taube, A.: Matching in retrospective studies; sampling via the dependent variable, Acta Soc. Med. Upsal. 73:187-196, 1968. Administration Co-Operative Uro49. Veterans logical Research Croup: Treatment and sur\ i\ ill
Surg,
Feinstein,
A.
R.,
Tursky, E., and Simpson, R.:
of patients with cancer of the prostate,
(aiicc.
50. Yerushahny,
Obstet. J.,
124:1011-1017, 1967.
and Palmer, C.
methodology of investigations of tors
E.,
patients,
48.
l
416-428, I960. 45. Siegcl,
the
in
chronic
27-40, 1959.
diseases,
J.
E.:
On
the
etiologic fac-
Chron.
Dis.
10:
CHAPTER 6

Sources of 'transition bias'

The idea of a cohort is constantly used as a basic tactic in biostatistics. When we contemplate statistics derived from epidemiologic studies of cause of disease, from clinical studies of the course and treatment of disease, or from ontogenetic studies of normal growth, the fundamental scientific logic depends on the investigation of a cohort. A cohort is presumably the source of statistical data dealing with such phenomena as the appearance of lung cancer in cigarette smokers, the occurrence of thrombophlebitis in recipients of the "pill," the survival rates after different forms of treatment for acute myocardial infarction, the development of vascular complications in diabetes mellitus, or the anticipated weight gain in the first year of life. In all of these circumstances, we follow a group of people to see what happens to them after they have been exposed to something: an alleged cause of disease, the action of an established disease, the intervention of a therapeutic agent, or the course of time itself.

Despite the many concepts that depend on this type of scientific reasoning and biostatistical interpretation, the idea of a cohort is usually undefined or defined inconsistently in most statistics textbooks. In many books 1-6, 10, 23-27, 28, 32-36, 37 that are often used for teaching or for reference in medical activities, the word cohort does not appear in the index, and seems absent from the text. In two new biostatistical textbooks, cohorts are described with verbal illustrations but are undefined, and the illustrations are based, in one instance, 34 on a life-table analysis of mortality, and in the other instance, 35 on a study of weight gain. During a nonexhaustive search of the literature, I was able to find the term cohort defined in only two books that had "statistics" (or some congener) in the title. An additional set of definitions was available (as might be expected) in several textbooks of epidemiology. The diverse concepts expressed in these six sources of definition are noted in Table I.

What seems to be consistent in all these definitions is the principle of longitudinal observation of a group of people, followed forward (or "prospectively") in time. What seems inconsistent are the purposes for which the group is being followed, the specifications for inclusion in the group, and the reference date from which the follow-up period begins. In two definitions, the use of a cohort is limited to studies of cause of disease, but in the other definitions the purpose of the research is unrestricted.

This chapter originally appeared as "Clinical biostatistics. X. Sources of 'transition bias' in cohort statistics." In Clin. Pharmacol. Ther. 12:704, 1971.
Table I

Author: Definition of a cohort

Doll:
  "In the prospective or cohort study, a record is made of the occurrence of an event, suspected of being the cause of the lesion, and the subject is followed forward to see whether the effect is or is not produced." (p. 73)

Fox, Hall, and Elveback:
  Three definitions are offered or implied:
  "Groups of persons born in the same year or five-year period" (p. 191)
  "Related persons (comprising two or three generations) whose past disease experience can be well documented and where future disease experience can be observed" (p. 197)
  "Population segments born at the same time" (p. 261)

MacMahon, Pugh, and Ipsen 29:
  "The investigation over time of an identified group of individuals" (p. 16)

MacMahon and Pugh 30:
  "(The cohorts) . . . are defined in terms of characteristics manifest prior to the appearance of the disease under investigation (and) are observed over a period of time to determine the frequency of the disease among them." (p. 207)

Mainland 11:
  "Groups of individuals born in the same year or within a few years of each other" (p. 142)

Morris 33:
  At least two definitions are offered:
  "The clinical follow-up, over many years if necessary, of whole 'populations' of suitable individuals" (p. 131)
  "A group of people born in a defined period is called a cohort; and study of the future history of such a group, of what happens to them, is called 'cohort analysis'." (p. 232)
Several definitions require that the cohort members be similar in age, but other definitions contain no such chronologic demands. Since the textbooks of statistics and epidemiology have either failed to define a cohort or have offered discordant definitions, we should not be surprised to find that the cohort concept has been used with many logical discrepancies. My object in this discussion is to consider the principles of scientific logic that are suitable for cohort statistics, and to note some of the scientific errors that may occur when the principles are violated.

A. Objectives in cohort research

Since the validity of biostatistical data may be destroyed by bias in the initial selection or subsequent examination of a population, a crucial feature of cohort research is the way a cohort is chosen and maintained. When a single cohort is appraised for general extrapolation, or when two cohorts are compared for the effects of the maneuvers to which they were exposed, the investigator (or reader) must constantly beware of unrecognized bias in the procedures used to assemble the members and to obtain the data of the cohorts.

We can contemplate some of the necessary logic and potential sources of bias if we recall the sequence of

INITIAL STATE → MANEUVER → SUBSEQUENT STATE

that was previously described 17 as a basic guide to the architecture of clinical research. With this sequence in mind, we can consider the purposes for which cohorts are assembled, and the ways in which bias can enter the results.
A cohort can be assembled to investigate at least four different types of maneuver: ontogenetic, pathogenetic, pathogressive, and interventional. In an ontogenetic survey, the maneuver is time itself, and the effect to be noted is the growth or development of a normal person during a selected period of time. In a pathogenetic survey, the maneuver is exposure to an agent that allegedly causes a disease, and the effect is the development of that disease. The maneuver can be selected by the exposed person (as in cigarette smoking), imposed by "nature" (as in airborne infection, natural disasters, and "normal" or degenerative changes with age), or derived from medical experiences (as in diverse "iatrogenic" ailments). In a pathogressive survey, the maneuver is exposure to an established disease, and the effect is the change in the affected person's condition as the disease takes its course.

The activity that is called therapy consists of an interventional maneuver intended to change what might happen after exposure to a causal agent or during the course of a disease. The effect is determined by noting, in prophylactic therapy, whether an anticipated occurrence is prevented, and in remedial therapy, whether an observed manifestation is altered. Thus, in a survey of therapy, the investigator determines what happened to a cohort of patients who have already been treated according to the ad hoc decisions of individual clinicians and patients; and in a trial of therapy, the investigator can assign the treatment according to a prearranged experimental plan.

One of the main purposes of therapeutic surveys is for investigators to discern the appropriate types of data, indexes, and stratifications needed for satisfactory design and evaluation of experimental therapeutic trials. 12, 15 Although an ideal cohort for such surveys would consist of "untreated" people in whom "natural history" could be observed, such "untreated" cohorts are seldom attainable, particularly for the major chronic diseases that constantly receive diverse forms of treatment, including the decision to give "no treatment." Accordingly, a modern investigator studies natural history by making a nil hypothesis. He assumes that no mode of treatment may have been effective; he then combines the results of all patients, regardless of treatment; and he studies the "clinical course" in the cohort obtained by the combination. 18 Having determined indexes of accomplishment and demarcations of different prognostic strata for the entire group of patients, regardless of treatment, the investigator can then appraise the results of treatment within the strata, and can use the results for designing future trials.

A cohort can thus be studied for the observation of cause, course, or intervention. For a cause-cohort, the objective is to determine whether a particular disease develops after exposure to an agent allegedly causing the disease. For a course-cohort, the objective is to determine the path of nature in the normal development of a healthy person or in the outcome of an established disease, with or without treatment. For an intervention-cohort, the objective is to determine whether a particular maneuver alters what would otherwise occur in the course of normal growth, exposure to disease, or course of disease.

The key issue of scientific logic in cohort research is the validity of the comparisons that are performed internally or externally. In studies of cause and intervention, where the effects of different maneuvers are contrasted, the main comparison takes place "internally" among the cohorts who receive those maneuvers. At least two cohorts must be assembled: one that is exposed to the particular causal or interventional maneuver under consideration, and another cohort that remains either unexposed or exposed to some alternative maneuver. Thus, in patients with disease D, if treatment T_A for cohort C_A gives better results than treatment T_B for cohort C_B, we would first want to be sure that the two cohorts were similar enough so that the treatment, rather than a difference in cohorts, could be held responsible for the difference in the results. After this "internal" comparison is completed, we would want to perform an "external" comparison
to extrapolate the results to the larger population from which the cohorts were derived. For this purpose, we would want to ascertain the similarity of cohorts C_A and C_B to other patients with disease D. If the cohorts were sufficiently representative of the diseased population, we might be able to conclude that T_A is better treatment for people with disease D, and not just for the group of patients who constituted cohort C_A. This type of external projection is often the only form of "comparison" in studies of course, where maneuvers are not compared and only a single cohort is observed.

B. Sources of bias in cohort statistics

The results of an observed cohort are usually expressed statistically as a ratio, or "rate."* The denominator of this ratio contains the number of people initially in the cohort. The numerator contains the number of those people in whom the chosen target event occurred at a later date. For example, if thrombophlebitis develops in 3 of 1,500 women receiving the "pill," the thrombophlebitis rate is 3/1,500 or 0.2%; if 60 of 200 people with a particular cancer survive for three years, the three-year survival rate is 60/200 or 30%. As a general expression, if the number of people in cohort C_A is n_A, and if the target event later occurs in e_A people, the rate of the target event, R_A, is e_A/n_A.

*I shall not attempt to differentiate among the terms ratio, rate, and proportion. The distinctions are seldom honored, even by people who know the differences, and the three terms are used interchangeably, with rate being most popular.

Because of the forward way in which a cohort is followed, this statistical ratio has unique logical distinctions that are not present in any other types of statistical data. For the statistical ratios in all other types of "samples," the numerators and denominators can be assembled in diverse manners and can be related to each other in a variety of ways. In cohort statistics, however, the people whose target events are counted in the numerators are a temporally delayed subset of the same people who were previously counted in the denominators. The numerators "evolve" from the denominators.

This temporal-subset distinction of cohort statistics provides a role not merely for the numerator and denominator of the statistical ratio, but also for the virgule (or "fraction line") that separates numerator from denominator. The virgule "contains" the time interval that elapses from initiation of the maneuver in the denominator population until the appearance of the target events counted in the numerator. The virgule thus "includes" several crucial features of a cohort: the initiation of the maneuver, the performance of the maneuver, and the observation of the population thereafter.

When these biological and logical distinctions are violated, the temporal changes implied by cohort statistics can become distorted or biased. One fundamental type of bias, which can be called "chronology bias," arises when the cohort is assembled as a cross-section of people in whom the maneuver may have begun at different times in their clinical course, so that the virgule does not represent the inception of the maneuver for each member of the cohort. This type of bias is so important and widespread in cohort statistics that it will receive separate discussion in the next paper of this series. A second fundamental type of bias, which can be called "transition bias," arises from problems and difficulties in various aspects of the way that people are transferred from their anonymity in a general population to their prominence in the data for a statistical cohort. The diverse sources of transition bias are the topic for the rest of the discussion here.

C. The problems of transition bias
In order to contribute to the statistics of a cohort, each of its members must undergo six major populational transfers: from the general population to the base population of people with the particular condition under study in the denominator; from that base population to the parent population from which the cohort (or cohorts) is (are) derived; from the parent population to the candidate population that is eligible for "admission" to the cohort; from the candidate population to the particular cohort that has been exposed to a maneuver and that becomes counted in the denominator; from the state before inception of the maneuver to a postmaneuveral state in which the target event may occur; and from occurrence of the target event to the detected target that is counted in the numerator.

The first three of these transfers occur before the cohort is assembled, and are not expressed in the cohort data. The members of a general population must have or must develop the particular condition that brings them into the base population under consideration. If the condition is a state of health, as in studies of cause of disease, then the base population consists of those members of the general population who are in the appropriate state of health. If the condition is a state of disease, as in studies of pathogressive course or therapeutic intervention, the base population consists of those members of the general population who have developed the appropriate state of disease. Thus, if the maneuver under study is the effect of cigarette smoking on healthy adults, the base population consists of those people in the general population who are healthy adults. If the maneuver under study is the effect of surgical treatment for angina pectoris, the base population consists of the anginal members of the general population.

Since an entire base population can seldom be studied, investigators usually choose their cohorts from a parent population. The parent population contains those members of the base population who reach a milieu that provides the opportunity for the members to be investigatively observed. This milieu is often a clinical setting, such as a hospital, but it also can be the type of epidemiologic setting achieved with a field survey or with mailed questionnaires. Thus, if the cohort statistics are based on treatment of a disease at a particular hospital, the parent population consists of the hospitalized patients with that disease; if the cohort is obtained by mailing questionnaires to a group of healthy adults to ask about their smoking habits, the parent population consists of the people who have been sent the questionnaire.

Not all members of a parent population enter into cohort statistics. Because of various types of eligibility criteria, some or many members of a parent population may be excluded from the candidate population that enters the cohort(s). Thus, in choosing treatment for a particular disease, physicians may exclude patients who are too sick, too uncooperative, or otherwise unsuitable for the proposed therapy. In studies of cause of disease, the people who have received a questionnaire determine their own eligibility for the candidate population by deciding to complete and to return the questionnaire. After these exclusions have been completed, the members of the general population have successively negotiated the various transfers that lead them to the candidate population that enters a cohort.

The maneuvers associated with an intervention-cohort are usually assigned ad hoc by a physician or planned by an investigator; the maneuvers associated with a cause-cohort have often been chosen by the members of the cohort and are noted from their responses in a questionnaire or interview.

Since the denominator of cohort statistics is not demarcated until the candidate population has been associated with the maneuver, the transfers from general population to base population, to parent population, to candidate population are not accounted for in cohort data. Nevertheless, these transfers profoundly affect the extrapolation of the results noted in cohort statistics. Unless the candidate population adequately represents its antecedent populations, the cohort data may not be validly applicable to any group
The architecture
76
of cohort research
people beyond the particular persons w ho were observed. The "internal" comparison of the cohorts may be valid within
of
sisted
exclusively of patients with metas-
tases,
we would immediately The
the comparison as unfair.
recognize contrast of
the candidate population, but a bias in the
treatment would be biased because the two
assembly of the candidate population may prevent the results from being extrapolated to any members of the "external" population in the world beyond the immediate cohorts. For this reason, the final interpretation of cohort statistics depends on crucial features of populational transfers that are not noted or contemplated in any of the numerical data that constitute the statistical results. The problems entailed in detecting and reducing the bias created by these pre-cohort transfers will be reserved for discussion later in this essay.

After the candidate population has been assembled, the remaining types of populational transfer are cited in the statistical ratios and rates, and can be contemplated directly. These transitions can be biased by factors affecting the denominator, the numerator, or the virgule of the statistical ratios. In the denominators, bias is created unless the cohorts compared to assess the effects of a maneuver are shown to have equal susceptibility to the target event before the maneuver begins. In the virgule, bias is created unless the cohorts have equal performance of the compared maneuvers. In the numerators, bias is created unless the cohorts have equal opportunity for the target event to be detected when it occurs. Major disparities in any of these three features (susceptibility to target, performance of maneuver, and detection of target) can produce a transition bias that will impair the scientific validity of the comparisons.

1. Equal susceptibility to the target event. Suppose the expected three-year survival rate for patients with a particular cancer is 70% in those with a localized tumor, and 20% in those with metastatic disease. Now suppose we wanted to compare the results of surgical treatment versus radiotherapy. If the surgical cohort consisted exclusively of patients with localized cancer, and if the radiotherapy cohort contained only metastatic patients, the two cohorts would be biased because they had unequal prognostic expectations before treatment was applied.

Now suppose we had a mixture of localized and metastatic patients in the two cohorts. The cohorts would still be biased unless the mixture was proportionately equal in the two strata (or subgroups) that compose the cohorts. For example, let us assume that neither surgery nor radiotherapy has any effect on the natural course of the cancer, but suppose the surgical cohort contained 85% localized patients and 15% metastatic patients, whereas the radiotherapy cohort contained 40% localized and 60% metastatic patients. The survival rates in the two cohorts would be as follows:

Surgery cohort: (.85)(.70) + (.15)(.20) = .595 + .030 = 62.5%.

Radiotherapy cohort: (.40)(.70) + (.60)(.20) = .28 + .12 = 40%.

Thus, although the two treatments had no effect on the course of the cancer, the disproportionate mixture of localized and metastatic patients would make the survival rate in the surgery cohort seem more than 1½ times higher than that in the radiotherapy cohort. If we were unaware of prognostic differences between localized and metastatic patients, and if we were unaware of the disproportionate mixture of these two groups of patients in the cohorts, we would conclude erroneously that surgery was substantially better treatment than radiotherapy.

The hazard of this type of error is the reason for recognizing prognostic stratification as a crucial feature of cohort research. If a cohort can be divided into strata of people with distinctly different prognoses (i.e., susceptibility to the target event), then the compared cohorts will be biased unless they have similar proportions of the prognostically disparate strata.
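The arithmetic of the surgery and radiotherapy example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `cohort_rate` and the dictionary layout are assumptions introduced here, not part of the essay. The stratum survival rates (70% and 20%) and the cohort mixtures (85/15 and 40/60) are those of the example.

```python
# Illustrative sketch; names are hypothetical, numbers come from the example in the text.

def cohort_rate(mix, stratum_rates):
    """Expected rate in a cohort: stratum rates weighted by the cohort's mixture."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("stratum proportions must sum to 1")
    return sum(mix[s] * stratum_rates[s] for s in mix)

# Three-year survival by stratum, assumed to be unaffected by either treatment.
survival = {"localized": 0.70, "metastatic": 0.20}

surgery_mix = {"localized": 0.85, "metastatic": 0.15}
radiotherapy_mix = {"localized": 0.40, "metastatic": 0.60}

surgery_rate = cohort_rate(surgery_mix, survival)
radiotherapy_rate = cohort_rate(radiotherapy_mix, survival)

print(f"surgery: {surgery_rate:.1%}, radiotherapy: {radiotherapy_rate:.1%}")
print(f"ratio: {surgery_rate / radiotherapy_rate:.2f}")
```

Within each stratum the two cohorts have identical survival in this sketch, so any gap between the total rates arises from the mixture alone.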
The mathematics of this situation can be demonstrated, with the aid of some simple algebra, as follows: Suppose that people with a condition or disease, D, can be divided into two strata, $S_1$ and $S_2$, with corresponding expected rates, $r_1$ and $r_2$, for the target event. (An example would be the division of cancer patients into a localized group, with expected 70% survival, and a metastatic group, with expected 20% survival.) Now suppose that we assemble cohorts A and B as a mixture of patients from the two strata, $S_1$ and $S_2$. In cohort A, let $k_A$ be the proportion of patients from $S_1$; $1-k_A$ will then be the proportion of patients from $S_2$. For cohort B, the corresponding proportions for $S_1$ and $S_2$ will be $k_B$ and $1-k_B$. The expected rate of occurrence of the target event in cohort A will then be

$$R_A = k_A r_1 + (1-k_A) r_2 = k_A (r_1 - r_2) + r_2.$$

For cohort B, the rate will be

$$R_B = k_B r_1 + (1-k_B) r_2 = k_B (r_1 - r_2) + r_2.$$

If the proportionate distribution of the two strata were the same in both cohorts, so that $k_A = k_B$, the two cohorts would have similar expectations, with $R_A = R_B$, regardless of the individual values for $r_1$ and $r_2$. Alternatively, if the two strata, $S_1$ and $S_2$, were prognostically similar, with identical target rates of $r_1 = r_2$, the terms containing $(r_1 - r_2)$ would be zero, and the values for $R_A$ and $R_B$ would be equal, regardless of any disproportions in the values for $k_A$ and $k_B$. But if the two strata have unequal target rates, so that $r_1 \neq r_2$, and if the strata are disproportionately distributed, so that $k_A \neq k_B$, the results for $R_A$ and $R_B$ will differ. By subtracting the entities noted in the preceding equations, we get

$$R_A - R_B = (k_A - k_B)(r_1 - r_2).$$

Thus, in the preceding example for cancer patients, $r_1 - r_2$ = .70 - .20 = .50 and $k_A - k_B$ = .85 - .40 = .45. The difference $R_A - R_B$ = (.50)(.45) = .225 = 22.5%, which is the difference between the 62.5% and 40% noted earlier for the two cohorts.

The general principle here can be stated as follows. If $\Delta_k = |k_A - k_B|$ is the difference in proportional distribution of the two strata in two main cohorts, and if $\Delta_r = |r_1 - r_2|$ is the difference in the rates of the target event for those two strata, then the baseline difference in expected rates for the two main cohorts will be the product $\Delta_k \Delta_r$. The value of $\Delta_k \Delta_r$ is the bias introduced when two cohorts contain disproportionate amounts of patients from two distinctively different prognostic strata.*

*For two cohorts comprising people from three distinctively different prognostic strata, with target rates $r_1$, $r_2$, and $r_3$, the calculations would be as follows: Let $k_A$ be the proportion of $r_1$-type patients and $m_A$ be the proportion of $r_2$-type patients in Cohort A. The proportion of $r_3$-type patients will be $1-k_A-m_A$. For Cohort B, the corresponding values will be $k_B$, $m_B$, and $1-k_B-m_B$. The expected target rates would be: for Cohort A, $R_A = k_A r_1 + m_A r_2 + (1-k_A-m_A) r_3$; for Cohort B, $R_B = k_B r_1 + m_B r_2 + (1-k_B-m_B) r_3$. By subtraction, the difference in cohort rates is $R_A - R_B = (k_A - k_B)(r_1 - r_3) + (m_A - m_B)(r_2 - r_3)$. Thus, if $\Delta_k$ and $\Delta_m$, respectively, represent $|k_A - k_B|$ and $|m_A - m_B|$, and if $\Delta_{r_1}$ and $\Delta_{r_2}$, respectively, represent $|r_1 - r_3|$ and $|r_2 - r_3|$, the bias can be specified as $\Delta_k \Delta_{r_1} + \Delta_m \Delta_{r_2}$. If only two prognostic strata are under consideration, the bias reduces to the $\Delta_k \Delta_r$ noted previously.

The consideration of different prognostic strata is important not just for the issue of bias in compared cohorts, but also to avoid misleading conclusions about the results of maneuvers that have opposite effects on different prognostic strata within the same cohort.¹⁵

For example, returning to the cancer population described earlier, suppose the surgery cohort and the radiotherapy cohort each contained 50% localized and 50% metastatic patients. Suppose, however, that surgery raises the survival rate by an increment of 15% in the localized patients and lowers survival by a decrement of 15% in the metastatic group; whereas radiotherapy has exactly the reverse effect. Thus, the survival rate after surgery would be 85% in patients with localized cancer and 5% for metastatic cancer; whereas radiotherapy would reduce survival to 55% in localized patients and raise survival to 35% in metastatic patients. The survival rate in the surgery cohort would then be

(.85)(.50) + (.05)(.50) = (.425) + (.025) = 45%.

The survival rate in the radiotherapy cohort would be

(.55)(.50) + (.35)(.50) = (.275) + (.175) = 45%.

Consequently, if we looked only at the
total survival rates, ignoring a prognostic stratification, we would conclude erroneously that surgery and radiotherapy had similar effects in the treatment of cancer, although the actual effects were completely opposite in the two prognostically different strata.

These problems in prognostic stratification can affect any type of cohort research, regardless of whether the study deals with intervention, cause, or course. In surveys of intervention, a prognostic stratification of cohorts is needed to avoid the bias that can arise when physicians assign treatment selectively. For example, if surgery is preferentially chosen for "operable" patients with localized cancer and is seldom offered when the cancer is metastatic, a surgical cohort will inevitably be biased for comparison with a cohort that receives radiotherapy or any other mode of therapy in which patients are not preselected for their "operability."

In pathogenetic surveys of cause of disease, the issues of prognostic stratification are more difficult to discern and to identify, mainly because epidemiologists have generally obtained so little information on the subject. The bias to beware is produced by patients' self-selection of the "causal" maneuver. Suppose that people who are "tense" have a lower life expectancy than people who are not. Suppose also that tense people are more likely to become cigarette smokers than people who are not "tense." Compared to a cohort of nonsmokers, a cohort of cigarette smokers may thus be disproportionately large in its quota of "tense" people. The subsequent comparative reduction in life expectancy of the smokers might then be due to the reduction caused by "tension" rather than by smoking.

In surveys of course of disease, a prognostic stratification is necessary to avoid misleading results in comparison of target rates from one era to the next. Thus, with better techniques of "early" cancer detection, a cohort of patients in 1970 might contain many more localized cancers than a cohort found in 1940. The better survival rates in the 1970 cohort might then reflect earlier stages of detection rather than improved methods of treatment.

2. Equal performance of maneuvers. Another type of bias that is also commonly overlooked in cohort statistics is caused by inequalities in the performance of compared maneuvers. For example, in many surveys of treatment for acute myocardial infarction, the patients who received anticoagulants were also treated with elastic stockings, early ambulation, and chair rest; whereas the "control" group generally was treated with bed rest alone. Consequently, the differences in the physical therapy, rather than the presence or absence of anticoagulants, may have been responsible for differences in the outcome of the two cohorts.

Another example of this type of bias would be the comparison of two different surgical operations, of which the first is performed by a highly skilled surgeon working with excellent anesthesiology support, whereas the other operation is done by a less skillful surgeon working in a suboptimal anesthesiology environment. In such circumstances, the better short-term outcome in patients who received the first operation might have little to do with the particular distinctions of the surgery itself.

A particularly important problem in treatment that must be self-administered by a patient (such as diet or oral medication) is caused by selectivity in the decisions of patients to maintain or reject the prescribed treatment. Suppose the natural rate of improvement for a particular condition is 30%, and suppose this rate is raised to 70% when patients receive either treatment A or treatment B. Suppose further, however, that treatment A is difficult to maintain (because of discomfort, poor taste, or some other feature) and is abandoned by 50% of the people who start it, whereas treatment B is more easily maintained and is continued faithfully by 90% of the patients. Suppose finally that we have not investigated the fidelity of maintenance for the two treatments, and that
we are unaware of these distinctions. Our results would show the following rates of improvement:

For Cohort A, $R_A$ = (.30)(.50) + (.70)(.50) = .15 + .35 = 50%.

For Cohort B, $R_B$ = (.30)(.10) + (.70)(.90) = .03 + .63 = 66%.

If we considered only the total outcome in these two rates, we would falsely conclude that treatment B is inherently more effective than treatment A, when in fact the two treatments had equal potency. The difference in total outcome was due to differences in the maintenance of treatment, not in the inherent efficacy. If treatment A were reorganized in a more convenient manner that made its maintenance easier, the results might become just as good as those obtained with treatment B.

A practical example of this problem occurred during studies of antibiotic prophylaxis to prevent streptococcal infections and recurrences of rheumatic fever in patients with previous attacks of rheumatic fever.²⁸ For patients receiving monthly injections of long-acting benzathine penicillin, the attack rates of both infections and recurrences were substantially lower than those for patients who received daily doses of oral penicillin. Since the injection of penicillin ensured its receipt, and was almost always accomplished in the research clinic or in the patient's home, the fidelity of prophylaxis could be assured in the "injection" cohort. For the "oral" cohort, however, the daily ingestion of medication could not be supervised. Accordingly, in order to determine whether the superior results obtained with the injections were due to pharmacology or to maintenance, special procedures were established¹⁸ to classify the fidelity of maintenance for the oral regimen as excellent and not excellent. When the injection group was found to have better results than the excellent oral group, the distinction could be attributed to pharmacology.

Even when the maneuver is assigned randomly by the investigator, rather than self-selected by the cohort, the patients who choose to maintain the maneuver (or who are able to maintain it) are a self-selected group, and the results may be biased by unrecognized distinctions that were involved in the selection process. For example, let us again assume that people who are "tense" are likely to die earlier than people who are "relaxed." Let us further assume that "relaxed" people will be much more likely to maintain a special dietary program than people who are "tense." Now suppose we perform a clinical trial of a special dietary program intended to reduce atherosclerosis. Having learned about the bias involved in failing to assess fidelity of a maneuver's maintenance, we carefully perform such assessments, and we then exclude from consideration the people who did not maintain the diet faithfully. Our group of faithful dieters will now be composed mainly of "relaxed" people, because the "tense" people will not have adhered to the dietary program. Our "control" group, which did not have to adhere to a diet, will be composed of a mixture of "tense" and "relaxed" people. Even if the diet has no effects on mortality, the resultant death rate will nevertheless be lower in the faithful dieters than in the "control" group, because the dieting cohort will contain a predominance of "relaxed" people, with low death rates.

3. Equal detection of target. Another type of transition bias, discussed earlier²⁰ in this series of papers, has been almost totally ignored in epidemiologic research, and acts as a major flaw in the validity of cohort statistics dealing with causes of disease. Many acute ailments, such as streptococcal infections and thrombophlebitis, and many chronic diseases, such as lung cancer and coronary artery disease, are not detected unless the patient receives regular medical surveillance and/or special diagnostic testing. For example, as noted in necropsy examinations, about 20% of patients with lung cancer²⁵ and about 50%² of patients with coronary artery disease had not received the appropriate diagnosis during life.

Because so many major ailments can escape diagnostic identification, an important source of bias in compared cohorts is an unequal opportunity to have the target event detected. To avoid such bias, the
two cohorts should be relatively similar in the frequency, intensity, and scope of medical examinations, particularly those techniques by which the target event is detected. For example, if lung cancer is the target event sought in a comparison of smokers and nonsmokers, the results may be biased if chest x-ray examinations were routinely performed more frequently in the smokers. If coronary artery disease is the target sought in a comparison of laborers and executives, the results may be biased if electrocardiograms were routinely performed more frequently in the executives.

To illustrate the quantitative aspects of this type of bias, let us assume that thrombophlebitis is detected in 80% of a stratum that consists of women who have frequent medical examinations, but is detected in only 40% of a stratum of women examined infrequently. Let us assume further that a cohort of women taking the "pill" contains 75% of women from the first stratum and only 25% from the second, whereas a cohort of women using other forms of contraception contains 45% women from the first stratum and 55% from the second. Finally, let us assume that r, the rate of development of thrombophlebitis, is the same for the "pill" as for other contraceptive agents. The reported rates of thrombophlebitis in the two cohorts would be as follows:

In the "Pill" cohort, (.75)(.80)r + (.25)(.40)r = (.60)r + (.10)r = 70% r.

In the "Non-pill" cohort, (.45)(.80)r + (.55)(.40)r = (.36)r + (.22)r = 58% r.

Thus, the reported rate of thrombophlebitis in the "pill" cohort would be substantially higher than the rate in the "non-pill" cohort, even though the "pill" had no differential effect on thrombophlebitis.

The algebra of this "tilted target" situation can be shown as follows: Let $d_1$ and $d_2$ be the different rates of detection for the target in strata 1 and 2. Let $k_A$ and $k_B$ be the proportions with which stratum 1 occurs in Cohorts A and B, so that stratum 2 occurs in the proportions $1-k_A$ and $1-k_B$. Let r be the true rate of the target event, which is the same in strata 1 and 2. We would then find the following. For Cohort A,

$$R_A = k_A d_1 r + (1-k_A) d_2 r = k_A r (d_1 - d_2) + d_2 r,$$

and for Cohort B,

$$R_B = k_B d_1 r + (1-k_B) d_2 r = k_B r (d_1 - d_2) + d_2 r.$$

By subtraction,

$$R_A - R_B = r[(k_A - k_B)(d_1 - d_2)].$$

If $\Delta_k$ is $|k_A - k_B|$ and $\Delta_d$ is $|d_1 - d_2|$, this difference can be expressed as $r \Delta_k \Delta_d$, a value which indicates the bias produced by disparate detection of the target event in two strata distributed disproportionately in two cohorts. Thus, in the immediately preceding example, $\Delta_k$ = .75 - .45 = .30; $\Delta_d$ = .80 - .40 = .40; and $\Delta_k \Delta_d$ = (.30)(.40) = 12%, so that the bias is 12% of r, which was the difference between 70% and 58%.

These problems are an "incidence" counterpart of the fallacies discussed in the "prevalence" bias noted many years ago by Berkson. Berkson pointed out that when the prevalence of certain manifestations was compared in patients hospitalized with different diseases, the results could be biased by disparities in the rates at which people with the compared diseases had been admitted to the hospital. In the fallacy under discussion here, the incidence of target events in cohort groups can be biased by disparate rates of examination procedures for detecting the occurrence of the target. Although both Berkson's point and the one cited here have generally been neglected in epidemiologic research, the lack of attention does not make such bias disappear and cannot provide scientific assurance that it is absent.

Such attention to bias in target detection is usually not a major necessity in therapeutic trials, because the investigator can generally make advance plans for the target event to be assessed equally in the different cohorts subjected to treatment.
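The "tilted target" arithmetic for the thrombophlebitis example can be sketched in Python. The function name `reported_rate` and the choice of r = 1 (so that every result reads as a multiple of the true rate) are assumptions made here for illustration; the detection rates and cohort proportions are those of the example.

```python
# Illustrative sketch; names are hypothetical, numbers come from the thrombophlebitis example.

def reported_rate(k, d1, d2, r):
    """Reported rate when a proportion k of the cohort lies in stratum 1
    (detection rate d1), the remainder lies in stratum 2 (detection rate d2),
    and the true rate r is the same in both strata."""
    return k * d1 * r + (1 - k) * d2 * r

d1, d2 = 0.80, 0.40            # detection: frequently vs. infrequently examined women
k_pill, k_other = 0.75, 0.45   # share of each cohort in the frequently examined stratum
r = 1.0                        # true rate set to 1, so results are multiples of r

pill = reported_rate(k_pill, d1, d2, r)
other = reported_rate(k_other, d1, d2, r)
bias = r * abs(k_pill - k_other) * abs(d1 - d2)

print(f"pill: {pill:.2f} r, non-pill: {other:.2f} r, difference: {pill - other:.2f} r")
print(f"r * delta_k * delta_d = {bias:.2f} r")
```

Setting `d1` equal to `d2`, or `k_pill` equal to `k_other`, makes the difference vanish, which is the condition of equal detection that the comparison requires.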
When the research is performed as a survey, however, the investigator cannot arrange for the target event to have been equally detected, and must beware that bias has occurred in the rates of detection for different cohorts. In intervention-cohorts, the therapeutic circumstances may have made the target event more likely to be detected with one form of treatment than with another. Thus, in an example cited elsewhere¹⁸ for a clinical trial of antistreptococcal prophylaxis, streptococcal infections were more likely to be detected in patients receiving injected penicillin than in those receiving oral penicillin. In cause-cohorts, phenomena associated with a "causal agent" may also have led to biased detection of the target disease. Thus, the frequent examination of "pill"-takers may produce a spuriously higher rate of thrombophlebitis than in the less frequently examined women who use other means of contraception. In course-cohorts, the different rates of disease¹⁴ noted from one era to the next or in different geographic regions may be a reflection merely of the detection bias due to differences in diagnostic criteria and methods of case finding.

D. The reduction of transition bias

In certain research circumstances (as described later), some of the hazards just cited can be eliminated or minimized. In most forms of cohort statistics, however, various forms of bias are inevitable, and an investigator's main hope for scientific validity is to attempt to reduce or "adjust" the bias.

1. Allocation of maneuver. In the experimental circumstances of an interventional trial, the investigator begins with a defined candidate population. He creates the "treated" and "control" groups as he assigns the successive patients to be exposed or nonexposed to the maneuver under study. By allocating the maneuver according to principles of randomization, the investigator can attain three important goals: he can eliminate systematic bias in the allocation of maneuvers; and he can ensure that the compared cohorts will be concurrent in both populational source and calendrical time. The randomization procedure serves to remove systematic bias in allocation while assigning the compared maneuvers to cohorts derived from the same candidate population, during the same calendar interval.

The main systematic bias to be feared in these circumstances is the effect of previous transfer decisions that may render the candidate population unrepresentative of the patients to whom the results will be extrapolated. For example, when a parent population at a hospital is "screened" for admission to a therapeutic trial for a particular disease, certain patients may be excluded because they are "uncooperative," too ill to participate, or afflicted with major co-morbid ailments. As a result of these exclusions from the parent population, the candidate population will become disproportionately altered in a manner that impairs the validity of extrapolation to other patients with the "same" disease, despite the random allocation of treatment that formed cohorts from the candidate population.

The magnitude of this problem can be reduced if the candidate population is carefully identified, so that the extrapolation can be suitably cautious, and if a "screening log"²⁴ is kept to record the characteristics and outcome of the members of the parent population who were excluded from the trial. Thus, in a trial of porta-caval shunt for patients with esophageal varices, perhaps the most striking result was that the candidate population, regardless of therapy, had a much better outcome than the members of the parent population who were not admitted to the trial.

In contrast to a therapeutic trial, the maneuver in a survey is chosen not by the investigator, but by nature, patients, or the patients' doctors. The investigator's inability to assign the maneuver is, in fact, the main feature that differentiates an experiment from a survey. In both types of research, the investigator can establish "control" groups for comparative purposes, but in a survey the investigator cannot
control (i.e., govern, assign, allocate) the maneuver. Because the exposure or nonexposure to the maneuver is not determined by the investigator, the cohorts of surveys are chosen quite differently from those of experiments. In an experiment, a candidate population is assembled and then divided into cohorts to be exposed or nonexposed to the maneuver. In a survey, the cohorts are determined, before the research begins, by their members having already been previously "assigned" to whatever maneuver they received. Since this assignment was not carried out by random allocation to a candidate population, the investigator must beware that the cohorts were rendered unequal in target susceptibility because of decisions made when the maneuver was selected.

In situations where the maneuver is chosen by doctors, a bias may arise because of pretherapeutic criteria that create prognostic disproportions among patients assigned to different modes of therapy. Thus, in cancer therapy, the criteria for "operability" usually provide a surgeon with patients who have better prognoses than those who are deemed "inoperable." Consequently, even when the cancer is left unresected, the "operable" patients have better outcomes than those who were "inoperable."¹⁹

In situations where the maneuver is chosen by patients, prognostic differences may exist in people who select one type of maneuver in preference to another. Thus, people who choose to smoke cigarettes, to eat low-fat diets, or to use the "pill" may have distinctly different prognostic characteristics from people who do not choose these maneuvers.

Although such bias cannot be avoided in surveys, a scientific investigator will try to reduce its effects by "adjusting" the data. These adjustments consist of performing a prognostic stratification, and then comparing results of different maneuvers in the same prognostic strata. Thus, in a survey of treatment of cancer, the results might be examined in patients who are in "good" or "bad" stages of disease. In a survey designed to determine whether the "pill" predisposes to cervical cancer, the initial gynecologic condition of the cohort groups can be determined. In one recent study,⁸⁹ for example, the women who chose to use the "pill" were found to have substantially more cervical dysplasia than non-pill users before contraceptive agents were begun. For surveys of cigarette smoking or high-fat diets, the familial and ethnic features of the cohorts can be assessed, and the populations can be divided into strata according to such possibly prognostic features as personality, psychic status, ethnic background, and longevity of parents. The effects of the maneuvers can then be compared within similar strata of the cohorts.

The crucial issue in this type of adjustment is a prognostic stratification for the members of a cohort. The creation of the strata depends on the investigator's ability to recognize the particular initial-state properties or "risk factors" that affect prognosis for the target event. These properties cannot be chosen arbitrarily, and must be shown to correlate predictively with the target event.¹⁵ Although certain demographic distinctions (such as age, race, and sex) are often used to establish strata in cohorts, these strata are not necessarily prognostic unless the demographic differences have been shown to affect susceptibility to the target event. Thus, if the prognosis for survival in cancer depended mainly on whether the cancer was localized or metastatic, and if the prognosis was otherwise the same for patients who were old or young, black or white, men or women, our search for bias would be concerned mainly with a possible disproportion between localized and metastatic strata. We would not be particularly concerned about demographic disproportions.

2. Ascertainment of maneuver. In a preplanned experimental trial or in a longitudinal survey, the investigator can
make specific advance arrangements to ascertain the fidelity with which patients adhered to the assigned maneuver. He can then divide the patients according to good or not good compliance, and determine the outcomes in the different strata of "fidelity." These procedures can also be attempted when the survey is based on existing records compiled by other people, although the investigator is less likely to find that the necessary information was obtained and recorded.

In many modern cohorts of trials or preplanned surveys where such ascertainment could readily be achieved, however, investigators have seldom made the required efforts. Adherence to maneuvers has been either omitted from assessment, or assessed only via questionnaires that were not rechecked or evaluated for the reliability of answers. Thus, reports of success in the pharmaceutical treatment of angina pectoris, or in diet-plus-drugs treatment of obesity, are seldom accompanied by data indicating the results according to the fidelity with which the prescribed regimen was maintained. In epidemiologic studies of the outcome of cigarette smoking or different types of dietary or exercise patterns, the investigators have seldom employed repeat questionnaires or other appraisals to check the accuracy of the history of the smoking, dietary, or exercise maneuvers in which the cohorts had engaged.²⁰ In the absence of such ascertainments and subsequent stratifications, the investigators have failed to "control" their cohorts for the hazard of an important source of transition bias.

3. Ascertainment of target detection. A preplanned longitudinal survey also allows the investigator to check the intensity of procedures for target detection in compared cohorts. Although such assessments are often incorporated into the protocol of therapeutic trials, the attempt to discern or exclude bias in equality of opportunity for target detection has been almost wholly absent in epidemiologic research. In almost all of the modern epidemiologic surveys of etiology of disease in cause-cohorts, no effort has been made to search for possible bias in target detection. The investigators in many of these surveys have usually determined the occurrence of the target event not by an active research procedure, but by the passive "poll-bearing"²⁰ analysis of death certificates.

This type of analysis cannot remove two types of bias that create "tilted targets." The first is that the target event may occur without being recorded on the death certificate. Thus, such ailments as coronary artery disease or lung cancer are target events that can often occur without being detected during life.¹⁴ ²³ ²⁵ Unless the disease has been diagnosed and unless the patient has been specifically noted to have died because of that disease, the death certificate will contain a "false negative" report of the target event. Although the investigators of etiologic cohorts have sometimes checked death certificates for "false positive" diagnoses, no reports have been published of efforts to detect and adjust the bias introduced by "false negative" reports.¹⁴

A second source of bias, even when attempts are made to check "false negative" diagnoses, is an unequal intensity of diagnostic procedures in compared cohorts. The data from death certificates do not indicate whether the compared cohorts were examined in equally assiduous manners. Since most etiologic cohorts engaged in self-selected maneuvers and received diverse forms of medical attention thereafter, the scientific analysis of such death certificates provides no assurance that these unplanned, nonrandom activities were performed without bias. The assurance can be provided only with specific investigations of the intensity of target detection procedures. Such investigations, unfortunately, have not been performed or considered in contemporary epidemiologic strategies.

Although a scientific investigator establishes various types of "control" techniques
to eliminate bias in experimental procedures, epidemiologists have given almost no attention to analogous forms of "control" for many of the major sources of transition bias in the cause-cohorts of etiologic research. A scientific effort to investigate possible inequalities in susceptibility to targets, performance of maneuvers, and detection of targets has been strikingly absent from modern epidemiologic statistics.

4. Representation of antecedent populations. If the compared maneuvers have been randomly allocated, the cohorts will usually be representative of the candidate populations. If the maneuvers were not randomly assigned, the investigator can use prognostic stratification (as described earlier) to help adjust for possible bias in the assignment. Even when the maneuvers were randomly allocated, however, a prognostic stratification is important to discern the bias that may occur as a chance event during randomization, and also to avoid the problems cited earlier when maneuvers have opposing effects in different prognostic strata.

In addition to these comparisons, however, an investigator would like to extrapolate his results beyond the candidate populations from which the cohorts were derived. As soon as he contemplates this extrapolation, however, the investigator is confronted by all of the populational transfers that occurred between the general population and the candidate population. In ideal circumstances, each of these transfers (from general to base to parent to candidate population) should have been performed via random sampling from the antecedent population. In practical reality, however, these goals are so difficult to achieve that random samples are almost never obtained. In therapeutic trials in the past few decades, none of the cohorts represented a random sample of people with the particular disease or clinical condition under treatment.

In the absence of random sampling, many types of bias can enter these populational transfers. For diseased cohorts, the transfer from general population to base population will be affected by medical standards of practice with regard to "early" detection of lanthanic disease in patients who are asymptomatic or who have no complaints related to the disease. The transfer from base population to the parent population found at a particular medical setting will be affected by patients' iatrotropic stimuli,¹² by inter-iatric referral patterns, and by diverse personal, ethnic, social, or economic influences. The transfer from parent population to candidate population will be affected by the stage of severity of the disease, by the severity of co-morbid ailments, and by the patients' willingness to accept the proposed diagnostic and therapeutic procedures. All of these features can alter the way in which a cohort represents its antecedent populations in proportions of people from different strata within the disease.

For healthy cohorts, these transfers can also be associated with various forms of bias. For example, the "healthy people" who form a candidate population by responding to a questionnaire may not all be healthy. Consequently, unless the initial state of each member of the candidate population is checked appropriately, the compared cohorts may contain disproportionate amounts of people who are already ill, and who may even have developed the disease or target event that was to be sought later on. Berkson⁵ has described the fallacy that can occur when
clinical
cohorts are created from disproportionate
For example, none of the cohorts investigated in research on cigarette smoking or the "pill" was obtained by random sampling from a base population of smoki s and nonsmokers, or pill users and non-piJ users. In the statistically designed
numbers of sick and healthy people who have returned questionnaires submitted to
never used research. 20
in
epidemiologic
or
a
general
population.
After
a
"healthy"
candidate population has been assembled
by interviews or questionnaires, some the people
may be
of
rejected because of im-
85
Sources- of 'transition bias
proper or inadequate replies, and
the
if
form a distinctive stratum, the cohort populations may be proportionately altered by the absence of members from rejectees
of these transfers of diseased or
all
common form of prognostic "adjustment" in epidemiologic cohorts is to
stratify
an unequal susceptibility to the target event when the transferred people enter the candidate population that forms the
of
compared
cohorts. Consequently,
in reference to
justment"
based on prognostic
tion
an "adstratifica-
the investigator's best hope for re-
is
ducing the bias that of
validity
may
and
"intake" of cohorts,
during the
arise
for increasing the
subsequent extrapolations. In
order to perform the stratifications,
how-
the investigator must be aware of
ever,
and must have the outcome in
the possible sources of bias
data for testing
suitable
on the purely demographic basis
age,
or sex.
race,
(The frequency
according to age
stratifications
of
respon-
is
one of the uses of the word cohort a particular age group. With such procedures, the results of a cohort may be inspected separately in such demographic strata as old people and young, whites and blacks, men and women, but the psychogenetic risk factors are seldom analyzed probably because the necsible for
—
and would require rigorous investigative efforts to be obtained. An alternative type of demessary data are not readily available
because
ographic "stratification" consists of perform-
prognostic distinctions have generally re-
ing the same investigation in an additional
different
Unfortunately,
strata.
ceived so
quantification
little
the
biology,
clinical
modern
in
necessary
data
are
cohort that has occupational or geographic properties different from those of a group
seldom available, and the necessarv stratifications are seldom performed. Such stratifications are particularlv im-
the same occupation or in the same geo-
portant for the epidemiologic cohorts in-
graphic region
was studied
that
previously.
results of a cohort containing
Thus, the people with
may seem more
extrapolat-
similar results are found in other
vestigated for causes of disease. In clinical
able
cohorts studied for the course or treatment
cohorts with different occupations or geo-
of disease,
the that
an investigator can usually state
diagnostic
pretherapeutic
or
differentiate
his
who have
people clinical
cohorts
the
same
criteria
from other disease.
In
specific
With these
criteria for identification
can often
clearly define the type of populations
In
his results
to
cause-cohorts of epidemiologic however, an investigator begins with a generally healthv group of people and cannot classify them on the basis of an
the
disease
or
previous
experience
with the course of that disease. quently, an epidemiologist larly
The
stratifications
alert
for
the
Conse-
must be particu-
possibility
susceptibility
affect
Even when
to
the target
event.
age, race, sex, occupation, or
geographic region have prognostic correlations, the role of psychic, ethnic, ial
cannot
factors
be
ignored
and
famil-
as
even
greater sources of prognostic bias. Thus,
would apply.
research,
existing
veniently available data.
demographic factors have been shown to
when
stratification, the investigator
which
These demographic stratifications, howbased on arbitrarily chosen, con-
ever, are
he can often use previous
survey data are not available to denote the
and
graphic regions.
experiences to arrive at a suitable
trials,
prognostic stratification even strata.
if
are not necessarily prognostic unless the
therapeutic
'
predisposing factors to such target
events as disease and death. Nevertheless,
healthy people, the main hazard of bias is
.
may be
the most
that stratum.
In
"psychogenetic" features as psychic state, ethnic background, and parental longevity
that
such
in a correlated prognostic stratification for
mortality of healthy people, the longevity of parents, ethnic background, or psychic
anxiety
may be much more
"risk factors"
of sex, race,
important as
than the demographic aspects
and occupation. The persistent
neglect of these psychic and genetic features
is
an additional impairment to valid
architecture of cohort research
The
86
cause-cohorts
the
for
extrapolation
re-
ported in modern epidemiologic statistics. In the clinical populations studied as intervention-cohorts, many important pheare also regularly neglected during "staging" procedures or other attempts at prognostic stratification. Among the
nomena
omitted clinical features that
may
delineate
prognosis are the cluster and sequence of
symptoms, the chronometrv of clinical events, and the co-morbidity of associated
Among
ailments.
may
the "decisional data" that
prognosis but that are usually
affect
ignored in
statistical
assessments are the
iatrotropie stimuli, inter-iatric referral pat-
diagnostic
terns,
criteria,
been discussed pre\ iousK adjust
to
procedure, in
for
bias
in
differential
constitute
pediment
scientific
to
a
and
major
im-
both forms of cohort in
research. E.
"Migration
and the unstable
bias"
cohort
In
all
of the cohort
scribed, the
members
problems
just
of the cohorts
de-
may
have been unrepresentative of their antecedent populations and may have been
compared stable.
unfairly,
Once
its
but each cohort was
members had been
noted,
they persisted in their role as the denominator of anv subsequent particular
citation suggests that the death rate was obtained in a stable cohort, neither the
numerator nor the denominator of the rate came from a stable cohort. At the beginning of 1967, no attempt was made to count the number of people in Connecticut who were 45 to 64 years old; and at the end of 1967, no attempt was made to determine how many of those same people had died. Instead, since the last census count was taken in 1960, the denominator of 638,000 1967 represents an intercensal
in
from people who had been counted seven years earlier in the age group 38 to 57. The numerator represents the counted number of deaths reported in Connecticut in 1967 at ages 45 to 64. Regardless of the accuracy with which the denominator was estimated, the cited death rate does not refer to a stable cohort. During the seven years since the 1960 census, many people who were originally in the 38 to 57 age group had moved away, and other people, who were 45 to 64 years old during 1967, had entered Connecticut from other geographic regions. If the death rate
progress
epidemiologic and clinical
is
the sampling
existing delects in techniques of prognostic stratification
group is cited as based on 6,405 counted deaths having occurred in an estimated 638,000 people.) Although this manner of 1003.9. This rate
estimation, obtained with diverse "adjust-
maneuvers, the
of
effects
the death rate per 100,000 popula-
neces-
is
allocation of maneuver,
in
listings,
tion in this age-regional
people
.
prognostic stratification
Since sary
pretherapeutic
criteria,
and patient acquiescence that have
tions of national vital statistics. 40 (In those
statistics.
form of epidemiologic
however, the data are presented
In one
statistics,
as
though
they were obtained from a cohort, but the
ments,"
among among
;
was the same
emigrants
the
the apparent cohort would be unaffected.
But
if
versa,
were substantially
emigrants
the
healthier
than
the
immigrants,
would be
distorted.
when death
ent geographic regions
the cohort.
For example,
the rates of death are cited in "vital
statistics"
for
a
particular geographic re-
For example, the death rate in Connecticut in 1967 for people aged 45 to 64 years was about 1%, according to tabulagion.
vice
This type of bias must be considered
violated because of migration to and from
This type of "migration bias" can occur
or
the result for the apparent cohort
basic principle of cohort stability has been
when
as
the immigrants, the total rate for
rates are
if
compared at
for differ-
different
eras.
elderly people with a high
death rate move to a region that is regarded as "good" for "old age," the death rate may soar for elderly people in that region. Instead of being regarded as a healthy locality, the region may then be-
come regarded
as unhealthy.
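The arithmetic behind the cited Connecticut rate, and the way migration can distort an apparent cohort, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the figures 6,405 and 638,000 come from the text, whereas the `Stratum` class, the function names, and the counts for stayers, emigrants, and immigrants are hypothetical values invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    """A group of people sharing one migration status (hypothetical structure)."""
    people: int
    deaths: int


def rate_per_100k(deaths: int, people: int) -> float:
    """Death rate expressed per 100,000 population."""
    return 100_000 * deaths / people


def combined_rate(*strata: Stratum) -> float:
    """Rate for a population made by pooling several strata."""
    total_deaths = sum(s.deaths for s in strata)
    total_people = sum(s.people for s in strata)
    return rate_per_100k(total_deaths, total_people)


# Figures quoted in the text: Connecticut, 1967, ages 45 to 64.
cited_rate = rate_per_100k(6_405, 638_000)
print(f"{cited_rate:.1f}")  # 1003.9

# Invented counts: the people first counted are stayers plus emigrants,
# but the region's later statistics describe stayers plus immigrants.
stayers = Stratum(people=500_000, deaths=5_000)
emigrants = Stratum(people=100_000, deaths=500)     # assumed healthier
immigrants = Stratum(people=100_000, deaths=1_500)  # assumed less healthy

followed_cohort = combined_rate(stayers, emigrants)
apparent_cohort = combined_rate(stayers, immigrants)
print(followed_cohort < apparent_cohort)  # True
```

With these invented counts the apparent cohort shows a higher death rate than the people who were originally counted, although no stayer's risk has changed. If the emigrant and immigrant strata are given equal sizes and equal death rates, the two pooled rates coincide, which is the case the text describes as leaving the total rate unaffected.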
There is no readily available method of adjusting for this type of "migration bias." Even if the decennial census of geographic regions could be performed annually, the problem of intra-annual migration would still create difficulties. The migratory shifts in the denominators can be approximated by various methods, 3, 38 but migration is not accounted for in the numerators. One approach would be to arrange for standard death certificates to include data for not only the location of death and the "usual residence" of the deceased, but also an indication of how long the person had lived at the "usual residence" and the duration of time at the place where the person had lived previously. With appropriate analysis of such data, the numerator events could be more accurately demarcated for the denominator cohorts in whom the events presumably occurred.
• • •
Almost all of the problems just described in the "transition" bias of cohort research are caused by an investigator's inability to govern and regulate the lives of free-living human people. The additional data that the investigator obtains for a cohort and the additional stratifications that he performs are intended to compensate for the bias introduced by his lack of "control" over the procedures of sampling, allocation, maintenance, and target detection in human populations.
Another activity performed in cohort research, however, is purely intellectual and is completely "controlled" by the investigator. As the investigator reviews the assembled data for each potential member of the cohort, he must make certain critical decisions about the timing and chronologic sequence of events in that person's life. If the investigator makes these chronologic decisions improperly, he can create a type of bias quite different from the kinds just cited.
The problems of "transition bias" are the result of selections made by nature, patients, and other doctors. The problems can be reduced or minimized by suitable identifications and stratifications, and, when possible, by appropriate randomization and advance planning. The problems of "chronology bias," however, are created by the investigator himself; and the problems cannot be resolved by any of the tactics just noted. There is no way, in fact, to get rid of "chronology bias" because it is irremediable. When it occurs, the investigator has chosen the wrong cohort.
The problems of these distorted cohorts and the difficulties created by several other logical errors in statistical procedures for cohort analysis will be the topic of our next discussion two months from now.
References
1. Bailey, N. T.: Statistical methods for biologists, New York, 1959, John Wiley & Sons, Inc.
2. Beadenkopf, W. G., Abrams, M., Daoud, A., and Marks, R. V.: An assessment of certain medical aspects of death certificate data for epidemiologic study of arteriosclerotic heart disease, J. Chron. Dis. 16:249, 1963.
3. Benjamin, B.: Health and vital statistics, London, 1968, George Allen & Unwin, Ltd.
4. Berkson, J.: Limitations of the application of fourfold table analysis to hospital data, Biomet. Bull. 2:47-53, 1946.
5. Berkson, J.: The statistical study of association between smoking and lung cancer, Mayo Clin. Proc. 30:319-348, 1955.
6. Bradford Hill, A.: Statistical methods in clinical and preventive medicine, Edinburgh, 1962, E. & S. Livingstone, Ltd.
7. Bradford Hill, A.: Principles of medical statistics, ed. 8, New York, 1966, Oxford University Press.
8. Campbell, R. C.: Statistics for biologists, Cambridge, 1967, Cambridge University Press.
9. Cochran, W. G., and Cox, G. M.: Experimental designs, New York, 1950, John Wiley & Sons, Inc.
10. Dixon, W. J., and Massey, F. J.: Introduction to statistical analysis, ed. 3, New York, 1969, McGraw-Hill Book Co., Inc.
11. Doll, R.: Retrospective and prospective studies, chap. 4, in Witts, L. J., editor: Medical surveys and clinical trials, ed. 2, London, 1964, Oxford University Press.
12. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.
13. Feinstein, A. R., Spagnuolo, M., Jonas, S., Kloth, H., Tursky, E., and Levitt, M.: Prophylaxis of recurrent rheumatic fever. Therapeutic-continuous oral penicillin vs. monthly injections, J. A. M. A. 206:565-568, 1968.
14. Feinstein, A. R.: Clinical epidemiology. II. The identification rates of disease, Ann. Intern. Med. 69:1037-1061, 1968.
15. Feinstein, A. R.: Clinical epidemiology. III. The clinical design of statistics in therapy, Ann. Intern. Med. 69:1287-1312, 1968.
16. Feinstein, A. R., Pritchett, J. A., and Schimpff, C. R.: The epidemiology of cancer therapy. II. The clinical course: Data, decisions, and temporal demarcations, Arch. Intern. Med. 123:323-344, 1969.
17. Feinstein, A. R.: Clinical biostatistics. II. Statistics versus science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.
18. Feinstein, A. R.: Clinical biostatistics. V. The architecture of clinical research (concluded), Clin. Pharmacol. Ther. 11:755-771, 1970.
19. Feinstein, A. R.: Scientific defects in the staging of lung cancer, pp. 1005-1011, in Carbone, P. P., mod.: NCI Combined Clinical Staff Conference. Lung cancer: Perspectives and prospects, Ann. Intern. Med. 73:1003-1024, 1970.
20. Feinstein, A. R.: Clinical biostatistics. VII. The rancid sample, the tilted target, and the medical poll-bearer, Clin. Pharmacol. Ther. 12:134-150, 1971.
21. Fox, J. P., Hall, C. E., and Elveback, L. R.: Epidemiology. Man and disease, Toronto, 1970, The Macmillan Company.
22. Garceau, A. J., Donaldson, R. M., O'Hara, E. T., Callow, A. D., Muench, H., Chalmers, T. C., and the Boston Inter-Hospital Liver group: A controlled trial of prophylactic porta-caval shunt surgery, New Eng. J. Med. 270:496-500, 1964.
23. Goldstein, A.: Biostatistics, New York, 1964, The Macmillan Company.
24. Greenberg, B. G.: Conduct of cooperative field and clinical trials, Amer. Statist. 13:13-17, 28, June, 1959.
25. Heasman, M. A.: Accuracy of death certification, Proc. Roy. Soc. Med. 55:733, 1962.
26. Irvington House Group: Rheumatic fever in children and adolescents. A long-term epidemiologic study of subsequent prophylaxis, streptococcal infections, and clinical sequelae, Ann. Intern. Med. 60:(Suppl. 5) 1-129, 1964.
27. Kendall, M. G., and Stuart, A.: The advanced theory of statistics, New York, 1963, Hafner Publishing Co.
28. Lewis, A. E.: Biostatistics, New York, 1966, Reinhold Publishing Corporation.
29. MacMahon, B., Pugh, T. F., and Ipsen, J.: Epidemiologic methods, Boston, 1960, Little, Brown & Company.
30. MacMahon, B., and Pugh, T. F.: Epidemiology. Principles and methods, Boston, 1970, Little, Brown & Company.
31. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
32. Mendenhall, W.: Introduction to probability and statistics, ed. 2, Belmont, Calif., 1967, Wadsworth Publishing Co., Inc.
33. Morris, J. N.: Uses of epidemiology, Baltimore, 1964, The Williams & Wilkins Company.
34. Remington, R. D., and Schork, M. A.: Statistics with applications to the biological and health sciences, Englewood Cliffs, N. J., 1970, Prentice-Hall, Inc., pp. 340-341.
35. Schor, S.: Fundamentals of biostatistics, New York, 1968, G. P. Putnam's Sons, Inc., p. 222.
36. Snedecor, G. W., and Cochran, W. G.: Statistical methods, ed. 6, Ames, Iowa, 1967, Iowa State University Press.
37. Sokal, R. R., and Rohlf, F. J.: Biometry, San Francisco, 1969, W. H. Freeman & Co.
38. Spiegelman, M.: Introduction to demography, revised ed., Cambridge, 1968, Harvard University Press.
39. Stern, E., Clark, V. A., and Coffelt, C. F.: Contraceptive methods: Selective factors in a study of dysplasia of the cervix, Amer. J. Public Health 61:553-558, 1971.
40. Vital Statistics of the United States, volume II. Mortality. Part A. Tables 1-12 and 6-4, Washington, D. C., 1969, United States Department of Health, Education and Welfare.
CHAPTER
7
Sources of 'chronology bias'
This chapter originally appeared as "Clinical biostatistics XI. Sources of 'chronology bias' in cohort statistics." In Clin. Pharmacol. Ther. 12:864, 1971.
In the previous paper of this series, 13 a cohort was defined as a group of people who are followed forward in time to observe the effects of a maneuver to which they have been exposed. The results are usually expressed statistically as a ratio, or rate, in which the denominator of the cohort contains the number of people exposed to the maneuver, and the numerator contains the number of those people who later developed a particular target event.
The type of target event that is sought in the "subsequent state" and counted in the numerator depends on the maneuver under investigation. In a cause-cohort, the maneuver is exposure to an alleged cause of disease, and the target event is the development of that disease. In an intervention-cohort, when the maneuver is an agent of prophylactic therapy, the target event is a condition that is to be prevented (such as a disease or death); and when the maneuver consists of remedial therapy, the target is a change in an existing condition. In a course-cohort, the maneuver is exposure to time or to the course of an established disease, and the target event may be a change in an existing condition or the development of a new one.
The "initial state" of the people contained in the denominator of a cohort also depends on the investigated maneuver. In studies of cause of disease or of certain types of prophylactic intervention, the cohort is deliberately chosen to be initially healthy, or at least free from the disease that is the target event. In studies of course of a disease or of interventional maneuvers that may alter the course, the cohort members must all be shown to have that disease. In certain studies of ontogenetic development, the cohort members must be initially healthy, whereas in other studies of effects associated with the passage of time, the cohort is taken from the members of a general population in a particular geographic region.
For scientific conclusions about the results, cohorts are regularly compared both "internally" to assess effects of the different maneuvers to which they were exposed, and "externally" to extrapolate the results to the antecedent populations from which the cohorts were derived. As discussed previously, 13 these comparisons are often made defective by "transition bias" in the formation and subsequent observation of the cohorts. The validity of external extrapolations may be impaired if the compared cohorts received disproportionate quantities of members from different prognostic strata during the transfer of each member from general population to base population to parent population to the candidate population from which the cohorts were selected. After the candidate populations are associated with the maneuvers that determine the cohorts, the validity of internal comparisons may be impaired if the contrasted cohorts are unequal in their initial prognostic susceptibility to the target event, in the performance of the compared maneuvers, or in opportunity for detection of the target event.
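The ratio described above, with exposed people in the denominator and later target events in the numerator, can be written out as a small Python sketch. The `Cohort` class, its field names, and the counts used here are hypothetical and serve only to make the definition concrete; they are not taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    """People followed forward after exposure to one maneuver (hypothetical structure)."""
    maneuver: str
    exposed: int        # denominator: people exposed to the maneuver
    target_events: int  # numerator: exposed people who later developed the target event

    def rate(self) -> float:
        """Proportion of the exposed people who developed the target event."""
        if self.exposed <= 0:
            raise ValueError("a cohort needs at least one exposed member")
        if not 0 <= self.target_events <= self.exposed:
            raise ValueError("the numerator must be drawn from the denominator")
        return self.target_events / self.exposed


# An "internal" comparison of two cohorts exposed to different maneuvers
# (invented counts).
treated = Cohort(maneuver="prophylactic agent", exposed=200, target_events=30)
untreated = Cohort(maneuver="no agent", exposed=200, target_events=50)
print(treated.rate() < untreated.rate())  # True
```

The check that the numerator never exceeds the denominator reflects the requirement that the counted events come from the same people who were counted as exposed, which is the condition that a migrating population can violate.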
These diverse sources of transition bias can be adjusted or avoided if suitable data are obtained to permit analysis of the cohorts within different strata of people who are similar in prognosis, in performance of maneuvers, and in detection of targets. For prognostic analysis, the stratifications should be chosen according to clinical, psychic, genetic, or other factors that have been found to be distinctly predictive for the target event. The strata may be ineffectual if selected arbitrarily according to demographic data that are conveniently available but that have no cogent prognostic correlations.
Another type of transition problem is the "migration bias" that can occur in rates of target events for the general population of a particular geographic region. The effects of immigration and emigration make the population an unstable cohort, and the subsequent total rates may be biased unless they can be suitably adjusted for the exchanges of migration, or unless the rates in the two migrant strata are shown to be similar to those of the stable stratum.
These different problems in transition bias are an inevitable concomitant of the effort to study free human populations whose patterns of movement and personal decisions cannot be governed by the investigator. The investigator's best hope in these circumstances is to be aware of the potential for bias, to assemble appropriate data for checking the bias, and to analyze the results with suitable adjustments.
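The value of analyzing cohorts within prognostic strata can be illustrated with a short Python sketch. All counts below are invented for the illustration, and the maneuver labels, stratum labels, and function names are hypothetical. The two maneuvers are given identical event rates inside each stratum, but the cohorts are assembled with different proportions of the two strata.

```python
from collections import defaultdict

# (maneuver, prognostic stratum, members, target events); all values invented
records = [
    ("A", "good prognosis", 800, 80),
    ("A", "poor prognosis", 200, 100),
    ("B", "good prognosis", 200, 20),
    ("B", "poor prognosis", 800, 400),
]


def crude_rates(rows):
    """Event rate per maneuver, ignoring prognostic strata."""
    totals = defaultdict(lambda: [0, 0])  # maneuver -> [events, members]
    for maneuver, _stratum, members, events in rows:
        totals[maneuver][0] += events
        totals[maneuver][1] += members
    return {m: events / members for m, (events, members) in totals.items()}


def stratum_rates(rows):
    """Event rate per (maneuver, stratum) pair."""
    return {(m, s): events / members for m, s, members, events in rows}


crude = crude_rates(records)
by_stratum = stratum_rates(records)

# The crude comparison makes maneuver B look worse than maneuver A.
print(crude["A"] < crude["B"])  # True

# Within each stratum the two maneuvers have the same rate.
print(by_stratum[("A", "good prognosis")] == by_stratum[("B", "good prognosis")])  # True
print(by_stratum[("A", "poor prognosis")] == by_stratum[("B", "poor prognosis")])  # True
```

In this constructed case the whole crude difference comes from the unequal prognostic composition of the cohorts. A stratification on a factor with no prognostic correlation would leave the same crude difference unexplained, which is why the choice of stratifying variables matters.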
A quite different type of bias in cohort research, however, is not an inevitable feature of human life, and is entirely an intellectual creation of the investigator. This type of bias can be produced as the investigator deals with the complex subtleties of the diverse chronologic features that must be considered when cohorts are assembled and analyzed.
A. Chronologic issues in cohort research
At least four different aspects of time are involved in cohort research. One of these, which depends on the investigator, is the chronologic relationship between the events that transpire in the cohort and the time when the investigator collects the data used for research. The other three features of time depend on the members of the cohort. These features refer to a person's position in relation to the calendar, to his own life, and to his exposure to whatever maneuver is under investigation.
1. Investigative time. In studying a cohort, an investigator can begin his research before or after the members of the cohort undergo their exposure to the maneuver and its consequences. With either type of temporal approach, the cohort is followed forward from its initial state to its subsequent state, but if the investigator begins the research beforehand, he can collect the research data according to an advance plan, whereas if he begins the research afterward, he must collect his data from information observed and recorded by other people. Thus, if an investigator in 1950 decides to follow a cohort assembled in 1951-1952, he can try to devise special techniques for examining the members of the cohort and for obtaining the data. On the other hand, if an investigator in 1970 decides to follow that same 1951-1952 cohort, he must get his research data from existing records or from whatever other information had been noted during 1951-1952 and thereafter.
The temporal distinction refers to the time when the phenomena observed in the cohort are converted from their original descriptive data into the data collected for the research. This conversion can be planned before or after those phenomena take place. In both instances, the same cohort is followed in the same forward direction, and if the original data have been well observed and recorded, the investigator who does the research afterward should find the same results as the one who plans beforehand. For example, if an investigator wants to know the one-month mortality rate for premature babies at a particular hospital during 1951-1952, and if the hospital records are satisfactory, the investigator should presumably get the same results regardless of whether he planned his survey during 1950 or 1970.
The words prospective and retrospective
are sometimes applied to this temporal distinction in data collection, but these same words have also been applied, with quite different scientific connotations, to the forward or backward direction in which a population is followed. Such statistical distinctions are particularly important in longitudinal studies of etiology of disease. The investigations are often called "prospective" if the investigator performs the type of forward research described here, by following cohorts who are exposed or not exposed to a cause of disease, and by determining the rate at which the disease develops in the two cohorts. The etiologic research is often called "retrospective" if the investigator does not study cohorts. Instead, he begins with a group of diseased people and a contrived "control group" and he then looks backward to ascertain the rate at which the two groups had been exposed to the alleged cause.
The words prospective and retrospective thus have two distinctly different connotations according to the way in which they are used for investigative time or for investigative direction. In reference to collection of data, the -spective words deal with the order of timing for two different sets of events: whether the plan for assembling the research data precedes or follows the occurrence of the phenomena studied in the research. In reference to the direction of temporal sequence, the -spective words deal with the forward investigation of members of a cohort toward the subsequent target event, or the backward investigation of a group of people in whom the target event has already occurred.
In the first usage of -spective, the main investigative hazard of "retrospection" is in scientific quality of data. Because the investigator could not "control" the original methods by which the population was observed and described, he may find that the recorded data contain diverse omissions, imprecisions, or other inadequacies. In the second usage of -spective, the main investigative hazard of "retrospection" is in scientific logic. The sequential direction of phenomena has been completely reversed, so that the investigator looks not from cause toward effect, but vice versa. 8 The problem of scientific quality in data is always present in human research, regardless of the direction in which a population is followed, but when the population is followed backward the investigator inverts the logic of science and creates new problems in bias that transcend any of the "transitional" difficulties discussed previously, or the "chronologic" issues to be discussed here. These additional types of "inversion bias" are too complex for further consideration now, and will be reserved for a later paper in this series.
The main point to be noted here is that something must be done about the incompatible disparities in the two different uses of the words prospective and retrospective. The ambiguities created by the disparate connotations are confusing and destructive to clear thought. To eliminate these ambiguities, I propose that the -spective terms be eliminated from both types of usage. We will then need two new sets of words to denote the timing of data collection and the directional pursuit of a population.
Several alternate terms have already been suggested 1 for differentiating prior versus later plans for collection of research data. My own preference is to base this distinction on the Latin legere (to gather, collect, assemble) with its past participle lectus. If the research data are collected with advance planning before the events take place, the project can be called prolective; if the research data are collected from information that was observed and recorded before the project began, the project can be called retrolective.
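The proposed distinction in the timing of data collection can be stated as a simple rule, sketched here in Python. The function name, the enumeration, and the use of calendar years as inputs are assumptions made for the illustration; the years 1950, 1970, and 1951 follow the example of the 1951-1952 cohort given in the text.

```python
from enum import Enum


class Timing(Enum):
    PROLECTIVE = "prolective"
    RETROLECTIVE = "retrolective"


def data_collection_timing(plan_year: int, first_event_year: int) -> Timing:
    """Classify a project by when its data-collection plan was made.

    A plan made before the studied events take place is prolective;
    a plan made afterward is retrolective. A plan made in the same year
    as the first events is treated as retrolective here, which is a
    simplifying assumption of this sketch rather than a rule from the text.
    """
    if plan_year < first_event_year:
        return Timing.PROLECTIVE
    return Timing.RETROLECTIVE


# The investigator of 1950 and the investigator of 1970 follow the same
# 1951-1952 cohort in the same forward direction.
print(data_collection_timing(1950, 1951).value)  # prolective
print(data_collection_timing(1970, 1951).value)  # retrolective
```

The rule says nothing about the direction in which the population is followed. Both projects in the example pursue a cohort forward, so the direction of pursuit would need a separate label, as the text goes on to discuss.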
As for the directional
pursuit of a population, the
name
cohort
already exists and seems satisfactory for describing a group of people
who
are fol-
lowed forward. An appropriate name for a group of people who are followed backward will be considered later in this series of papers, during a more extended ess;r.
architecture of cohort research
The
92
on the hazards of this type of inverted logic. The remainder of the current discussion is reserved for problems in the chronology of cohorts, regardless of whether the research is prolective or retrolective.

2. Secular time. Secular time, or calendrical time, refers to a particular date on a calendar, or to the interval between such dates. In some usages, the term epoch refers to an individual calendrical date, such as a particular day or year; and the term era is used to refer to the calendrical interval that spans two epochs. Thus, the decade between the epochs of 1920 and 1929 could be called the era of the 1920's. A change that is noted between epochs or between eras is called a secular change, and, if the change shows a monotonic direction, a secular trend. Thus, the annual mortality rate for tuberculosis has shown a downward secular trend during the 20th century. The survival rate for patients with Hodgkin's disease19 allegedly shows a rising secular trend during the past few decades, but the survival trend during the same era for breast cancer18 has allegedly shown little or no rise.

A single cohort is usually uni-secular in that the pre-maneuver initial state of all of its members (as counted in the denominator of the statistics) occurred during a limited calendrical interval. To search for secular change in the effects of the same maneuver, results are often compared in uni-secular cohorts from different epochs or eras. Such comparisons are particularly common for course-cohorts and intervention cohorts. Thus, a uni-secular cohort from 1940-1949 might be compared with a uni-secular cohort from 1960-1969 to determine whether mortality rate has risen for coronary artery disease in the general population of a geographic region, or whether survival rate has improved for treatment of a particular cancer. The main hazard of such comparisons arises from the different forms of transition bias described previously.13 Although the investigator may believe that time itself is the only feature that has changed, other secular changes may have affected the proportions of different prognostic strata in the assembled denominator, the detection of target events in the numerator, and many important aspects of ancillary performance in whatever maneuver is under investigation. Thus, the mortality rate for a particular disease at a particular region might be altered not by the progress of secular time but by different aspects of migration in the regional cohorts and by different techniques of detecting the disease. The survival rate for the treatment of cancer may have changed not because of secular improvements in anti-neoplastic therapy, but because newer techniques of diagnosis have altered the prognostic strata exposed to treatment, and also because newer methods of treating co-morbid ailments have prevented the deaths caused by diseases other than cancer.

When the "same" maneuver is compared in cohorts of different secularity, therefore, the comparison may seem overtly "fair" in its chronologic aspects, but subtle sources of bias may arise from the transition problems cited earlier in transmigration, in prognostic strata, and in detection of target.

When cohorts of different secularity are used to compare different maneuvers, rather than the same maneuver, the comparison is usually promptly recognized as unfair because a change associated with secular transitions might be attributed erroneously to a change in the corresponding maneuvers. For example, because of temporal improvements in diagnosis and in ancillary therapy, patients with acute myocardial infarction today generally have better survival rates than such patients treated before the introduction of anticoagulants in the early 1930s. Consequently, when the outcome of a non-anticoagulant-treated pre-1930 cohort is compared with that of an anticoagulant-treated post-1950 cohort, the more recent cohort will usually show better results, but the difference might have little or nothing to do with the anticoagulants.10 For this reason, compared cohorts should always be uni-secular unless the maneuver under consideration is secularity itself.
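The point about altered prognostic strata can be illustrated with a minimal sketch. All of the numbers below are hypothetical and are chosen only to show the arithmetic: the stratum-specific survival rates are held identical in both eras, and only the proportions of the strata in the denominator are changed.

```python
def overall_survival_rate(stratum_sizes, stratum_rates):
    """Size-weighted survival rate of a cohort made of prognostic strata."""
    survivors = sum(n * r for n, r in zip(stratum_sizes, stratum_rates))
    return survivors / sum(stratum_sizes)


# Hypothetical stratum-specific survival rates, identical in both eras,
# so the maneuver itself has not improved.
rates = [0.80, 0.30]            # favorable stratum, unfavorable stratum

# Hypothetical denominators: newer diagnostic techniques shift patients
# toward the favorable stratum in the later era.
era_early = [200, 800]
era_late = [600, 400]

rate_early = overall_survival_rate(era_early, rates)   # (160 + 240) / 1000
rate_late = overall_survival_rate(era_late, rates)     # (480 + 120) / 1000

print(f"early era: {rate_early:.2f}, late era: {rate_late:.2f}")
```

In this sketch the overall rate appears to rise between eras although no stratum-specific rate has changed, which is why the comparison needs to be made within strata.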
3. Life time. The age of each member of
93
Sources of 'chronology bias'
a cohort may be regarded as life time or generational time. A group of people who were born in the same year, or within a few years of each other, may be called a generation cohort or an age-specific cohort. When an ontogenetic occurrence, such as weight gain in newborn infants, is measured from the date of birth, the cohort obviously contains members whose initial-state age is similar. In many other circumstances, however, a cohort containing people of different ages may be divided into strata (or sub-cohorts) that are "age-specific." This type of age-stratification can be performed for at least three different purposes.

One type of age-stratification is used mainly for descriptive separation, as a way of "matching" the age of cohorts that are being compared for a maneuver other than age. For example, if we want to know whether people who lived in Connecticut in 1967 had better survival rates than those who lived in Florida, the maneuver under comparison is living in Florida versus living in Connecticut. To help get a fair comparison and to reduce the effects of transmigration, we might then contrast the results in different age groups such as strata of people below age 45, 45-64, and above 64 in those two regions during 1967.

There are two main hazards of bias in such comparisons. The first is that the size of each cohort (or stratified sub-cohort) in the denominator is estimated rather than counted from decennial census data. The estimate for an intercensal year might be quite inaccurate if a substantial imbalance existed between immigration and emigration. A second hazard is not the numerical accuracy of the denominator, but the bias caused by disparate death rates in the two migrant populations.13 Thus, if many healthy people in the age group from 45-64 moved to Connecticut from Florida, and if the same number of sick people in that age group moved to Florida from Connecticut, the denominators of the age-cohorts would be unaltered, but a major change could take place in the numerator data for deaths. Connecticut might then appear "healthier" than Florida, but only because of "migration" bias in the rates of death for the migrant members of the cohorts.

Another type of age-stratification is deliberately prognostic, to separate distinctions of target susceptibility among people of different ages. Thus, if young and elderly adults with maturity-onset diabetes mellitus have disparate survival rates, an age-stratification would be needed to evaluate the mortality effects of different forms of treatment. The main problem in this type of age-stratification is that it may not be cogently prognostic.10 As discussed previously,13 age is often a favorite variable for statistical stratifications because the data about age are usually reliable and easily available, but age is often not as prognostically important as certain clinical and co-morbid features of a diseased cohort, or psychic and genetic features of a healthy cohort. In such circumstances, a stratification based on age alone may be statistically impressive but biologically trivial. Unless the rate of the target event shows cogent changes with the different age strata, the "age-specific" rates may produce statistical fertilization without scientific fruit.

A particularly common type of age-stratification is performed for circumstances, such as the determination of general mortality rates, in which the cohort is a general population rather than a selected group of people who are either healthy or diseased. Since age is an important prognostic factor for mortality in a general population, the crude death rate (i.e., the number of observed deaths per number of people in that population) may be misleading because it contains no provision for the proportionate age distribution of the population. Consequently, although populations A and B have the same age-specific death rates in young and old people, population A, which contains a large proportion of elderly people, might have a substantially higher crude death rate than population B, which contains predominantly younger people. To avoid the bias contained in the crude death rates, we would therefore want to compare the rates for A and B within younger and older strata of the population.
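The contrast between crude and age-specific death rates for populations A and B can be checked with a short sketch. The population sizes and death rates are hypothetical, chosen only so that both populations share exactly the same age-specific rates.

```python
def crude_death_rate(age_counts, age_specific_rates):
    """Observed deaths divided by the total number of people."""
    deaths = sum(age_counts[age] * age_specific_rates[age] for age in age_counts)
    return deaths / sum(age_counts.values())


# Hypothetical age-specific annual death rates, the same in both populations.
age_specific_rates = {"young": 0.002, "old": 0.050}

# Hypothetical age distributions: A is predominantly elderly, B predominantly young.
population_a = {"young": 30_000, "old": 70_000}
population_b = {"young": 90_000, "old": 10_000}

crude_a = crude_death_rate(population_a, age_specific_rates)
crude_b = crude_death_rate(population_b, age_specific_rates)

print(f"crude rate A: {crude_a:.4f}, crude rate B: {crude_b:.4f}")
```

With these assumed figures the crude rate for A is several times that for B, although a comparison within the young stratum or within the old stratum shows no difference at all.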
Despite scientific clarity, these comparisons in multiple strata do not have the statistical virtue of assessments based on a single index. Accordingly, in quest of such an index, epidemiologic statisticians regularly "correct" the age-specific rates with an "adjustment" that will allow the individual rates to be added together to form a single value. This type of "correction" is responsible for the "age-adjusted standard rates" that are a mainstay of epidemiologic statistics. The adjustment depends on the distribution of ages and deaths in a selected "standard" population, and the "correction" is either direct or indirect according to the data available for the observed population that is to be "corrected." The details of these two methods are beyond the scope of this discussion, but they can be outlined as follows:

In the direct method, the death rates are known for the age-strata of the observed population; these rates are then multiplied by the proportionate distribution for each corresponding age-stratum of the standard population; and the sum of these products is the corrected death rate. In the indirect method, only the crude death rate and proportionate distribution of ages are known for the observed population: the observed age-proportions are then multiplied by the death rates for each corresponding age stratum of the standard population; the sum of these products is divided into the crude death rate of the standard population; and the result is used as a multiplicative factor that "corrects" the crude death rate of the observed population.

Bradford Hill1 has provided a particularly clear description of the rationale and good numerical illustrations of the procedures for these calculations.

The main problem to be noted here is the bias that is inevitable in both the direct and indirect corrections. In our previous consideration of "transition bias," we noted the difficulties that can arise when disproportions occur in the multiplication of ratios whose products yield the individual values added together to produce a "rate." The same type of sum of products of ratios occurs in these death-rate "corrections." Since the "corrective" ratios depend entirely on the particular population that is chosen to be the "standard," the final results may vary dramatically according to the contents of that population. As Fox, Hall, and Elveback13 have pointed out, "One of the major weaknesses of the adjustment procedure lies in its lack of uniqueness, or in the influence which the choice of a standard population exerts on the outcome of the desired comparison." These authors suggest that "we do not want to compare summary rates at all" and that "no adjustment for age or other characteristic should be undertaken until the age- or other subgroup-specific rates have been studied carefully."

Despite these caveats, the adjustment procedure has not only maintained its current popularity in epidemiologic reports, but seems so well accepted that the authors often do not cite, and are not asked to indicate, which population was used for standardization and whether the method of standardization was "direct" or "indirect." In these circumstances, a scientific reader who searches for documentation of observations is unable to determine what was actually found in the observed age strata. Despite the unskeptical enthusiasm of the investigators who present such transmuted data and the remarkable credulity of the editors who accept the material for publication, the absence of suitable descriptions makes the final published results neither interpretable nor reproducible.

4. Serial time. The fourth type of time to be considered in cohort research can be called serial time or post-exposure time. It is the length of time that has elapsed serially since "zero time," which is the inception of each member's exposure to the maneuver under surveillance. Unlike secular and generational time, which can be readily identified and classified, serial time is often difficult to ascertain.
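Returning briefly to the direct and indirect age adjustments outlined earlier in this section, the two calculations can be sketched as follows. This is only an illustration under stated assumptions: the function names are invented here, every rate and proportion is hypothetical, and the strata are simply "young" and "old."

```python
def direct_adjusted_rate(observed_rates, standard_proportions):
    """Direct method: observed age-specific death rates weighted by the
    age proportions of the standard population."""
    return sum(r * p for r, p in zip(observed_rates, standard_proportions))


def indirect_adjusted_rate(observed_crude_rate, observed_proportions,
                           standard_rates, standard_crude_rate):
    """Indirect method: the observed crude rate multiplied by the factor
    (standard crude rate) / (sum of observed proportions x standard rates)."""
    expected = sum(p * r for p, r in zip(observed_proportions, standard_rates))
    factor = standard_crude_rate / expected
    return factor * observed_crude_rate


# Hypothetical observed population (strata: young, old).
observed_rates = [0.002, 0.050]
observed_proportions = [0.3, 0.7]
observed_crude = sum(r * p for r, p in zip(observed_rates, observed_proportions))

# Two hypothetical choices of "standard" population.
standard_1_proportions = [0.5, 0.5]
standard_1_rates = [0.002, 0.040]
standard_1_crude = sum(r * p for r, p in zip(standard_1_rates, standard_1_proportions))
standard_2_proportions = [0.9, 0.1]

direct_1 = direct_adjusted_rate(observed_rates, standard_1_proportions)
direct_2 = direct_adjusted_rate(observed_rates, standard_2_proportions)
indirect_1 = indirect_adjusted_rate(observed_crude, observed_proportions,
                                    standard_1_rates, standard_1_crude)

print(f"direct, standard 1: {direct_1:.4f}")
print(f"direct, standard 2: {direct_2:.4f}")
print(f"indirect, standard 1: {indirect_1:.4f}")
```

The same observed population receives quite different directly adjusted rates under the two assumed standards, which is the lack of uniqueness noted in the passage quoted from Fox, Hall, and Elveback.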
The ascertainment presents no major problems in an experiment, since the investigator assigns the maneuver to the members of the cohort and can readily determine when the maneuver was applied, what was the initial state of each member before the exposure, and how much time elapsed until the occurrence of subsequent events. When the experimental data are assembled to form the statistical rates of the target event, the investigator can be sure that the virgule (or "fraction line") between numerator and denominator signifies the inception and subsequent performance of the maneuver for each member of the population.

In surveys, however, the investigator does not assign the maneuver, and may have many problems in deciding either what event to regard as the initiation of exposure to the maneuver, or when that event took place. In cause-cohorts, zero time may be the onset of a maneuver such as cigarette smoking whose exact date of initiation can seldom be accurately recalled. In course-cohorts, zero time may be the date of development of a disease whose specific date of onset can seldom be discerned. In intervention-cohorts, a particular mode of treatment may or may not be the first course of therapy for the selected disease.

The problems of defining and recognizing serial time, and the absence of suitable methods for analyzing this crucial chronologic feature of cohort research are responsible for many major scientific defects in the validity of contemporary cohort statistics. These problems will occupy the remainder of this discussion.

B. Zero time and the inception cohort

Since cohort statistics depend on the transition from an initial state to a subsequent state, each member of a cohort must have a chronologic reference point at which the "initial state" is identified, and from which the subsequent follow-up period begins. The characteristics of the initial state at that reference point are used to demarcate the people who are counted in the denominators of the cohort, or in the denominators of any stratifications of the cohort. The chronologic reference point is also used to measure the duration of time elapsed before occurrence of any target events cited in the numerators. Since the reference point has so central a role in the statistical data, its improper choice for the members of a cohort can create major, irremediable sources of bias.

1. The choice of zero time. Because the purpose of cohort research is to observe the effects of a maneuver, the logical choice of a reference date is zero time: the inception of the maneuver. Such a date, which is oriented neither to calendar nor birthday, would depend on the particular time in a person's life span or clinical course at which he became exposed to the maneuver under surveillance. Because of the problems noted earlier, the choice of zero time will differ with different types of maneuver. If the survey is an ontogenetic examination of natural events that occur in post-natal life, the choice of zero time is obvious. It is the date of birth for the members of a generation cohort. For surveys in which the maneuver is exposure to a cause of disease, a suitable zero time would be the beginning of exposure to the alleged causal agent. For surveys or trials in which the maneuver is an intervention with prophylactic or remedial therapy, zero time would be the date of that intervention.

For pathogressive surveys of "clinical course,"11 the choice of zero time is complicated by the examination of "natural history" in a group of diseased patients whose results are combined, regardless of therapy. Since some patients will have been "untreated" and since others will have received several courses of therapy, the choice of zero time should enable the patients to be mutually comparable at a date when therapy is first allowed an opportunity to affect "natural history." Accordingly, an appropriate zero time for each patient in such surveys would be either the date of the first treatment directed at the disease, or the date of the decision to give no specific therapy. Thus, in a study of therapeutic intervention for a particular condition, zero time could be the date at which the intervention began, but in a study of "clinical course" for the same condition, zero time might be either the date of the therapeutic decision against specific intervention or of the first therapeutic intervention.
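The choices of zero time just described can be summarized in a small sketch. The record fields, the maneuver labels, and the function names are hypothetical conveniences introduced here; in particular, taking the earlier of the two candidate dates for a "clinical course" survey is an assumption of the sketch, not a rule stated in the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PersonRecord:
    """Hypothetical dates recorded for one member of a cohort."""
    birth: date
    exposure_start: Optional[date] = None        # e.g., onset of an alleged cause
    intervention_start: Optional[date] = None    # prophylactic or remedial therapy
    first_treatment: Optional[date] = None       # first treatment directed at the disease
    no_therapy_decision: Optional[date] = None   # decision to give no specific therapy


def zero_time(record: PersonRecord, maneuver: str) -> Optional[date]:
    """Return the reference date from which serial time is counted."""
    if maneuver == "ontogenetic":
        return record.birth
    if maneuver == "cause":
        return record.exposure_start
    if maneuver == "intervention":
        return record.intervention_start
    if maneuver == "clinical_course":
        # Assumption: use whichever therapeutic decision came first.
        candidates = [d for d in (record.first_treatment, record.no_therapy_decision)
                      if d is not None]
        return min(candidates) if candidates else None
    raise ValueError(f"unknown maneuver type: {maneuver}")


def serial_time_days(record: PersonRecord, maneuver: str, event: date) -> Optional[int]:
    """Days elapsed from zero time to a target event, if zero time is known."""
    start = zero_time(record, maneuver)
    return (event - start).days if start is not None else None


patient = PersonRecord(birth=date(1900, 5, 1), first_treatment=date(1955, 3, 10))
print(zero_time(patient, "clinical_course"))
print(serial_time_days(patient, "clinical_course", date(1955, 4, 9)))
```

A record whose zero time cannot be determined returns no serial time at all, which mirrors the difficulty of ascertainment that arises in surveys.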
For example, if we wanted to study the effects of radiotherapy in lung cancer, zero time would be the onset of the radiotherapy for each patient. To evaluate the results properly, we might then have to stratify the initial state of the patients according to surgical or other previous therapy, and also according to the "stage" of prognostic anticipations before radiotherapy began. On the other hand, if we wanted to study the clinical course of lung cancer,11 zero time would be the first mode of treatment (selected among surgical exploration, radiotherapy, chemotherapy, or no anti-neoplastic therapy). We would then need to stratify the initial state mainly according to prognostic stages, since therapy had not yet occurred.

For the clinical course of certain ailments, the date of diagnosis is often the same as the date of the first therapeutic decision, since the two events may occur concomitantly. Thus, a diagnosis of rheumatoid arthritis is usually associated with a simultaneous decision about its therapeutic management, and so the date of diagnosis might be an appropriate zero time. In other circumstances, however, the choice of treatment may be delayed after diagnosis until specific pre-therapeutic requirements are evaluated. Thus, the treatment of chronic infections may await the results of tests for the infecting organism's sensitivity to antibiotics, and the treatment of a cancer may be delayed until the possibility and extensiveness of metastasis are determined. In the case of the infection and the cancer, the date of first therapeutic decision is often the best choice of zero time for pathogressive surveys of "clinical course." For certain chronic diseases, the study of "chronicity" may not begin until an acute ailment has led to the detection of the disease. For example, in diabetes mellitus, the patient's initial diagnosis and treatment may occur in the hospital during an episode of acidosis, and the conventional plans of daily regulation at home may not be established until afterward. Consequently, for a survey of the clinical course of diabetic patients first diagnosed in a hospital,17 a good choice of zero time might be the date of discharge after that hospital admission.

In all of the situations just described, the choices of zero time to begin serial time were oriented to the inception of each person's exposure to the maneuver under study. The choices did not depend on age or on secular aspects of the calendrical interval in which the cohort was chosen. These secular boundaries, however, determine the initial "parent population" from which the investigator selects the "candidate population" for his cohorts. In any type of cohort research, the members are assembled because they appeared at an appropriate site during a particular calendrical interval in which the investigator is interested. If the research deals with the course of disease D found during 1955-1959 at Hospital H, the cohort will be assembled from people noted to have the disease during that secular interval at the cited hospital. If the research deals with the effects of cigarette smoking in people who were healthy adults during 1950-1954, the cohort will be assembled from the healthy adult cigarette smokers who answered appropriate questionnaires during 1950-1954.

Despite this common calendrical secularity in their manner of assembly, however, the members of the cohorts may have had major calendrical disparities in the date at which the individual zero times occurred. The people who are uni-secular for a particular maneuver are not necessarily uni-serial for that maneuver. Thus, in a cause-cohort, the adults who were cigarette smokers during 1950-1954 had not all begun smoking during that time. Some people had started smoking many years before 1950-1954, whereas other people had begun more recently. Similarly, in a course-cohort, the people who are admitted to the same hospital with the same disease during the same calendar interval have nevertheless had the disease for varying lengths of time. For some patients, the therapeutic maneuver assigned during that admission was the first course of treatment for the disease, but for other patients, the treatment may have been a second, third, or later episode of therapy.
In addition to this problem of uni-seriality for the maneuver, a problem of uni-zonality exists for a group of patients assembled at a hospital or other medical center. The bias of multi-zonality was considered previously in reference to transmigration for a general population in a particular region. The rate of subsequent deaths or other events in the apparent cohort could be distorted by disproportions in the rates with which healthy and sick people enter or leave the region. Analogously, the rates of post-therapeutic events for the diseased population at a particular hospital may be biased by the patterns of admission and referral to that hospital. For example, a large medical center that provides special types of radiotherapy or chemotherapy for cancer patients may receive many pre-moribund patients who are referred after having had "unsuccessful" surgical treatment at a neighboring "satellite" hospital. If data for these referred post-surgical patients are added to the data for patients whose surgery was first performed at the medical center, the combined post-surgical results may seem worse at the medical center than at the "satellite."

2. The inception cohort. The diverse forms of bias that can result from these different patterns of serial exposure and zonal migration after zero time can be avoided if the group under study is an inception cohort. An inception cohort consists of a group of people in whom "zero time" for the investigated maneuver occurred at the same investigative locale and during the same secular interval.

Thus, if the maneuver is the ontogenetic risk of death during 1967 for the general population of people in Connecticut who were 45-64 years of age on their preceding birthday, an inception cohort would consist of the people with that age and in that region on January 1, 1967, regardless of whether any members of the cohort subsequently moved away or became older than 64. The inception cohort would exclude any appropriately aged "immigrants" who moved to Connecticut during the year, or people already living in Connecticut who became 45 years old after January 1, 1967. If the maneuver is the first course of treatment during 1953-1959 for patients with lung cancer at either of two "index" hospitals,11 an inception cohort would consist of all those people whose first course of treatment took place at either one of those two hospitals during the cited secular interval. The inception cohort would exclude people with lung cancer who were studied in one or both of the two "index" hospitals during the cited interval, but whose first treatment had taken place at some other hospital or at a date before 1953 or after 1959.

According to these specifications, an inception cohort is uni-serial, uni-zonal, and uni-secular. The members of the cohort are uni-serial in that the initial state cited in the denominator for each member refers to his condition at zero time, when the maneuver began; the cohort is uni-zonal in that the maneuver for each member was initiated in the particular zone (city, state, hospital, or other medical setting) at which the cohort was assembled; and the cohort is uni-secular in that the uni-serial, uni-zonal zero time took place within the secular interval of the assembly.

The bias that can occur when cohorts are multi-zonal or multi-secular has already been discussed. Such bias is caused by the "transition" problems described earlier, and the bias can be reduced or eliminated with appropriate attention to stratifications and other suitable procedures that can separate zonal or secular prognostic distinctions within the observed cohort. The remainder of this discussion is devoted to the bias that can occur when a cohort is multi-serial rather than uni-serial, or when a uni-serial cohort is "aborted" by subsequent elimination of some of its starting members. Such bias cannot be removed by stratification or by any other procedure of taxonomy or mathematics. The bias is irremediable if the observed cohort has been permanently distorted by the omission of unknown numbers of people whose data are disparate from the reported results.
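The three specifications of an inception cohort can be expressed as a simple filter. The sketch below assumes a hypothetical record that stores where and when each person's zero time occurred; the member names, hospital labels, and dates are invented for illustration and loosely follow the lung cancer example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Member:
    name: str
    zero_time: date   # inception of the maneuver (here, first course of treatment)
    zero_zone: str    # zone in which the maneuver was initiated


def inception_cohort(members, zones, start, end):
    """Keep members whose zero time occurred in one of the cited zones
    (uni-zonal) and within the cited secular interval (uni-secular)."""
    return [m for m in members
            if m.zero_zone in zones and start <= m.zero_time <= end]


# Hypothetical patients seen at the two "index" hospitals during 1953-1959.
seen_at_index_hospitals = [
    Member("a", date(1955, 6, 1), "index_1"),
    Member("b", date(1951, 2, 15), "index_1"),         # first treated before 1953
    Member("c", date(1956, 9, 30), "other_hospital"),  # first treated elsewhere
    Member("d", date(1959, 11, 20), "index_2"),
]

cohort = inception_cohort(seen_at_index_hospitals,
                          zones={"index_1", "index_2"},
                          start=date(1953, 1, 1),
                          end=date(1959, 12, 31))
print([m.name for m in cohort])
```

Patients "b" and "c" were seen at the index hospitals during the interval but are excluded, because their zero times occurred at an earlier date or in another zone.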
C. The bias of "survival" cohorts

Unless a uni-secular cohort is restricted to members whose zero time occurred during that secular interval, many of the members will actually be survivors of other cohorts with zero times that occurred secularly at an earlier date. For example, if we decide to study the subsequent clinical course of a group of patients noted to have lung cancer at a particular hospital during 1953-1958, any such patients whose first treatment occurred during 1949-1952 are survivors of the earlier cohort. If the survival rate for the newly treated patients is the same as that for those who have already survived several years after first treatment, then we need not attempt to distinguish the "old" and the "new" groups. But if survival rates are different at different serial times after zero time, a failure to make the distinctions can be disastrous.
the long-term
people people
who who
rate
survival
die earl)
is
(
is
are dead aftei
r.
i|
1
in;
The number the number
who
people long-term
ol
die
dates
=
Since the short-term deaths have already occurred among the people with disease D, the assembled cohort will be drawn from qn people, rather than n people in whom the disease began. Of those qn people, (q-r)n will die during the long-term period, leaving rn survivors. When the investigator calculates the long-term survival rate for his observed cohort, he divides the survivors by the number of people with which he began, and he obtains rn/qn = r/q. If he had started with an inception cohort, however, the correct survival rate would have been r, not r/q. By using a "survival cohort," he has biased the true survival rate by a factor of 1/q.

To illustrate this phenomenon with numbers, let [...] the short-term survival rate. If an investigator compares two cohorts that have the same long-term survival rate of r, but different short-term survival rates, q_1 and q_2, the long-term survival rates he calculates from survival cohorts will be r/q_1 and r/q_2. The ratio of these spurious long-term survival rates for the two cohorts will be q_2/q_1. Thus, if both cohorts have a true long-term survival rate of 20%, but if the short-term survival rate is 40% for one cohort and 80% for the other cohort, the spurious long-term rates that will be calculated for the two cohorts are .20/.40 = 50% for the first cohort and .20/.80 = 25% for the second. The observed long-term rate for the first cohort will therefore seem twice as high as that for the second cohort, even though the true rates for the two cohorts are actually the same.

Now suppose two cohorts contain a mixture of two subgroups with different short-term survival rates, q_1 and q_2, and the same long-term rate, r. Suppose the proportion of people from the first subgroup is k_A in Cohort A; the proportion of people from the second subgroup in Cohort A will be 1 - k_A. Suppose the corresponding proportions in Cohort B are k_B and 1 - k_B. Because the long-term rates in the two subgroups are the same, the long-term results in Cohorts A and B should be the same if they were selected as inception cohorts, regardless of the proportions k_A and k_B. If the cohorts were selected after a post-inception short-term interval, however, the observed survival rates will be as follows:

For Cohort A, R_A = k_A [...] + (1 - k_A) [...]
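The single-cohort arithmetic of the survival cohort (rn/qn = r/q) can be verified with a brief sketch. The symbols are used as they appear to be intended in the text, which is an inference: n is taken as the number of people at inception, q as the short-term survival rate, and r as the long-term survival rate. The value n = 1,000 is hypothetical.

```python
def survival_cohort_rate(n, q, r):
    """Long-term survival rate computed from a cohort assembled only after
    the short-term deaths have already occurred."""
    assembled = q * n             # people still alive when the cohort is assembled
    long_term_survivors = r * n   # people alive at the long-term date
    return long_term_survivors / assembled


n = 1000          # hypothetical size of each inception cohort
r = 0.20          # true long-term survival rate, the same in both cohorts

observed_first = survival_cohort_rate(n, q=0.40, r=r)    # .20/.40
observed_second = survival_cohort_rate(n, q=0.80, r=r)   # .20/.80

print(f"first cohort: {observed_first:.2f}, second cohort: {observed_second:.2f}")
print(f"ratio of spurious rates: {observed_first / observed_second:.1f}")
```

The result does not depend on n, since n cancels in the division; the spurious rate is governed entirely by the short-term survival rate of the people who were omitted.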
The statistical point that has just been discussed leads us to the clinical point that is the unalterable, cardinal flaw of all trials performed with randomized allocation of therapy. The results are highly limited. They may not pertain to anyone other than the group of people who participated in the trial. If these people are not a random sample of the disease or condition under treatment, the results cannot be validly extrapolated for statistical purposes, but a clinician may not care about this issue. The extrapolation that concerns him is clinical. If the treated population was not a random sample, whom did the patients represent? And to what kind of patients can the results apply?

Amid the many randomized therapeutic trials that have been performed in the past few decades, I do not know of a single investigation in which the patients under study were chosen randomly. The participating patients have been the "chunk samples" of people found in hospitals, out-patient clinics, doctors' offices, and other sites convenient to the investigators. The participants have never been chosen randomly from the parent population of patients who [...] trial. This difficulty regularly occurs for patients in the diverse clinical spectrums found during therapy for many acute medical ailments as well as for all chronic diseases, such as cancer, diabetes mellitus, and coronary artery disease. We cannot be sure that the clinical characteristics and spectral distribution of patients with such ailments as pneumonia, pulmonary emphysema, functional bowel distress, stroke, or inoperable cancer are the same for a private referral clinic as for a municipal ward service. We cannot be sure that the types of patient treated by Dr. A, who performed the trial, are the same as those of Dr. B, who plans to apply the results. In some circumstances, we cannot even be sure that the results found in one subset of patients are pertinent to what was noted in another subset of patients within the same trial. (The major contrast that can arise for patients in the same multi-center trial was demonstrated in the UGDP study, where the fatality rates, regardless of therapy, ranged from 1% (1/87) in one participating center to 26% (23/90) in another.) To be able to apply the conclusions of a clinical trial reproducibly, clinicians must have satisfactory data to identify the post-
Credulous idolatry and randomized allocation
therapeutic results found in the different kinds of patients under treatment. The patients cannot be suitably identified if they are described only according to a diagnostic label that does not denote important and cogent clinical distinctions of their condition before and after treatment. The crucial role of the stratifications described earlier and previously12 is to permit these distinctions to be designated. When such stratifications are absent, a clinical reader has no way of knowing whether and how the results of a trial can be extrapolated.

Even when the stratifications are well performed, the results may still be difficult to apply if the trial was designed with overly rigid admission criteria that created a "pure-disease sample," a "compliance sample," or some other highly restricted group that differs greatly from the patients encountered in ordinary clinical practice.

For both these reasons (the clinically constricted spectrum of the admitted patients and the failure to perform suitable stratification of those who were admitted), many randomized trials have been performed as a type of in vitro experiment that must be followed by extensive in vivo tests. The trial may show the relative efficacy and safety of the therapy for the group of people under scrutiny, but the results may not be readily interpretable or applicable to patients elsewhere. Even when the data are suitably stratified, the number of patients in the strata may be too small for satisfactory generalization. Consequently, to make appropriate interpretations and applications, clinicians must rely either on additional clinical trials or, more commonly, on surveys of the results noted when the treatment becomes used in the diverse types of patients who may have been excluded from the trial or inadequately identified during it.

These intellectual and clinical problems are regularly ignored when sweeping conclusions are zealously drawn from either a single trial or a relatively small number of trials. The careful formation of scientific conclusions is particularly important for the regulatory agencies and other influential members of the medical community who make decisions about the efficacy, safety, and applicability of new (or old) forms of treatment.

1. Issues in efficacy. A single trial may demonstrate that an agent is efficacious, but cannot prove the absence of efficacy. As shown in the earlier example of an agent that worked well for 1 of every 25 people, the trial may not have included enough members of the strata of "responsive" patients. A trial may also give a falsely "negative" result because of the "cancellation" created by opposing effects in different strata, or because of other forms of insensitive design, as discussed by Cromie.7 Even if efficacy was demonstrated, the results of the investigated group or of other patients must be carefully analyzed to determine whether the efficacy pertains to all the treated patients or only to certain strata.

2. Issues in safety. No clinical trial can prove that an agent is safe. The trial can show only that the agent brought no harm to the people who received it. The agent may not have been maintained long enough for toxic effects to develop, or the trial may not have included the particular kinds of patients for whom the agent is unsafe. For example, in a clinical trial confined to men and post-menopausal women, thalidomide might readily be shown to be an effective, harmless sedative. Conversely, when lack of safety is implied in a clinical trial, the data must be carefully examined to determine whether the "adverse reactions" were actually due to the indicated agent and whether the reactions occurred in all the treated patients or just in certain susceptible strata.

3. Issues in applicability. For all the foregoing reasons, the ultimate safety, efficacy, and applicability of therapeutic agents cannot be established only with randomized clinical trials and must be determined from supplementary data. The results of a single trial or of a few trials cannot possibly encompass the types of patients and reactions to
The architecture
120
of cohort research
be encountered when a therapeutic agent comes into general use. The trials can serve
new agent should be allowed into the clinical "market" or that an old agent should be reappraised but the trials do not necessarily allow
Cobb, L.
3.
Colton,
only to indicate that a
information of general clinical experience
(with or without randomized
trials)
4.
The
5.
Cornfield, S.:
imposed during the
generalize the results of
to
abilit)
would be greatly enh. lined if better techniques were developed for identifying and stratifying clusters of
7.
An
Many
teristics.
safety
and
solved
if
Assn.
58:
Group Diabetes
University
further statistical J.
A.
analysis
of the
M. A. 217:1676-1687,
Halperin,
J.,
and Greenhouse,
M.,
Am.
cal trials,
J.
Cox,
R.:
D.
Stat. Assn.
clini-
64:759-770, 1969.
Randomization,
Biometrics
27:
efficacy could be readily rethe anecdotal opinions reported
non-random
dence assembled
in
clinical
8.
trials
were
improved
all
and an analysis of
within
results
Feinstein,
A.
Clinical
R.:
A.
Feinstein,
An
III-V.
biostatistics.
R.:
Clinical
(UGDP)
Program
Diabetes
VIII.
biostatistics.
analytic appraisal of the University
Group
study,
Clin.
Pharmacol. Ther. 12:167-191, 1971. A.
10. Feinstein,
R.:
Clinical
biostatistics.
IX.
How
do we measure "safety and efficacy"? Clin. Pharmacol. Ther. 12:544-558, 1971.
to allow
a specification of distinctive clinical strata
Med. 59
Roy. Soc.
755-771, 1970. 9.
imprecise mixtures tabulated in ran-
domized
providing data for
in
Proc.
analysis,
The architecture of clinical research, Clin. Pharmacol. Ther. 11:432-441, 595-610, and
evi-
and
surveys,
Cromie, B. W.: Errors (Suppl.): 64-68, 1966.
arguments about
current
as clinical experience, the
A.
11. Feinstein,
On
those
of
Clinical
R.:
ghost
the
exorcising
curse strata.
Kelvin,
Clin.
biostatistics.
XII.
Gauss and the Pharmacol. Ther. of
12:1003-1016, 1971. •
•
12. Feinstein, A. R.: Prognostic stratification series,
•
will continue to
be an
essential in-
Assuma trial have manner, ran-
ing that
all
domization
other plans for in a satisfactory
within
(preferably
marcated prognostic strata)
science
randomization continues to be
re-
The
The References S.:
Controlled studies in clinical cancer research,
Med. 287:75-78, 1972.
in
biostatistics.
XIX.
the twelve different
18.
R.:
Clinical
biostatistics.
XXII.
randomization in sampling, testing,
role
A.
R.:
Clinical biostatistics. XXIII.
of randomization in sampling,
test-
and credulous idolatry (Part 2), Clin. Pharmacol. Ther. 14:898-915, 1973. Fisher, R. A.: The arrangement of field exing,
J.
Clinical
and credulous idolatry (Part 1), Pharmacol. Ther. 14:601-615, 1973.
17. Feinstein,
be neglected.
A.
role of
allocation,
N. Engl.
R.:
A. R.: Clinical biostatistics. XX. The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research, Clin. Pharmacol. Ther. 14:291-307, 1973.
Clin.
and Lee,
A.
14:112-122, 1973.
and if all of the more important and scientific components continue
B.,
humanised Lancet 2:
15. Feinstein,
clinical
J.
for
medication,
concepts of 'control,' Clin. Pharmacol. Ther.
progress,
Block,
The need
Ambiguity and abuse
16. Feinstein,
C,
442-
421-423, 1972.
as the principal ingredient in that
Chalmers, T.
R.:
evaluating
in
14. Feinstein,
well-de-
method for allocating the sequence of therapy. The idea of randomized allocation will retard therapeutic progress, howif
A.
13. Feinstein,
offers the best
current
garded
13:285-297,
457, 609-624, and 755-768, 1972.
gredient in therapeutic progress.
been made
Pharmacol. Ther.
Clin.
In the meantime, however, randomized
1.
J.
one of two
Stat.
adaptive procedure for sequential
statistical
trials
patients with distinctive prognostic charac-
to
Amer.
1114, 1971. (Abst.)
randomized
ever,
for selecting
1971.
6.
trials
The
J.:
A
mortality findings,
trials
the
Cornfield,
Program.
glimpsed, and sometimes distorted, by the investigative artefacts
treatments,
388-400, 1963.
can
provide a satisfactory view of the broad scope of human illness that is merely
A model
J.:
medical
Onlv the additional
definitive conclusions.
A., Thomas, C. I., Dillard, D. H., Merendino, K. A., and Bruce, R. A.: An evaluation of internal-mammary-artery ligation by a double-blind technic, N. Engl. J. Med. 260: 1115-1118, 1959.
2.
allocation,
periments, 1926.
J.
Ministry Agriculture 33:503-513,
Credulous idolatry and randomized allocation
19. Fisher,
8,
New
R. A.:
of experiments, ed.
chance occurrence of substantial initial differences between groups in studies based on
York, 1966, Hafner Publishing Co.
W.: The physical
20. Heisenberg,
quantum Chicago 21. Landis,
The design
theory, Chicago,
random
principles of the
1930, University of
27.
Press.
and Feinstein, A.
R.,
J.
R.:
An em-
comparison of random numbers acquired by computer-generation and from the Rand tables, Comp. Biomed. Res. 6:322-326, D.
V.
:
Is
randomization necessary?
fluencing
plots,
evaluation
of
drugs.
lar
J.
A.
M.
With
A. 167:2190-2199, 1958.
Nebenzahl,
E.,
sampling for a fixed-sample-size binomial lection problem, Biometrika 59:1-8, 1972. 26. Radhakrishna,
S.,
and
Sutherland
I.:
se-
S. Cosset): Comparison between and random arrangements of field
Biometrika 29:363-379, 1937.
Group Diabetes Program: A study
complications in patients with adult-onset
I. Design, methods and baseline reand II. Mortality results, Diabetes 19: 747-830 (suppl. 2), 1970. Youden, W. J.: Randomization and experi-
mentation, Technometrics 14:13-22, 1972.
31.Zelen,
M.:
Play
controlled clinical
The
infarction.
Med. 281:115-119,
sults;
30.
and Sobel, M.: Play-the-winner
J.'
diabetes.
special reference to the double-blind technique,
25.
myocardial
of the effects of hypoglycemic agents on vascu-
D.:
clinical
after
N. Engl.
(W.
29. University
Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Co. Modell, W., and Houde, R. W.: Factors inMainland,
11:47-54, 1962.
1969. 28. Student
balanced
Biometrics 27:1114, 1971. (Abst.)
24.
prophylaxis
Final report,
1973.
23.
allocation, Appl. Stat.
Seaman, A. J., Griswold, H. E., Reaume, R. B., and Ritzmann, L.: Long-term anticoagulant
pirical
22. Lindley,
121
64:131-146, 1969.
the trial,
winner rule and the Amer. Stat. Assn. J.
CHAPTER 9
Consequences of 'compliance bias'

Most discussions of compliance are concerned with a patient's maintenance of an assigned therapeutic regimen. Since no drug, diet, or other agent of therapy can work unless it is taken, one of the main clinical reasons for studying compliance is to increase it. By finding out why patients fail to comply and how we can encourage compliance, we hope to develop better ways of enabling a presumably beneficial therapeutic regimen to accomplish its benefits. With these goals in mind, we may investigate the various clinico-socio-behavioral features that are determinants of compliance and the educational-communicational-packaging features that may enhance it.

A substantial literature has begun to develop on compliance with therapeutic regimens. The references cited here are only a few of the many studies devoted specifically to this topic, and a detailed compendium¹⁵ of the literature was recently assembled for a major "Workshop/Symposium" conducted at the McMaster University Medical Center in Hamilton, Ontario, Canada. The organizers of that meeting, Drs. David L. Sackett and R. Brian Haynes, have also begun to issue a newsletter that contains an ongoing account of new developments in the field.

In trying to achieve or increase compliance with a therapeutic regimen, we begin with the basic assumption that the therapy is desirable: that it is safe, effective, and worth using. Once these virtues have been established, the regimen will warrant the efforts by both medical personnel and patients to ensure that it is maintained as prescribed. On the other hand, the complex spectrum of compliance has effects that can alter the basic data analyzed to determine the virtues of a regimen. These ramifications of compliance are the focus of discussion in this essay.

Unless suitable biostatistical attention is given to the diverse patterns and effects of compliance, it can act as a source of confusion and distortion in the original therapeutic data. If various forms of "compliance bias" are not properly recognized and adjusted, a valuable therapeutic regimen may be dismissed as worthless or an ineffectual treatment may be promulgated as good. In this essay, I should like to discuss six features of compliance that can affect the biostatistical data and interpretations. These features relate to issues in regimen compliance, the evaluation of compliance, the non-compliance "control", protocol compliance, the "compliance sample", and the "compliance-confounded cohort".

This chapter originally appeared as "Clinical biostatistics. XXX. Biostatistical problems in 'compliance bias.'" In Clin. Pharmacol. Ther. 16:846, 1974.
A. Regimen compliance

The first point to be considered is the way that the results of a therapeutic regimen can be distorted by the compliance it receives. Let us assume that an index of success has been established in a randomized, double-blind therapeutic trial, comparing Drug A vs. Drug B. Let us further assume that the two drugs are equally effective. Despite this equivalence in efficacy, compliance bias can alter the results of the trial so that a major difference can falsely occur in the success rates of the two drugs. The observed success rate might be 50% for Drug A and 66% for Drug B.

A false difference of this magnitude could occur as follows: Suppose that the condition under treatment has a 70% success rate when either drug is faithfully maintained and a 30% success rate when either drug is abandoned. Now suppose that Drug A has an unappealing taste, appearance, or schedule of administration so that it is faithfully maintained by only 50% of the patients assigned to it, whereas Drug B receives excellent compliance by 90% of the patients. With these distinctions, if 200 patients were assigned to Drug A, 100 patients would maintain the drug faithfully and 70 of these 100 would have a successful outcome. Of the 100 patients who abandon the drug, 30 would be successful. The net result for Drug A would be 100 successes per 200 patients, an overall success rate of 50%. With Drug B, 180 of the 200 assigned patients would maintain excellent compliance and 126 (70%) of them would have a successful outcome. Of the 20 patients who abandon the drug, 6 (30%) would be successful. The total success rate for Drug B would thus be 132/200, or 66%. The difference between the two success rates would be large enough (16%) to be clinically significant and the magnitude of the sample sizes cited here would also make the difference "statistically significant" at P < .005 ($\chi^2 = 10.5$). If we had not attempted to investigate and analyze compliance, however, we would be unaware of its role in causing the difference. We would then conclude erroneously that Drug B is pharmacologically more effective than Drug A, although the actual cause of the difference arose as a matter of compliance, rather than pharmacologic efficacy.

The mathematical calculations just cited can be simplified by the following algebraic procedure. Let $p_D$ be the proportion of patients who maintain good compliance for Treatment D. Let $q_D$, which is $1 - p_D$, be the proportion of patients who do not maintain good compliance. Let $r_1$ be the rate of an outcome event (such as "success" or a cardiovascular complication) in patients who maintain good compliance. Let $r_2$ be the rate of this outcome event in patients whose compliance was not good. With these conventions, the rate of the outcome event for the cohort of patients receiving Treatment D is $p_D r_1 + q_D r_2$. Thus, in the foregoing example, the success rate for Drug A is $(.50 \times .70) + (.50 \times .30) = .35 + .15 = .50 = 50\%$. The success rate for Drug B is $(.90 \times .70) + (.10 \times .30) = .63 + .03 = .66 = 66\%$.

The effect of compliance bias in the situation just described was to cause a false difference in the apparent efficacy of two drugs that actually had equal pharmacologic action. An analogous set of problems might work in the opposite direction to produce a false difference in the adverse-reaction rates of two drugs with equal toxicity.
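
The arithmetic of this example can be checked with a short Python sketch. It applies the cohort formula (proportion of good compliers times their outcome rate, plus proportion of poor compliers times theirs) to the two hypothetical drugs and then computes a chi-square statistic for the resulting 2 x 2 table. The function and variable names are illustrative assumptions, not part of the original essay, and the use of the uncorrected Pearson chi-square is an assumption about how the cited value was obtained.

```python
def cohort_rate(p_good, r_good, r_poor):
    """Outcome rate for a cohort: p*r1 + (1 - p)*r2.

    p_good: proportion of patients with good compliance
    r_good: outcome rate among good compliers (r1)
    r_poor: outcome rate among poor compliers (r2)
    """
    return p_good * r_good + (1.0 - p_good) * r_poor


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square, without continuity correction (an assumption),
    for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Values taken from the worked example in the text.
R_GOOD, R_POOR = 0.70, 0.30   # success rates when drug is maintained / abandoned
N_PER_DRUG = 200              # patients assigned to each drug

rate_a = cohort_rate(0.50, R_GOOD, R_POOR)   # Drug A: 50% good compliance
rate_b = cohort_rate(0.90, R_GOOD, R_POOR)   # Drug B: 90% good compliance

successes_a = round(rate_a * N_PER_DRUG)
successes_b = round(rate_b * N_PER_DRUG)

chi2 = chi_square_2x2(
    successes_a, N_PER_DRUG - successes_a,
    successes_b, N_PER_DRUG - successes_b,
)

print(f"Drug A: {successes_a}/{N_PER_DRUG} = {rate_a:.0%}")
print(f"Drug B: {successes_b}/{N_PER_DRUG} = {rate_b:.0%}")
print(f"chi-square = {chi2:.1f}")
```

Because both drugs share the same pair of outcome rates, the entire difference between the two cohort rates comes from the compliance proportions passed to `cohort_rate`.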
B. Evaluation of compliance

Since distinctions in compliance can affect the appraisal of a therapeutic regimen's efficacy and safety, the evaluation of compliance is an important, although often neglected, aspect of clinical biostatistics. Probably the main reason for this neglect is that compliance is an entirely subjective and human phenomenon. Its data are extremely "soft". The degree of compliance is determined by the patient; the act of compliance usually occurs in circumstances where it cannot be directly observed by the investigator; and its appraisal depends on what the patient decides to do and report about it. In an era devoted to the analysis of "hard" data, compliance is a variable that lacks scientific appeal, no matter how important the phenomenon may be. The prejudice against this type of information has been well summarized by the recent Nobel laureate, Konrad Lorenz¹⁵:

    If the subject of investigation happens to be human, he or she is being literally dehumanized by being prevented from showing any response which a guinea pig or a pigeon might not show as well (in fact, the same experimental set-up is often applicable to animal and human subjects). Worse, in that kind of experimentation, the experimenter himself is not permitted to be quite human, he is strictly prevented from using most of the cognitive mechanisms with which nature has endowed our species. . . . The worst of this widespread contempt for description is that it discourages people from even trying to analyze really complicated systems.

1. Types of data. For investigators who are willing to cope with complicated human systems, three different methods can be used for getting data to describe and evaluate compliance. Perhaps the most objective technique is to measure the presence of the drug (or one of its metabolic products) in the urine. The main disadvantage of this method is that it pertains only to the one specimen for which the test was made. The result does not indicate the patient's compliance during all the other times for which no urine tests were performed. Furthermore, the single urine test may not be able to indicate whether the drug was correctly taken in the prescribed pattern even on the day of the test.

A second quantitative technique is based on dosage unit counts. At each visit, the patient is given a fixed number of doses in a container that is to be returned for its remaining doses to be counted at the next visit. This technique provides quantitative data, but it will fail if the patient forgets to bring the container, and the "pill count" cannot demonstrate whether the medication was taken in the desired pattern or disposed in various unprescribed manners.

The best way of finding out what a patient has done is to ask the patient directly. From the reply, an investigator can learn the qualitative and quantitative information that is not provided by the other two techniques. For getting the totality of data needed to evaluate compliance, a well-constructed interview technique cannot be replaced by any of the available "objective" methods. Furthermore, a patient who deliberately wants to disguise the truth can subvert the "objective" procedures as well as the information provided in the interview.

The interview technique, however, is patently subjective. It depends completely on the patient's recall and reliability and also on the skill of the interviewer. An interviewer who is punitive or whose manner is otherwise unacceptable to the patient may not elicit accurate data. For these reasons, the interview technique is often used as the prime source of compliance data, but one of the "objective" techniques is added to verify the patient's reliability. A patient would be regarded as truly compliant only if the information stated in the interview is consistent with the results of the objective test.

2. Types of rating. Regardless of what method is used for assembling data about compliance, the results must be cited in a manner that allows the data to be analyzed. Since compliance cannot be expressed in dimensional terms (which might be used for such variables as height or serum cholesterol), the investigator must choose a scale that provides a rating. The scale that is chosen can be a dichotomous partition (such as good and not good) or a set of ordinal ranks, containing such categories as poor, fair, good, and excellent. For many analytic purposes, a dichotomous partition will suffice, since the investigator may want to engage only in a simple comparison in which the outcomes of the good group or not-good group are contrasted against all others.

The choice of criteria for these ratings will obviously be affected by the type of regimen under study and the purpose for which it has been prescribed. For example, the criteria for good compliance may differ
if a daily antibiotic is being taken to prevent rather than to eradicate an infection. To illustrate this distinction, I shall list here two sets of criteria used during investigations of daily oral antibiotics in the prevention of Group A streptococcal infections and rheumatic recurrences in a population of children and adolescents who had all had at least one previous episode of acute rheumatic fever:

Purpose of regimen               Continuous prophylaxis against      Eradication of
                                 streptococcal infection             streptococcal infection

Dose of drug                     Oral Penicillin G,                  Oral Penicillin G,
                                 200,000 units daily                 400,000 units 3 times daily
                                                                     for 10 days

Criterion of "good" compliance   Reliable history and no more        Reliable history and no more
                                 than 5 days missed per month        than 1 dose missed during
                                 and no 2 days missed                the 10 days
                                 consecutively

Regardless of whether the reader agrees or disagrees with the details of these criteria, they demonstrate the fact that different types of therapeutic regimens will require different criteria for ratings of compliance.

C. The non-compliance "control"

Although the scientific prejudice against talking to patients is one of the main reasons that compliance has been so neglected as an important variable in statistical analyses, another type of prejudice has caused clinical investigators to lose other valuable data related to compliance. This prejudice is the tendency of doctors to dismiss the patients who reject the doctors' recommendations.

In ordinary clinical practice, a patient who fails to carry out the doctor's recommendations is performing an important experiment that the doctor was unwilling to undertake. The patient has decided, in effect, to test a counter-hypothesis. If the experiment fails, the failure helps support the propriety of the doctor's original therapeutic decision. If the experiment succeeds, the doctor has learned that success does not always require the original plan of action. Consequently, a patient who refuses to comply with an offered treatment becomes a type of "control" whose results can be compared with those of patients who received the treatment.

Nevertheless, in ordinary clinical practice, a patient who appears to reject the doctor's recommendations is often rejected by the doctor. Because the patient may then be actively or passively urged to seek medical attention elsewhere, the doctor may miss the opportunity to learn the results of the counter-experiment. This type of loss may be a necessary event in circumstances where a busy practitioner wants to use his time "efficiently" and has no intention of ever tabulating his therapeutic results. The loss is highly undesirable, however, if the practitioner's data ever become items of biostatistics.

An illustration of the problem has regularly appeared in surveys reporting the outcome of therapy for cancer. In such surveys, the results of surgically treated patients have generally been compared against those of patients who received radiotherapy or chemotherapy. This comparison is unfair because the non-surgical patients were not an "operable" group; they were usually deemed "inoperable" and referred for other modes of treatment. For an unbiased comparison, the group of "operable" patients who received surgery should be contrasted with a group of "operable" patients who received some other treatment.

This type of contrast would be deliberately arranged in a randomized therapeutic trial, but very few such trials have been conducted for "operable" patients. Consequently, the only source of "operable" non-surgical patients in ordinary clinical practice is the people who were deemed "operable" and who refused the offered surgical treatment. These "non-compliant" patients would be a reasonable control group for the surgically treated patients,
but the non-compliant patients are usually rejected by their surgeons and seldom receive follow-up examinations for their outcomes to be noted, recorded, and analyzed.

Non-compliant patients can also serve as an important "control" group in circumstances where the results of placebo therapy are not available. For example, when several antibacterial agents were receiving randomized clinical trials to compare their value in preventing streptococcal infections, ethical considerations militated against the examination of results in groups treated with placebo. In the absence of a placebo-treated group, however, major problems arose when two of the "active" drugs were found to yield essentially similar results. Was either drug really more effective than placebo? The issue was resolved when the streptococcal attack rate in compliant patients (who maintained "good prophylaxis") was found to be substantially lower than in patients who failed to comply.

Another opportunity to make analytic use of compliance distinctions occurred during an investigation of the role that tonsil size might play in predisposing rheumatic children and adolescents to streptococcal infections. Among the patients who maintained good continuity of antibiotic prophylaxis, the attack rate of streptococcal infections was unaffected by tonsil size. Among patients who did not maintain good prophylaxis, the streptococcal attack rate increased with increasing size of the tonsils.

D. Protocol compliance

In all the issues discussed so far, the idea of compliance referred only to the patient's acceptance or maintenance of an assigned therapeutic regimen. Another important aspect of compliance refers to the maintenance of a research protocol. During the course of a therapeutic trial or other investigation, many planned procedures must be carried out by both investigator and patient. The compliance or non-compliance given to these protocol procedures can affect the results of the research.

1. Compliance by investigator. The likelihood of violating a research protocol is particularly high when multiple investigators are collaborating. The violations usually arise because of poor communication among the collaborators or inadequate attention by the individual investigators. One frequent violation occurs in the criteria for admission to the trial. For example, in the UGDP cooperative study of therapy for diabetes mellitus, some of the admitted patients did not fulfill the minimum standards of glucose intolerance that had been established as diagnostic criteria for diabetes mellitus. Aside from the ethical problems produced by this type of protocol violation, it can create a major statistical problem if the results in the "ineligible" and "eligible" patients are substantially different.

Another type of violation, which is often inadvertent, is the investigator's development of the ability to discern the identities of the drugs being studied in a double-blind trial. This "unmasking" is particularly likely to occur if the active drug can be recognized from a physiologic side effect such as the bradycardia that often occurs with beta-blocking agents. The consequence of this type of protocol violation is the delusion that symptoms and other subjective data have been determined with the presumptive "objectivity" of an effectively maintained double-blind technique. If doctors (or patients) become successfully able to differentiate the active drug from the placebo, the clinical trial is converted into a pseudo-double-blind exercise, having all the logistic disadvantages of double-blind research and none of the scientific advantages.

A third protocol problem in therapeutic trials is the need to exclude supplementary drugs that might affect the results of the main drug under investigation. The violation of this specification of a protocol is often overlooked if the violations have occurred with equal frequency in each group of patients receiving the compared
therapeutic regimens. Since the qualitative characteristics of the "ineligible" medications may not be equivalent for each group of patients, the differences may be responsible for distinctions that become erroneously attributed to the principal therapeutic agents.

Of the many other potential issues in investigator non-compliance, the only one to be cited here is the problem of preserving the "letter" of a research protocol while ignoring its "spirit". For example, suppose we are conducting a therapeutic trial to determine whether the maintenance of normoglycemia will prevent vascular complications in adults with diabetes mellitus. If we prescribe a fixed dosage of an oral hypoglycemic drug and determine whether the patient complies with the prescribed regimen, we have adhered to the specifications of the protocol. On the other hand, if we fail to check whether the patient's blood sugar is actually being maintained in a normal range or if we fail to adjust the dose of the drug so that it produces normoglycemia, we have not complied with the basic idea of the research.

There are many other ways in which an investigator's non-compliance with protocol can distort the research data. Nevertheless, the published reports of a research project often contain no indication of efforts made to check whether protocol compliance has occurred. An interesting aspect of the peculiar "double standard" used on the current research scene is that pharmaceutical companies are expected to monitor the compliance of clinical investigators who perform trials of new drugs, but an analogous monitoring may not be demanded when a federal agency sponsors a multi-center therapeutic trial involving investigators at academic institutions.

2. Compliance by patient. While complying with the prescribed medication, a patient may violate the prescribed protocol in several different ways. One violation consists of breaking the double-blind code. A recent example of this problem was provided by Chalmers. He described the way in which a sophisticated group of patients (employees of the National Institutes of Health), who were participating in a double-blind clinical trial, tasted the contents of the capsules to distinguish vitamin C from placebo. According to Chalmers, the rates of the outcome event in the trial were substantially different for patients who did or did not correctly identify their medication.

Another important form of patient non-compliance is improper attendance for repeated examinations after treatment is initiated. If a patient scheduled to have a particular test done at 4 weeks and at 8 weeks after treatment appears only at 6 weeks, where do the results of the 6-week test get counted? Suppose the patient does not appear often enough to have all the periodic tests that are needed to rule out episodic events, such as asymptomatic streptococcal infections or anicteric hepatitis. How are the incomplete data to be analyzed? There are no simple statistical answers to these questions. Each decision requires subtle judgments according to the particular circumstances that are involved.

The ultimate act of non-compliance, of course, is the patient's decision to drop out of a study altogether. Except for one issue to be cited later, the problems of analyzing data for drop-out patients are beyond the scope of this discussion, and will be reserved for a later installment in this series. The problems are difficult, complex, and not always well managed by actuarial ("life-table") analyses that are usually proposed as a solution.

E. The "compliance sample"

A different type of biostatistical problem arises when a therapeutic trial is conducted with a "compliance sample" of patients. Such a sample arises in the following way: Before admission to the trial, the patients who are otherwise eligible are screened to determine their ability and willingness
The architecture
128
of cohort research
comply with both the protocol and the
to
therapeutic regimens
whom
Patients
under investigation.
the investigators regard as
non-compliant arc then excluded from admission, so that the trial is conducted with the group of seemingly cooperative patients
who constitute the "compliance A clue to the existence of such
sample". a
sample
criteria
noted from an account of the used for excluding patients from
a
These
can
be
trial.
depend
criteria customarily
on various features
prognosis,
oi diagnosis,
co-morbidity, or co-medication. teria also include a
the cri-
If
statement about "will-
ingness to cooperate", the investigators have
used
a
compliance sample.
To choose patients in this way seems perfectly reasonable. After all, in conducting a therapeutic trial, the investigators do not wish to expend major amounts of time and vigorous research efforts on patients who are not likely to maintain the proposed medication or appear for examination procedures. By screening out the non-compliant patients, the investigators would eliminate wasted energy and increase the efficiency of the research activities.
In attaining this efficiency, however, the investigators take a substantial risk. The risk is that the compliant patients may not properly represent the people who have the condition under treatment. The risk is tiny if the excluded non-compliant patients constituted only a small proportion of the total group of otherwise eligible patients. If the excluded group occupied a large fraction of the eligible cases, however, the results of the trial may be seriously compromised, particularly if compliance and therapeutic responsiveness are inter-related. The results of the trial may be pertinent for compliant but not for other patients with the same clinical condition.
An example of this type of problem occurred in the recent Veterans Administration Cooperative Study 20 of the treatment of hypertension. Because the results provided "hard" (randomized) evidence of the value of treating many patients with asymptomatic hypertension, the trial has received justified praises for the excellence with which it was designed and conducted. Nevertheless, the group under study contained a highly restricted compliance sample of hypertensive patients. The selection procedure was described as follows:
Since an appreciable number of dropouts would jeopardize the study, we wished to minimize their occurrence as much as possible. "Skid row" alcoholics, vagrants, psychopaths, antagonistic personalities, mentally incompetent persons who are not properly cared for at home, and those who for one reason or another could not return regularly to the clinic are therefore excluded from the trial. In addition, the pre-randomization trial period serves to eliminate other potential drop-outs that are missed during the initial evaluation.
The VA investigators have not published data on the number of otherwise eligible patients whose demonstrated (or anticipated) non-compliance kept them from being admitted to the trial. It has been estimated 10 that between one half to two thirds of the patients with eligible blood pressures were excluded from entry.
The exclusion of this large proportion of patients would not affect the results found in the compliant patients who were treated in the trial, but would impair the ability to draw general conclusions about the treatment of other patients with hypertension. Docile hypertensive patients who are willing to comply in such a trial may have their vascular systems benefitted by therapeutic agents that lower blood pressure; but these agents may not work as well on the many non-docile hypertensives who are non-compliant. As public campaigns are mounted to deliver appropriate treatment to all patients with hypertension, the results (if noted and evaluated) may be somewhat disappointing. If the rate of vascular complications is not reduced as much as was expected, the disparity may arise from the unresponsiveness of the non-compliant patients whose therapeutic refractoriness had not previously been discovered.
F. The "compliance confounded cohort"
The last problem to be cited here is particularly subtle and complex. It can arise if the ability to comply with a therapeutic regimen is also related to the event that is to be noted as the main outcome of treatment. If compliance ability and outcome event are closely related, the results will be distorted by a confounding variable. Regardless of treatment, the cohort of people who can comply with treatment will be destined to have an outcome-event rate that differs substantially from the corresponding rate in the people who do not maintain compliance. Consequently, an ineffectual regimen may falsely appear to be distinctly beneficial (or detrimental) to the people who maintain it.
To illustrate this point, suppose that people who have a high degree of the particular kind of inner drive or stress that might be called psychic tension are also more likely to develop cardiovascular disease than people who are not psychically tense. Let us now suppose that an unappealing and difficult-to-maintain new diet has been proposed as an agent that prevents cardiovascular disease. Let us further assume that the diet is actually ineffectual. Finally, let us assume that the tense people have great difficulty in complying with this new diet, whereas non-tense people are much more able to comply. Under these conditions, when the diet is prescribed for a large population, the rate of cardiovascular disease will be lower in people who maintain the diet than in people who do not. The false conclusion may then be that the diet effectively prevents cardiovascular disease.
Since this type of problem may arise in the multiple risk factor intervention trial (MRFIT) now being launched throughout the United States, I shall cite some contrived numerical data to illustrate the possibilities.
Let us assume that we can identify a tense group of people who will have a 30% rate of cardiovascular events during the interval under study. In non-tense people, the corresponding rate is 5%. Let us further assume that the population under study consists of 70% non-tense people and 30% tense people. If nothing were done to this population, we would expect the overall rate of cardiovascular events to be (.70) (.05) + (.30) (.30) = .035 + .09 = .125 = 12.5%.
Now suppose that the entire population enters a randomized clinical trial in which the action of a special new diet is being tested. Half the population is assigned this new diet and the other half continues to maintain its usual dietary pattern.
The compliance problem might now occur as follows. For the people who are not receiving a special diet, there is no difficult regimen acting as a provocation to drop out. The only drop-out incentive is the "nuisance" of participating in the clinical trial itself. Consequently, the drop-out rates in the no-diet patients would be the usual attrition to be expected in any trial, and the rates would be similar in the tense and non-tense patients. Let us assume that these drop-out rates are 5% in each group. The remaining population in the no-diet cohort will thus be composed of 95% of the starting members of each psychic group, and will be 95% of its original size. [This figure can be verified as (.95) (.70) + (.95) (.30) = .665 + .285 = .950 = 95%.] For the patients in the cohort assigned to receive the special diet, the difficulties of maintaining the diet will create a strong stimulus toward dropping out. Among non-tense patients, let us assume that the drop-out rate is 10%, twice as high as the rate in similar patients not receiving the diet. Among tense patients, the problems of maintaining the special diet are formidable, so that 80% of these patients drop out.
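The contrived figures stated above (event rates of 30% in tense and 5% in non-tense people, a 70/30 mix of non-tense to tense people, drop-out rates of 5% in both psychic groups without the diet, and 10% and 80% with it) can be traced through both arms of the trial in a few lines. The following is only a sketch of that arithmetic, not a procedure taken from the original essay; the dictionary and function names are assumptions chosen for illustration, and the diet is assumed to have no effect on the event rates.

```python
"""Trace the contrived compliance-bias figures through both arms of the trial."""

# Figures quoted in the text; the names used here are illustrative assumptions.
EVENT_RATE = {"non_tense": 0.05, "tense": 0.30}   # event rates, unaffected by the diet
POPULATION = {"non_tense": 0.70, "tense": 0.30}   # make-up of each randomized arm
DROP_OUT = {
    "no_diet": {"non_tense": 0.05, "tense": 0.05},
    "special_diet": {"non_tense": 0.10, "tense": 0.80},
}


def pooled_rate(shares):
    """Return (total share of the arm, event rate within that share)."""
    total = sum(shares.values())
    events = sum(share * EVENT_RATE[group] for group, share in shares.items())
    return total, events / total


def summarize(arm):
    """Event rates among completers, drop-outs, and everyone randomized."""
    drop = DROP_OUT[arm]
    stayed = {g: POPULATION[g] * (1 - drop[g]) for g in POPULATION}
    left = {g: POPULATION[g] * drop[g] for g in POPULATION}
    stayed_share, stayed_rate = pooled_rate(stayed)
    left_share, left_rate = pooled_rate(left)
    # The two shares add up to the whole arm, so this is the rate in all randomized people.
    overall = stayed_share * stayed_rate + left_share * left_rate
    return {
        "completed_share": stayed_share,
        "completers_rate": stayed_rate,
        "dropouts_rate": left_rate,
        "all_randomized_rate": overall,
    }


if __name__ == "__main__":
    for arm in DROP_OUT:
        for name, value in summarize(arm).items():
            print(f"{arm:12s} {name:20s} {value:.3f}")
```

Under these assumptions the people who complete the trial show event rates of about 12.5% without the diet and about 7.2% with it, even though the rate among all randomized people is 12.5% in both arms. The apparent benefit comes entirely from who remains under observation.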
The total group of people who maintain the special diet will thus be reduced to 69% of the original cohort. [This figure can be verified as (.70) (.90) + (.30) (.20) = .63 + .06 = .69 = 69%.] This rate of compliance will not seem unusually low, because a relatively high drop-out rate would be anticipated for people assigned to the special diet.
Now let us consider what will be observed as the outcome rates for cardiovascular events in this study. For the cohort of people who continued to participate in the no-diet group, the event rate would be 5% for the 66.5% non-tense people and 30% for the 28.5% tense people. The total event rate would be (.05) (.665) + (.30) (.285) = (.03325) + (.0855) = .11875. When adjusted for the population of no-diet people who actually completed the trial, this rate would be (.11875) / (.95) = .125 = 12.5%. For the cohort who continued to maintain the special diet, the event rate would be 5% for the 63% non-tense people and 30% for the 6% tense people. The total event rate in this group will therefore be (.05) (.63) + (.30) (.06) = .0315 + .018 = .0495 = 4.95%. When adjusted for the special-diet people who actually completed the trial, this rate would be (.0495) / (.69) = .072 = 7.2%.
If we knew nothing about the relationship of personality, compliance, and cardiovascular rates in tense vs. non-tense people, we would observe only the outcome rates. Without a stratification for psychic state, we would not be aware of the differential drop-out distinctions that had produced the differences in outcome, and we might draw conclusions based only on the gross outcomes. Thus, in a randomized clinical trial comparing a special diet vs. no diet, we would have noted that the people who maintained the special diet had a cardiovascular event rate of 7.2%; and that the people who maintained no special diet had a corresponding rate of 12.5%. The special diet would appear to have reduced the cardiovascular event rate by (12.5 - 7.2) / 12.5 = 42.4%. This magnitude of reduction would obviously seem clinically significant; and furthermore, with the large numbers of patients entered into the trial, the difference would also be "statistically significant".*
*With calculations that are too extensive to be repeated here, it can be shown that this difference in the two groups of compliant patients will have a P value below .05 (by χ² test) if as few as 636 patients are initially enrolled in the trial.
The obvious conclusion would seem to be that the special diet had reduced the rate of cardiovascular disease by more than 40%, and yet the conclusion would be totally wrong. The apparent benefits of the diet, which we know was actually ineffectual, would have arisen only from the fact that it received compliance mainly from people destined to have a low rate of cardiovascular events.
At this point in the discussion, a student of clinical trials would immediately note that the analysis presented thus far is incomplete. We have not yet looked at the results of the drop-out patients. Under the conditions noted earlier, we should find a substantial difference in cardiovascular rates in the two groups of patients who dropped out. These rates would be 12.5% in the no-diet group and 24.4% in the special-diet group. The explanatory calculations are as follows. In the no-diet group, the non-tense drop-outs would contain (.70) (.05) = .035 and the tense drop-outs would contain (.30) (.05) = .015 of the original population. The cardiovascular rate in this group of drop-outs would be [(.035) (.05) + (.015) (.30)] / .050 = [.00175 + .00450] / .050 = .00625 / .05 = .125 = 12.5%. In the special diet group, the non-tense drop-outs would contain (.70) (.10) = .07 and the tense drop-outs would contain (.30) (.80) = .24 of the original population. The cardiovascular rate in this group of drop-outs would be [(.07) (.05) + (.24) (.30)] / .31 = [.0035 + .072] / .31 = .0755 / .31 = .244 = 24.4%. The finding that one drop-out group had a cardiovascular rate twice as high as the other drop-out group should immediately alert our suspicions that something extremely peculiar has happened.
Furthermore, if we look at the results of all patients who were randomized, regardless of those who dropped out, we would find that the cardiovascular rates are the same. In the no-diet group, the rate would be [(.11875) + (.00625)] / [(.95) + (.05)] = [.12500] / 1.00 = 12.5%. In the special diet group, the rate would be [(.0495) + (.0755)] / [(.69) + (.31)] = [.1250] / [1.00] = 12.5%. This similarity in cardiovascular rates for the two randomized groups, regardless of drop-outs, would help confirm the existence of some strange phenomenon among the drop-out cases.
To get all this additional information, however, would require that the therapeutic trial be conducted with an extraordinary passion for getting complete, detailed follow-up data for all patients who have dropped out. This intensity of follow-up surveillance almost never occurs in a therapeutic trial. If the outcome event is death, the investigators can usually learn about its occurrence in drop-out patients who have otherwise been "lost to follow-up"; but if the outcome event is a non-fatal cardiovascular event (such as angina pectoris, myocardial infarction, intermittent claudication, or stroke), the occurrence or non-occurrence of this event is difficult to document in a standardized manner for living patients who have been lost to follow-up. Because of these difficulties in the follow-up of dropped patients, the investigators may be strongly tempted to confine their main analyses to the patients who, complying with the research protocol, continued under observation. If this temptation is accepted, the investigators will reach the erroneous conclusion described earlier.
The possibility of this type of error is a major hazard in the work of the MRFIT study. An abundance of evidence has now been assembled to suggest that a distinctive relationship exists between certain personality types (or psychic states) and subsequent cardiovascular disease. The main issue is no longer whether such a relationship exists, but how to identify it: by which particular psychologic test or other psychiatric examining instrument can we best discern the people who are especially susceptible to cardiovascular disease. Another reasonable belief is that the people who have this particular psychic constitution may be unwilling or unable to maintain the particular forms of special dieting or other interventions that are prescribed as "active therapy" in the MRFIT trial. Since the "control" group will receive no diet, rather than an equally unpalatable diet, the MRFIT investigation thus possesses all the ingredients needed for a major scientific error in the interpretation of results.
The most cogent way of avoiding this error is for the patients' psychic condition to be examined with standard test procedures that are as thorough as those used for examining the condition of serum lipids. With such data, the investigators could identify the different degrees of psychic "risk", and could use the results for the analyses needed to demonstrate that compliance bias is absent. This approach may not be scientifically appealing, however, because the questionnaire and other written instruments used for examining psychic status have not received intensive attention from epidemiologists. The ideologic belief of most contemporary epidemiologists has been that "risk factors" arise from nurture but not from nature: from such environmental features as food, water, tobacco smoking, and exercise; but not from such constitutional features as heredity and psychic status.
Because of this ideologic belief, both heredity and psyche have been generally ignored in epidemiologic re
architecture of cohort research
132
The
search,
and suitable
by using age, race, sex or other
instruments
for bias"
have not been developed or applied for
variables
obtaining the necessary data
instead
scientific
large co-
in
absence of suitable psychic exami-
nations and correlated data
way
other for
is
even
drop
formed
men compliance, are
different
the
in
torted
l>v a
Equalit) not
will
have basic
received
a
major
may be
dis-
however, because the unwilling to con-
MRFIT
clinics
for
To cope with investigators may need clinics, home visits, or
examinations.
night
.special
may
surveillance
may be
problem, the
drop-out
the
regimens,
data
diagnostic
necessar)
other
patients
compliance-confounded cohort.
of
arrange
to
regi-
procedures
patients
that
will
allow
receive suitable
to
low-up examinations
fol-
•
•
six features of
to
potential
indicate for
to
scientific
does
not
its
bias,
7
intricacies
and
contemplate
If
the
and
they
rule
may
out
selection, detection,
i.se
due to inequities in and chronology, com-
merely an investigator hopes or wishes will
appear
jf
on
focus
restricted
"hard"
In addition to the cited scientific defects, however, therapeutic investigations that depend only on "hard data" create an important humanistic hazard.
The
may be
idea
analyses
biostatistical
established
that
unable to
are
dis-
between an act of patient care
tinguish
and an exercise
disappear
not
Boyd,
and
medicine.
in veterinary
J.
exist.
It
Covington, T.
R.,
Coussons,
also will not dis-
the data analyst tries to "adjust
R.
of
2.
3.
W.
R., Stanaszek,
Drug
T.:
compliance;
noncompliance patterns, Am. 31:362-367; 485-491, 1974.
I.
Analysis
of
II. J.
F.,
defaulting.
Hosp. Pharm.
Blackwell, B.: The drug defaulter, Clin. Pharmacol. Ther. 13:841-848, 1972. Chalmers, T. C: Quoted in Internal Medicine News, p. 4, Oct. 11, 1973. A. R.,
4. Feinstein,
Wood, H.
F.,
Epstein,
A.,
J.
Taranta, A., Simpson, R., and Tursky, E.:
study
controlled
phylaxis
of
methods
three
streptococcal
against
of the
II.
A
pro-
of
infection
population of rheumatic children.
in
a
Results
three years of the study, including
first
evaluating
for
the
oral prophylaxis, N. Engl.
J.
maintenance
of
Med. 260:697-702,
1959. 5.
A.
Feinstein,
Kloth,
phylaxis
R.,
Spagnuolo,
M.,
Tursky, E., and Levitt,
H.,
of
recurrent
Jonas,
M.:
S.,
Pro-
rheumatic fever. Thervs. monthly
apeutic-continuous oral penicillin injections,
does not
"hard"
obtained by direct conversation
are
methods
search. Like the biases
it
a
the malefaction.
the
re-
that
is
with patients, biostatisticians have abetted
investigator
chosen hypothesis and invalidate his
beet,
that
bias,
his
bias
petuating
its
vitiate
tee
clinical science that
Determinants
bias can
detection
research.
counter-hypotheses,
ph
have created the
data while ignoring important "soft" data
1.
1-
of
"soft"
J
because
information
this
scientifically
neglected
but often irrelevant or erroneous. By per-
creating major biostatistical
Compliance
selection
is
compliance should
be added 7 and chronology bias as another prime source of the confounding variables that produce fundamental errors in biostatistical analysis. Confounding variables in biostatistics are like counter-hypotheses in any other form
delusions.
abandoned
or
pa-
the
to
References •
suffice
talking
who have
for detecting cardio-
vascular events.
These
medicine:
clinical
tient. The- investigators
hazard of a
tinue to return to the
this
of
to restore attention to a traditional activity
of
non-COmplUmi
therapeutic
drop-OUt patients the
per-
To acquire the data needed for analyzing compliance and ruling out the existence of compliance bias, investigators will have
it
achievable,
In
equally
is
regardless
their
investigators thai
on the "incon-
ing.
the cardiovascular rates
If
for
several
signal
events
everyone,
for
surveillance,
out, so thai the detection
cardiovascular
oi
patients to continue
medical
intensive
the)
it
MRFIT
the
all
an-
analyses,
of trying to avoid the cited error
receive
to
conveniently available,
concentrating
venient" variables that create the confound-
hort studies. In the
are
that
of
6. Feinstein,
J.
A.
A.
M.
R.,
A. 206:565-568, 1968.
and
Levitt,
M.:
The
role
of tonsils in predisposing to streptococcal in-
Consequences
fections
N. Engl.
and recurrences of rheumatic Med. 282:285-291, 1970. J.
7. Feinstein,
A.
R.:
Clinical
X.
statistics,
Pharmacol. Ther. 12:704-721, 1971.
Clin. 8.
Feinstein,
A.
R.:
Clinical
biostatistics.
XI.
Sources of 'chronology bias' in cohort statistics, Clin. Pharmacol. Ther. 12:864-879, 1971. 9.
Organization of a long-term mul-
Freis, E. D.: ticlinic
therapeutic
in
trial
hypertension,
in
Gross, F., editor, with the assistance of Naegeli,
and Kirkwood, A. H.: Antihypertensive Principles and practice. An international symposium, New York, 1966, SpringerS.
R.,
therapy.
Verlag, pp. 345-354. D.: Personal communication.
and Barsky, A.
Diagnosis
11.
Gillum,
12.
and management of patient noncompliance, A. M. A. 228:1563-1567, 1974. J. Gordis, L., Markowitz, M., and Lilienfeld, A.
M.:
F.,
Why
patients
don't
J.:
medical
follow
A
study of children on long-term antistreptococcal prophylaxis, J. Pediatr. 75:957advice:
968, 1969. 13.
Mazzullo, P.
tion
M.,
J.
Lasagna,
and Griner,
L.,
Variations in interpretation of prescrip-
F.:
The
instructions.
prescribing habits,
need
A.
J.
for
improved
M. A. 227:929-931,
1974.
H. P., Caron, H. S., and Hsi, B. P.: Measuring intake of a prescribed medication. A bottle count and a tracer technique compared, Clin. Pharmacol. Ther. 11:228-237,
17. Roth,
1970.
and Cluff, L. E.: A review of medication errors and compliance in ambulant patients, Clin. Pharmacol. Ther. 13:463-
18. Stewart, R. B.,
Group Diabetes Program. A study
19. University
the
of
of
effects
hypoglycemic
on
agents
vascular complications in patients with adultonset diabetes.
Part
I:
Design, methods, and
baseline characteristics. Part II: sults,
Diabetes
20. Veterans
Mortality re-
19(Suppl. 2): 747-830, 1970.
Administration
Study
Cooperative
Group on Antihypertensive Agents. Effects treatment on morbidity in hypertension.
of
notated
Results in patients with diastolic blood pres-
patients
with
sure
R.
B.,
therapeutic
regimens,
Depart-
(
C. T.: Quoted in Medical M. A. 227:1243-1244, 1974.
Kaelber, J.
A.
Lorenz,
K.
Z.:
The fashionable
J.
News,
fallacy
21. Wilson,
90
through
114
mm
T.:
J.
Compliance with
Hg,
instructions in
the evaluation of therapeutic efficacy.
but
variable,
1973. of
averaging
II.
A. M. A. 213:1143-1152, 1970.
mon
graphed pamphlet.
15.
16.
and Sackett, D. L.: An anbibliography on the compliance of
Haynes,
ment of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences Centre, MimeoHamilton, Ontario, Canada, 1974. 14.
Naturwissen-
description,
468, 1972.
10. Freis, E.
R.
with
133
schaften 60:1-9, 1973.
biostatistics.
Sources of 'transition bias' in cohort
dispensing
fever,
of 'compliance bias'
frequently
Clin.
Pediatr.
A com-
major unrecognized (Phila.) 12:333-340,
SECTION TWO
OTHER ARCHITECTURAL PROBLEMS
The difficulties noted in the preceding section can be magnified or embellished in the several ways that are discussed in the next few chapters. The first problem arises from a misplaced confidence in the prophylactic or remedial powers of statistical consultation. The statistician usually has many valuable tricks up his sleeve, but the sleeve may sometimes be empty or the trick may be a delusion.
The second problem arises as another misplaced confidence, caused by the disparity between mathematical ideas based on random sampling and the medical reality of "samples" that are never selected randomly. Beyond the potential bias of a "rancid" sample, an investigator can add further distortion to the data by using the unequal examination procedures that create "tilted targets."
A prominent source of confusion in scientific plans is the ambiguity with which the concept of "control" is used and abused in the design of research. In most statistical courses on "experimental design," the control is the comparative maneuver, but neither scientific nor statistical instruction contains specific attention to important issues in choosing either the comparative maneuver or the control group of people who receive it. To confound the confusion, the idea of control also has been applied to at least ten additional ideas in medical research.
A "case-control" study is the most frequent situation in which the idea of control is diverted from its customary scientific connotation and is applied to an effect rather than a cause. In case-control research, a group of diseased people is compared against a group of controls who do not have the disease. The comparison can be used in a "cross-sectional" study to examine the diagnostic utility of a particular marker or test in discriminating between diseased and nondiseased people. Alternatively, the case-control study can be "retrospective," aimed at examining an etiologic suspicion. In both types of case-control arrangement, the standard forward architecture of scientific research is drastically altered and the choice of a suitable control group becomes the crucial feature that determines the value of the results. Since mathematical principles again offer no help in making this choice, the decision requires careful scientific strategies.
Although diverse mathematical tactics have been developing for manipulating the quantitative results of both types of case-control study, rigorous standards have not yet been established for the scientific principles of a satisfactory research architecture. In diagnostic case-control studies, the key issue is the degree of discrimination that is sought within the diverse spectrum of diseased and nondiseased people. In etiologic case-control studies, the key issue is the development of suitable procedures to avoid the many biases that are inevitable when cause-effect reasoning is conducted in a logically backward temporal direction.
CHAPTER 10
Statistical malpractice - and the responsibility of a consultant
A
pathologist has often been called a
"doctor's
In
doctor."
situations
which
in
any final scientific method," Bernard urged clinical investigators "reject to lish
necropsy, biopsy, or cytology can provide
statistics
confirmation for diagnostic reasoning, the
mental
pathologist
the
is
have
clinicians
consultant
whom
to
turned
traditionally
for
verification or occasional refutation of the
reached during the
decisions
diagnostic activities.
become the
now He is
doctor."
whom
investigators regu-
advice about decisions
larly turn for
The
during research.
who
preceding
statistician has
"researcher's
the consultant to
sultant
A
statistician
made
the con-
is
When
Pierre Louis, more than a century was developing and advocating his "numerical method" for the appraisal of ago,
therapy, 19 his clinical espousal of statistics
was opposed by most cians of the day. 11
and
to help interpret
cause
I
there
is
the scientific value of statistics.
A
when Claude Bernard was
.
nature."
Saying that
never yield
century
biostatistics
11:898, 1970.
.
.
statistics .
.
estab-
chapter originally appeared as VI." In Clin. Pharmacol. Thcr.
this
—
.
can
were not con-
it
much
...
should only praise
I
be
for counting too
to
much
any mind
precision,
relative
be-
it,
but noise made about such poor reproach (the statisticians) it
.
for
it
.
.
useful;
and
for de-
into the facts.
...
is
.
.
.
only a
changes under the
observation of the same man, according to the year, the season, and the reigning medical constitution.
much and
scientific truth (or)
With the same name, "Clinical
".
if
This mathematical exactitude
establish-
about phenomena as they construct them in their minds, but not as they exist in
.
so
I
believe
really
clining to put
"mathematicians reason
denounced simplify too
.
...
science
results.
ing experimental discipline in medical re-
.
the
in
sidered as the very keystone of the arch of all
.
of the leading clini32
-
the application of statistics to medicine
If
have not always held this crucial role in the world of medical research. Until the past few decades, clinical investigators were strongly distrustful and sometimes actively antagonistic about
he
'
were not rated too high,
Statisticians
.
21
The sentiment was remarks of the renowned Armand Trousseau: 34 summarized
the results.
(who)
pathological
often asked to check the
is
analysis of the data,
search,
experi-
for
and
science." 1
design of a research project, to plan the
ago,
foundation
a
as
therapeutic
From beyond tune of taunts
book,
the clinical world, the for-
statistics
as
How
Koestler's
has borne such recent
Darrell
Huff's 20
to lie with statistics,
remark that bathing
a
bikini
is
interesting;
From within
suit:
provocative
and Arthur
"Statistics
are like
what they reveal
what they conceal
is
vital."
the statistical fraternity, an
"anthology" of diverse misuses and abu
137
Other architectural problems
138
been pro\ ided by Wallis and Roberts.
of statistics has
in the text
:
Despite these caveats, statisticians have not only to endure in the world
managed
research, but
of clinical
more recently
to
The editors of good medical journow insist oil appropriate statistical
prevail.
nals
reviews
which were formerly applied mainly to the choice and application of statistical tests after the investigator had planned the research,
enough
often
solicited
and sometimes
early
govern
to
xvclcomed
have
Statisticians
new
the
recommended
one leading journal" the word "significance" has been removed from general circulation and reserved
of exposure to books
for use onlv in a statistical context. At the
tical
extremes
of
the statistician has begun to feel relatively
editor of
a
for publication,
and
now
are
to afleet
the basic design of the project.
manuscripts are accepted
before
before the project begins. His ideas,
tically
at
mathematical obeisance,
the
has de-
psychologic journal
challenges and have
pansion of their concepts
After a generation
roles.
of
and courses on
"experimental
confident about his
this ex-
planning
in
skill
statis-
design."
re-
cided to accept only manuscripts that have
search, and. for several decades, statisticians
by P
have urged that they be consulted "before you begin the project, not afterward." In
the "super-significance" demonstrated
values of less than 0.01.° In concepts about etiology of disease, statistical "proofs" for
the
causes
of
chronic
diseases
are
now
accorded the respect that was once reserved for Koch's experimental postulates about acute infectious disease. Statistical validation has been emphasized generally
by the
FDA
guidelines
new
sanctioning
claims
about
Advance approval and con-
drugs.
comitant participation by statisticians have
become
a
prerequisite
demand
large-scale clinical trials will be
suitable agencies.
And
before
funded by
courses in statistics
have become standard parts of the curriculum
leading
doctoral
to
degrees
in
either medicine or biology.
To achieve may have had ma\-
still
this
status,
to fight
the
statistician
an uphill battle and
many scars of the now clearly arrixed.
bear
but he has
conflict,
Further-
more, his consultative authority has been expanded in recent years to include prex'ention,
the
and not
just diagnosis
statistical
increasing
constantly
who come
consultant has willingly ac-
the researcher's doctor.
As
every
knoxvs,
of
medical
practicing
course,
a
cepting
A
responsibility.
physician
xvho
on the problems brought to him by a patient also becomes responsible for what he does in their management. He must be fullv attentive to the subtleties takes
as
the
as
xvell
grossly
oxert
aspects
of
the problems; he must not perform procedures for xvhich he is untrained or un-
he must guard against breaches and he must be ready, x\ hen necessary, to defend his actions if they are questioned by a jury qualified;
of accepted ethical standards; r
of his peers or in a court of laxv.
In accepting authority as a consultant
for the "ailments" of clinical research proj-
in
clinical
research,
how
His assistance, xvhich xvas formerly sought remedially after a project xvas completed, is now often solicited prophylac-
statistician
accepted
the
°Th' egemony of statistical doctrines has not been confined biomedical publications. In the literature of social sck the traditional deference to a statistician's < imprimatur recently been subjected to the "radical" rimary attention be given to the concepts proposal tha. the research, rather than to the data and and methods
doctor
consultant
patient's
cannot accept authority xvithout also ac-
and treatment,
ects.
floxv
"patients,"
as
cepted the authority, prestige, and other rewards that go along xvith his role as
in a recently issued series of
for
the
receiving
of investigators
What
has
sort of "licensure" or "boards"
used to
test
kind
"pathologist"
serves
xvell
of to
his
detect
qualifications,
his
or
the
responsibility?
are
and xvhat
"review panel" When he
failings?
performs experiments by applying untested unproved models to a research project, does he obtain informed consent theories or
the statistical analysis. 38
from the investigator who is his "patient"?
By what kind
Pythagorean or other pledge the quality and
of
he
does
oath
How often does he How are his instances
journals
commit malpractice?
review
guarded against,
appropriate
and the people who plan
papers,
semi-
statistical
dom
insights
considered during a
edu-
statistician's
Sporadic papers have appeared about the errors committed during consultative activities, 29, 36 and the retiring president of a statistical society may occasionally use his farewell address to warn his colleagues about certain intellectual or practical blunders. 33, 30
In such an address two years ago, Frank Yates 39 complained that "the standard of much day-to-day statistical work is regrettably low. . . . These matters are primarily the responsibility of the university statistical departments and are a direct consequence of the present-day obsession with advanced theory, largely divorced from practice."
Yates also quoted an earlier concern of Sir Ronald Fisher: "We are quite in danger of sending highly trained and highly
.
.
.
young men out
never solicit the attendance of expert clinicians who might contribute occasional insights and touches of reality to the proceedings.
The need for vigilant critical self-appraisal by biometricians has been accentuated in recent years because of the increased opportunity both for statistical consultants to commit malpractice and for the malpractice to have catastrophic effects. In the days when a statistician worked mainly post hoc to perform statistical analysis for a completed research project, his choice of the wrong analytic procedures might create an intellectual nuisance, but would not affect either the basic design or the primary data of the project. After the statistical errors were discovered, they could always be rectified later with a better set of analyses. If the statistician gives improper advice before a project
the
begins, however, he can distort both the
with a dense fog in the place
design and the data, so that the project may not be salvageable later. After the
intelligent
world
into
where their brains ought to be." These random censures by leaders of the profession have not brought a tradition of introspective review and constant self-criticism to statistical consultants.
misdeeds in planning are and corrected, the entire project would have to be repeated. The
repetition
of statistics, there
clinicopathologic consultant's identified, sion.
not too difficult for relatively
the ones for which statisticians are asked
regularly
are
sought,
and revealed for public discusand publications of societies,
is
no counterpart of a in which a
In the meetings
may
recognized
small projects, but such projects are seldom
conference,
errors
biostatistical
tents
is
and
statistician's
literature
In the educational processes
the parochial con-
who work
prevent statisticians
to provide advance consultation. The type of research that is most difficult to repeat is a massive project in which many observations are performed during many years. This type of project, as exemplified by a large-scale cooperative clinical
is
statistical authorities
with clinical topics from receiving regular exposure to comments or suggestions from connoisseurs of the topics. Although statisticians
clinical
mitted I
invite clinicians either to
write
to
biostatistical
nars on topics in clinical research almost
cation.
.
do not or
world
in the
compensated for? These issues have rarely been discussed in the literature of statistics and are selor
;
phenomenon does not occur The editors of
of biometry.
ethics of his practice?
of malpractice detected,
write
are constantly asked to speak at
meetings, to
clinical
instructive
readers
of
those
to
referee papers
journals,
papers
for
journals,
sub-
and even the the
to
clinical
converse
the
type
increasingly
of
research
delegated
to
has
that
design
the
and that
trial,
been
offers
of
them
the greatest opportunity for transgressions that cannot be remedied easily, if at all. The complex logistics of a large-scale clinical trial create major difficulties for every aspect of the review, appraisal, and
verification that research must receive to be accepted in the scientific community. For purposes of review, the primary data
luctant to recognize or admit that
of a modest-sized project are readily avail-
"above the battle" and become actively embroiled as a partisan in the disputes. The patients whose future therapy should have been enlightened by the results are
for
mammoth among
fused
clinical
puterized
dif-
purposes
For
summaries
numerical
small
relatively
data
may be
coded pages a dense hulk of com-
conversions.
the
analysis,
raw
the
trial
a vast array of
or transformed into
shown
but
inspection,
able of a
of
of
a
can generally be simple tabulations; but
project
entirely
in
in a large clinical trial, even the summaries may be too abundant for complete citation.
In the material selected for publication, crucial information may be omitted, dispersed among a plethora of tables, or obscured by conversion into percentages, regressions, and other statistical "adjustments." And the clinical trial can seldom
expended large sums of money flawed
emerged
1
challenges of statistical consultation have
begun
recently
among
years, these
plated
and
:
able its
to
acquire
support.
Even
and who are
new
effort
the
necessary
if
new
funds
for
investigators
and
problems have been contem" in major papers, abstracts,
letters to the editors of
Biometrics,
as
such journals
A
Statistician.
particularly exten-
and discussion of the
presentation
appeared
Journal of
in the
My
the Royal Statistical Society.™*
object
the remainder of this discussion
augment
The
and
Technometrics,
issues recently
in
comment
to receive public
biometricians. Within the past few
sive
the
about whatever treatment has
condemned. Although misdesigned clinical trials have not yet been given specific attention, some of the problems in meeting the general
assembly of another group of investigators are willing to devote many years labor in
position
its
as extolled or
American
of
leave
instead to the uncertainties of medical
left
dissension
be repeated. The repetition would require
who
may
product,
has
it
for a badly
this established
to
is
foundation of constructive criticism.
funds can be obtained, however, the new data may still not settle an existing argument, because the previous workers may claim that the populations under investigation were different.
years,
to observe defects in activities that occur
For all these reasons, the basic strategies of design and analysis in a clinical trial must be particularly circumspect and above suspicion. When disputes arise about the conclusions,
deficiencies in the basic statistical planning strategies can have devastating consequences. The investigators will have worked prodigiously for results that cannot be either reliably analyzed or confidently utilized. In addition to the effort already invested in the project, large amounts of energy may be consumed
Having worked both
In
the main failings of I
would be
colleagues of
the
committed emotionally to their years of devoted but possibly misdirected labor, become even more fervently committed to defending their results against the inevitable criticism. The sponsoring agency, re-
clinical
recent
biostatistical
publications,
I
my
than
I
fellow clinicians. 10 fair to
neglected
my
to
"
12
statistical
note
some
on
their
imperfections
side of the street.
To
illustrate the
trast the
way
problems,
trations,
as
shall con-
and and perform
The
consultants.
seem appropriate because
a tradition
I
that statisticians
cians are prepared for
analogies
statisticians,
less if
apparent
the controversies that follow as the par-
and
the
several
have described what seem to be some of
activities
investigators
on
bilaterally street.
many
have had an unusual opportunity
I
in
ticipating
as a clinical investigator and biostatistical consultant for
for these illus-
and heritage that
2,500 years old. Statisticians experiential anecdotes,
their
clinical
clinical consultants
ways admire the way that
clini-
is
have
more than
may
not
al-
clinicians recite
make unquantified judgments, and deliver personal care, but
we cannot deny that clinicians have been deliberately prepared for their role as consultants, and that they have had enormous experience and frequent success. As statisticians, we might be able to profit by studying their methods.
lent
comes
1. Formulating a problem
In consulting a medical doctor, a patient is not expected to express his problem in a clear, articulate, well-organized manner. The patient believes that the doctor will know what questions to ask and will know how to organize the information in a manner suitable for characterizing the problems and planning their management. Consequently, one of the first things that a clinical consultant learns is an intellectual structure for getting and arranging the information used to formulate patients' problems. The structure, reflected in the contents of the history and physical examination, contains such components as the chief complaint, present illness, and review of systems, and the formulation is expressed in such terms as diagnosis, pathogenesis, prognosis, and therapy.
The statistical consultant, however, has not been specifically trained in an intellectual discipline suitable for acquiring the necessary data and formulating the logical problems of a research project. As noted earlier in this series, the statistical courses in "experimental design" are inadequate both for the "experiments" and for the "design" of the work done in clinical research. Many clinical projects are conducted as surveys or methodologic explorations, not as experiments, and even for projects that are truly experiments, the statistical principles do not distinguish the differences between interventional and explanatory experiments. 14 In "design," the statistical principles do not contain details of the fundamental scientific concepts needed to plan the basic architecture of clinical research, and a specialized knowledge of statistics does not become cogent until the major parts of the architectural design have been completed. 15
An excellent statistical consultant, of course, becomes a connoisseur of the necessary scientific concepts and constantly recognizes their intellectual priority in the design of research. But he must depend on his willingness and ability to make these perceptions after he begins his consultative activities. The details and importance of the scientific principles are not usually transmitted as part of his undergraduate or postgraduate instruction.
The preparation for work as a consultant thus contains antipodal contrasts in the education of clinicians and statisticians. A clinician is taught to identify and formulate patients' problems in a carefully structured manner; but he is then left to develop diverse tactics of "judgment" for managing the outlined problems. A statistician is taught a carefully organized set of mathematical structures for managing an outlined problem; but he is left to develop diverse judgmental methods for identifying and formulating the problem. A clinician may emerge able to express the questions but unable to find the right answers; the statistician may emerge with the right answers but unable to select the questions.
The difficulty in outlining critical details of scientific architecture in clinical research would not be a major occupational hazard for a statistician if he could rely on his "patients" to provide the outline. Unfortunately, however, in appearing as a "patient" for the statistician, the clinician is often unable to provide a well-organized account of the "chief complaint" or other intellectual maladies of his problems in research. The clinician may have learned how to be a medical consultant for patients, but not how to be a "patient" for statistical consultants.
Like the statistician, the clinician has also had no rigorous instruction in the scientific methodology of investigation. The clinician's "basic science" courses in medical school taught him an array of facts and laboratory procedures that are not readily applicable in clinical investigation 17; his courses in statistics, if any, were not concerned with scientific architecture in research and usually dwelt mainly on theories of probability and on techniques
for performing statistical tests; his clinical training exposed him to the anecdotage and unquantified imprecision that he often seeks to escape by getting consultative help; and nowhere did he receive any formal instructions about how to express a clear, succinct objective in his research and how to formulate a logical sequence of scientific procedures for attaining the objects.
The biostatistical consultation may thus become a peculiar paradox. A "patient" who has ideas about research but who may not know how to express them precisely or quantitatively comes to seek help from a statistical "doctor" who knows about precision and quantification, but who may not know how to express ideas about research.
2. Taking a history
In taking a history, any good clinician knows how to deal with a patient who is excessively garrulous, reticently uncommunicative, or flagrantly imprecise. The clinical consultant will usually interrupt garrulosity, probe reticence, and delineate imprecision. Moreover, if the patient is unable to provide a satisfactory history, the clinician knows how to improve it by enlisting the aid of family, friends, or interpreters. The clinician becomes adept at developing these interrogational skills not only because he has learned a basic intellectual architecture to indicate what information is necessary, but also because he knows that he is obligated to get the information. The standards and traditions of his consultational craft demand that he do so.
As a statistical "patient," however, the clinician may, for the reasons cited previously, give a poor history. Aside from knowing that he ought to have controls and that randomization and double-blind are good things, the clinician "patient" may be garrulous, reticent, or imprecise in trying to describe the "ailment" of his research. He may talk interminably about the work he has done for the past 10 years and may discuss a variety of conjectures about the mechanism of the phenomenon he is studying, without ever stating the objective of the current project. He may become reticent about expressing a specific objective when he arrives with a pile of data and wants to know "If I've got anything significant here," or when he says "I'm interested in studying phenomenon X," without indicating what aspect of X is to be assessed for what purpose. He may become imprecise and refer only to a nebulous "judgment" and "experience" when he is asked to provide statements of criteria, evidence of validation, or tests of reproducibility for the phenomena under investigation.
A clinical consultant who was confronted with such a poor history-giver would know what to do about the situation. The clinician's knowledge of the necessary medical outline would help him determine what information to get, and his traditional obligation to get the information would evoke the necessary effort to do so. The statistician, however, may have neither the knowledge nor the obligation for his task. Having no scientific outline as a guide that might ease the history-taking and having no professional traditions that create a responsibility for extracting the crucial information, the statistical consultant may meet the challenge with several varieties of evasion.
One type of evasion is used for an investigator who comes with a specific collection of assembled data but who has difficulty describing what the research was about and what the data are supposed to show. The statistician gets rid of the "patient" by saying, "Let me have your data." In isolated comfort, the statistician can then process the data with a series of appropriate analytic statistical tests.
A somewhat different evasion is used for an investigator who comes with an imprecise proposal for a clinical research project. After a few questions about confounding variables and hard endpoints, the consultant says, "Let me have your protocol," which can then be contemplated and manipulated in the privacy of the statistician's theories about "experimental design."
In both types of evasion, the statistician deprives himself of an adequate history and formulation of the scientific problems. Working in a Procrustean manner, he may take his knowledge of statistical tactics and adapt the research problems to fit those tactics, instead of choosing or adapting his tactics to fit the problems.
3. The management of therapy
With advances in technology, clinicians may sometimes order myriads of laboratory tests rather than take a history or think. The availability of digital computers has had similar effects on statistical consultants. In the "ancient" days when all the statistical computations had to be done with a desk calculator, a statistician had the incentive to be selective in choosing analytic tests, if for no other reason than to spare himself the labor of the calculations. If data were available for 14 variables, only two of which were deemed important, the statistician would analyze those two variables. Today, with almost no effort, a statistician can get a computer to massage all 14 variables in diverse ways, to perform "factor analyses" on the lot, and to prepare a matrix of correlations for each variable with every other variable. Receiving the enormous pile of print-out, the investigator may have to spend days or weeks searching for what he wanted. He may not find it at all, but the magnitude of the computations will usually convince him that the consultant has done a "thorough" job.
A different statistical counterpart of defective clinical practice occurs when the statistician "treats" the protocol of a clinical trial without having established a careful "diagnosis" of what is wrong or what needs to be done. Clinicians may sometimes, in "shotgun" manner, dispense antibiotics for fever or tranquilizers for anxiety, without probing deeply into the sources of the fever or the anxiety. But statistical consultants may also, in an ex cathedra pronouncement, prescribe fixed dosages, "pure" populations, randomization, double-blind procedures, dimensional measurements, and hard endpoints, without determining whether these standard agents of the statistician's "therapeutic armamentarium" will distort the objective of the research or invalidate the results.
The practitioner of this type of statistical shotgun therapy may be as difficult to convince of his malefactions as is the clinician who uses antibiotics and tranquilizers indiscriminately. As justification for the treatment, the clinician can usually point to the patient's subsequent improvement, and the clinician may become incensed if his judgment is later questioned for having caused possibly needless expense and risk to a patient who might have either improved without the chosen drugs or improved more rapidly with others. As justification for shotgun statistical treatment, the statistician can usually offer certain obvious improvements of a faulty protocol, and he may become incensed if his efforts in bringing well-established statistical principles to the project are decried as being so consummately statistical that the results are not meaningful in clinical science.
A shotgun statistical design of a clinical trial can create many problems that have been discussed in detail elsewhere. 11-15 The basic difficulty is that the results of the trial become useless for scientific clinical reality, because crucial clinical activities were either omitted or distorted in order to fit the demands of statistical approbation. A drug that ordinarily requires flexible dosage has not been properly tested if given in fixed dosage.
If patients with comorbid diseases are excluded from a trial in order to test a "pure" population with the "main" disease, the results may be free of confounding variables but also free of any realistic significance for the "impurities" constantly encountered in patients. The results of a trial cannot be related to clinical practice if prognostically heterogeneous patients are combined, randomized, and analyzed as a single group without previous or subsequent division into homogeneous prognostic strata. The investigator's time is wasted when double-blind procedures are used unnecessarily, and everyone's time is wasted when double-blind tactics are needed but omitted in circumstances in which their application might have been difficult, requiring a second set of participating observers. The objective of the research becomes distorted when the quest for dimensional measurements creates the "substitution game," in which the index variables are transferred from the qualitative facts that the clinician really needed to know into unimportant data that he can measure. The conduct of the trial becomes an "ordeal" when the desired clinical targets that are abundant in "soft data" are displaced into uncommon "hard data" endpoints that may require needlessly prolonged observation of needlessly huge numbers of patients.
Aside from the practical difficulties created by shotgun designs, the continued acceptance of such procedures is detrimental to the progress of clinical science. One of the main reasons for the neglected prognostic heterogeneity, displacement of targets, and other defective tactics just cited is that the statistician wants to avoid using imprecise concepts and unreliable data. He eliminates patients with comorbid diseases because the clinician has not adequately classified the diseases or their consequences; prognostic strata are not analyzed because clinicians have not delineated the specifications of the strata; and the "soft data" variables are displaced because the data are neither observed nor interpreted in a reproducible manner. Rather than accept information that has little or no scientific validity, the statistician may prefer to do what seems scientifically and statistically valid, even at the risk of producing meaningless science.
This type of consultative advice is tantamount to giving only pills and injections to a patient who really needs vigorous physical exercise or intensive psychotherapy. A good clinical consultant will not allow pharmaceutical substitutes to replace a patient's need to work out his own problems, and a good statistical consultant should not allow clinicians to escape the scientific defects of their own intellectual inertia. If comorbid diseases have not been properly classified, the statistician should insist that clinicians make suitable efforts to do so. If subjective variables and prognostic strata are critically important, they should not be replaced by what is objectively measured, "hard," and irrelevant. The statistician should demand, instead, that the clinician develop the observational procedures, indexes, and criteria 10 that will "harden" the important "soft" data while preserving them for analysis.
A clinician who has not yet developed the intellectual "muscles" that will enable him to "walk" scientifically will never develop this skill if he constantly avoids the necessary effort by being pushed in a statistical "wheelchair." By absolving the clinician of the need to improve his own important data and judgmental activities, statistical consultants perpetuate the atmosphere that evoked such scientific and clinical distress about statistics more than a century ago. Said Armand Trousseau, 33 "This (statistical) method is the scourge of the intellect: it transforms the physician into a calculating machine, making him the passive slave of the figures which he has massed up. . . . You wish the pupil to see only crude facts and to stifle his intellect: and when, by means of this dismal labor, his mind has been to some extent mutilated, you will ask him to show mental vigour . . . and prolific thought." Said Claude Bernard, 2 "Against these anti-scientific ideas we must protest with all our power, because they help to hold medicine back in the lowly state in which it has been held so long."
4. The arrogance of power
Because so many human illnesses are either self-limited or will subside no matter what is done clinically, every medical doctor will usually have a high "success" rate in his therapeutic activities. This natural rate is augmented by the dramatic cures achieved with antibiotics, surgery, and other modern therapeutic agents, and even when his therapy fails to cure, the doctor's personal concern and compassion can provide comfort. The frequency of these successes makes most patients devoted to their doctors, and the emotional overtones associated with human sickness may make many patients devoutly worshipful of the consultant whose work seemed to provide relief or remedy.
This type of deification is one of the main intellectual hazards of a clinician's occupation. A thoughtful clinician, recognizing the deifying propensity of his patients, may often use it therapeutically, but he must constantly beware that he does not begin, consciously or unconsciously, to believe in all the power, glory, and omniscience that have been attributed to him. Not all clinicians, however, are able to resist this belief, which may lead to various forms of arrogance in the way a clinician thinks about his activities, receives critical appraisal, and interacts with other people.
Although a clinician's basic personality and character are his main prophylaxis against this behavioral malady, certain professional safeguards can help protect him and the public. He must pass standard examinations to be licensed as a practitioner and additional "boards" to be certified as a specialist. An internist receives frequent diagnostic "exposure" at clinicopathologic conferences, and a surgeon's removal of tissue is constantly reviewed by special committees. The procedures for accreditation of hospitals provide a means of reviewing medical record-keeping and for ascertaining that all deaths are checked either at necropsy or with other types of evaluation. A clinician who participates in the activities of a teaching hospital will receive frequent interrogation and intellectual provocation during his "rounds" with house staff and students. The various recent demands for suitable instruction and "disclosure" to patients have improved the ethical tactics of clinical investigators, and the spreading epidemic of legal suits for malpractice has made all practitioners more careful of their decisions and explanations. These activities alone are obviously not adequate to thwart the development of unrestrained arrogance in a clinician's exercise of the power delegated to him, but they can help. And although a clinician may be able to "bury his mistakes," he is seldom able to hide them from himself or his colleagues.
A statistical consultant, by contrast, has no such restraints in his professional activities. There are no types of licensure or accreditation procedures after he receives his academic degree; there are no routine reviews by outside critics of his professional performance; there are no younger minds regularly available or assigned to probe and question his consultative judgments in each of his tasks; there is no code of ethics to keep him from "experimenting" without adequate explanation to his "patients"; there is no threat of legal action to make him worry about engaging in malpractice. The statistician need not even bury his mistakes. If adequately glorified, they can be published.
Moreover, although patients have become increasingly wary of their clinical consultants and are less likely to engage in abject, mute worship, the statistician's clientele now seems more credulous, taciturn, and reverent than ever before. In 1932, Major Greenwood 20 identified the early stages of this trend:
Even writers
the
as recently as
would
demand
still
20 years ago, medical
challenge
peremptorily
that their data should
be treated
at
istically
all
.
seeks for
some
the
application
of
"significant"
(but) the writer
.
.
now
which
result--.
.
believing
to
trillers
patentees of more
This
that
< i
improvement
the
qualifii atJons
and
inference,
but
area
the
in
sideration
reflection
too
all
of
his
The
clinician has
many
.
in
his
to
reasons for this re-
consultant.
statistical
Some of the reasons are the external demands imposed by the contemporary fashions of editors
the
ol
the
senting
clinician's
and granting agencies, but reasons
are
insi entities
absence
repre-
internal,
caused
by
the
training in scientific
of
maintain
to
or
may
process
large
be delighted delegate the entire planning of these of data,
he-
also
statisti-
own ignorance about
methodology and data processing, impressed by the statistician's credentials in "experimental design," and obsessed by the hope that the statistical procedures will offer panaceas for all ailments in research, the clinician "patient" is often deeply grateful merely to be admitted to the office of a statistical "doctor."
Anything given as help thereafter is received humbly and reverently, particularly if it leads to the acceptance of a
methodology and by misdirected training
paper
in
posal for funding. Since the clinician
statistics.
The
likelihood
a statistical consultant in
homage
gullible
of
is
to
particularly high
populational research such as clinical
trials.
A
laboratory
clinical
who
inyestigator
research
usually
has
ideas about experimental design
does
needed
that to
epidemiolo-
may
trials,
to
sel-
done
seldom critical of it, be critical, his only lament max be about the statistician's oche
statistically,
and,
is
when asked
to
r
may be due about
to
statistics,
comments
that
the
clinician's
insecurity
may also fear any might make him seem an but he
"ungrateful patient."
Receiving this awed idolatry from clinicians who are themselves regularly idolized, the consultant's consultant is particularly susceptible to the risk of developing the same forms of intellectual arrogance that may occur in any authority whose pronouncements are regularly accorded an unquestioned omniscience.
controlled
remedy the de-
fects of earlier therapeutic studies,
knowing how
been
casional delays in gixing help. This reluc-
have been obtained in laboratory activities, in which precise objective data can be obtained with equipment constructed by an engineer, requiring no demands on the clinician's ability in clinical observation and
Knowing
what has
understands
tance to appraise the quality of the help
gy. His previous research experience
clinical trials are
dom
definite
and usually asks the statistician afterwards only for technical assistance in performing the analytic tests. The clinician who does populational research, however, may have had little or no exposure to
interpretation.
for publication or of a research pro-
and analy-
sis,
scientific principles of clinical
for
statistician's
scientific
.
delegate authority and
to
sponsibility
.
the-
has had almost no training
clinician
how amounts
cal-computer group. Thus, aware of his
the
today
to
chores to the statistician or to the
con-
experimental
responsibility
tiiese ate. is to the statistician.
willingness
careful
into
frequently
abdicates
investigator
many
the
to
holds no special
the statistician
defective
readily accede,
science,
of
in
:
rtainly
clinician's
may
removal or denigration of "soft" clinical data and judgment. Furthermore, since
powerful magic. to
sake
the
are
statisticians
the
to
methodology, he
by Remington and
current state described
Schotk
of "experimental design." Because laboratory investigation brought no
were
statisticians
now advanced
has
trend
knowledge
.
no! thinking at
to
less
f census data about the number of people live in 1920 and in 1960. Our two numeratoi contain the number of people go;
life.
For example, accord-
of people with
while
alive.
Since thrombophlebitis is seldom a fatal disease, it can often occur mildly and transiently in episodes that escape diagnostic detection because the victim does not seek medical attention. Consequently, if we were planning to use any one of these three diseases as the numerators in a scientific survey, we should ascertain that the compared denominator populations were exposed to the same frequency and intensity of such procedures as electrocardiograms, chest x-rays, routine physical examinations, and other appropriate diagnostic tests.
Within the past 25 years, political poll-takers have twice been badly stung by neglect of the bias caused by "tilted targets." Although reasonably random samples were used for the denominator populations in polls taken before the United States national election of 1948 and the British national election of 1970, the careful sampling procedures were sabotaged with errors caused by two different types of tilting in the target data.
In the recent British election, the polled potential voters seemed more inclined to vote Labor than Conservative, but the pollsters did not ascertain whether the two groups of voters had the same intention of carrying their beliefs to the target of expression at the voting booths. When Election Day came, the Conservatives won because their supporters were more diligent than those of Labor in reaching the target ballot box. As a result of the spectacular error in the predictions for this election, we can expect future political pollsters to inquire not merely about the opinions of the electorate, but also about the likelihood that the potential voters will actually vote.
In the American election of 1948, the target was tilted in a different manner. Because one of the candidates (Thomas E. Dewey) seemed to hold a commanding lead over his rival, the pollsters did not continue testing beyond late September and early October. By the time Election Day arrived in early November, however, many voters apparently changed their minds, and cast ballots contrary to opinions previously expressed in the polls. 20 In a stunning "upset," Harry S Truman won the election. Consequently, political pollsters (at least in the United States) now continue their activities as close to Election Day as possible.
Perhaps the single greatest flaw in contemporary epidemiologic concepts and methods is the neglect of possible problems caused by a tilted target. Although the idea of random sampling has been developed and promulgated as a method for removing bias from the denominators, no substantial methodologic attention or investigation has been given to the possible distortions caused by bias in the numerators. Both of the cited problems in the tilted targets of political polls (the unequal registration of target and the temporal change of opinion) have many counterparts in epidemiologic surveys. These two scientific hazards have been discussed in greater detail elsewhere 12 and will be only outlined here.
The problems of unequal registration occur in groups of people followed concurrently whenever an initial feature of one of the groups or of the "causal agent" produces a disparate intensity of medical examination in the two groups. For example, an unequal frequency and scope of the examinations used to detect disease in compared groups may be responsible for some of the differences in the rates of coronary artery disease found in executives versus laborers, in the rates of lung cancer in smokers versus nonsmokers, and in the rates of thrombophlebitis for women who use "the pill" versus other forms of contraception. Executives are generally more likely than laborers to receive routine electrocardiograms and other periodic "check-ups." Because of the development of chronic cough, smokers may be more likely to receive chest x-rays than noncoughing nonsmokers. The increased medical attention needed for renewing prescriptions or checking other hormonal problems may create more routine examinations for women taking the "pill" than for women using other forms of contraception. Although all of the tilted targets in disease detection may be present in these groups for the reasons cited, the existence of the tilts has never been investigated, and the possible magnitude of the bias in these and other epidemiologic surveys is currently unknown.
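The arithmetic of unequal registration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `observed_rate`, the 10 per cent true prevalence, and the two examination probabilities are hypothetical values chosen for the example, not figures from any survey. The sketch shows two groups with an identical true rate of disease acquiring a twofold difference in recorded rates solely because one group is examined twice as often.

```python
def observed_rate(true_prevalence: float, exam_probability: float) -> float:
    """Expected fraction of a group that is recorded as diseased.

    A case enters the numerator only if the person is actually examined,
    so the recorded rate is the true prevalence scaled by the chance of
    examination. (Hypothetical model: a perfect test, no false positives.)
    """
    return true_prevalence * exam_probability


# Hypothetical inputs: both groups have the SAME true prevalence.
TRUE_PREVALENCE = 0.10

executives = observed_rate(TRUE_PREVALENCE, exam_probability=0.80)
laborers = observed_rate(TRUE_PREVALENCE, exam_probability=0.40)
apparent_ratio = executives / laborers

print(f"recorded rate, executives: {executives:.3f}")
print(f"recorded rate, laborers:   {laborers:.3f}")
print(f"apparent rate ratio:       {apparent_ratio:.2f}")
```

Under these assumed inputs the apparent ratio reflects nothing except the ratio of the examination probabilities, which is why the intensity of examination would have to be measured in each compared group before the numerators could be trusted.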
Other architectural problems
The effects of temporal changes in opinion create a different problem in tilted targets. The problem occurs whenever mortality rates for a particular disease are compared in nonconcurrent populations, observed in different years or in different eras. The mortality rates for diseases are based on the data reported in death certificates, but a death certificate does not list diseases; it lists the diagnoses made by clinicians. The "vital statistics" derived each year from death certificates do not indicate annual changes in rates of disease; they indicate a changing rate of diagnosis.
Since a clinician examines not a "disease" but the clinical condition of a patient, most contemporary names of "disease" depend not on observed evidence, but on inferences, interpretations, or technologic expansions of the actual bedside evidence. Consequently, the "diseases" reported on death certificates will vary both with the diagnostic fashions popular among clinicians during any particular era in medicine, and with the diagnostic tools and tactics available for assigning names of "disease" to the observed clinical conditions. If the criteria and techniques of diagnosis can be shown to remain the same as time progresses, then the annual statistical tabulations may actually reflect an altered natural occurrence of the disease. But if the concepts and technology of diagnosis should change with time, the changing rates of many "diseases" noted on death certificates may be an artifact of target tilting, due to temporal changes of diagnostic opinion, rather than to new conditions of nature.
A simple example of this distinction is the disappearance of the disease dropsy, which was held responsible for so many deaths a century ago, but which seldom seems to kill anyone today, according to modern mortality data. The clinical condition of dropsy is still present in abundance, of course, but it is now called by other names, and its fatalities are usually attributed to some form of cardiac, renal, or hepatic disease. The modern changes from dropsy to congestive heart failure or other names illustrate an alteration solely in the fashion of nomenclature, but many other diagnostic changes are due to the new technologic agents that have been developed, with increasing frequency during the past half century, for identifying "disease."
Such diagnostic procedures as roentgenography, biopsy, endoscopy, exploratory surgery, electrocardiography, electroencephalography, chemical measurements, immunologic reactions, the various assessments of microbial agents, and the entire phantasmagoria of contemporary laboratory tests are almost all a product of the past 50 years, increasingly available and increasingly used during the past few decades. These new diagnostic adjuncts can be expected to have both an immediate and a delayed effect on the occurrence rates of the "diseases" that they identify. The immediate effect is on the "diseases" diagnosed at the medical centers where these tests are usually first developed and adopted; the delayed effect takes place over many years as the new adjuncts are slowly disseminated into use by practicing physicians in the communities beyond the medical centers. The availability and dissemination of these new diagnostic procedures will often differ from one year to the next, and certainly from one decade to the next, creating major changes in the clinical opinions that are offered as names of "diseases."
The temporal changes in diagnostic tactics are particularly important in death certificate data, since the clinician's final statement on the certificate is seldom subjected to revision or confirmation by necropsy. Necropsy is performed for only about 20 per cent of deaths in the United States today 22; it does not always yield a clear diagnostic answer when performed; and its results do not always appear on the death certificate, which is usually filled out before the necropsy is done. Consequently, the "vital statistics" used in so much of epidemiologic reasoning depend mainly on the diagnoses made by practicing clinicians.
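The delayed effect of a disseminating test can be illustrated with a short Python sketch. Everything numerical here is an assumption made for the example: the constant true rate of 50 per 100,000, the detection fractions with and without the new test, and the dissemination schedule are hypothetical, as are the names `diagnosed_rate` and `DISSEMINATION`. The point of the sketch is only that the diagnosed rate can rise steadily while the true occurrence of the disease stays fixed.

```python
def diagnosed_rate(true_rate: float,
                   fraction_using_test: float,
                   detection_with_test: float,
                   detection_without_test: float) -> float:
    """Diagnosed cases per 100,000 when only some clinicians use a new test."""
    detected_fraction = (fraction_using_test * detection_with_test
                         + (1.0 - fraction_using_test) * detection_without_test)
    return true_rate * detected_fraction


# Hypothetical inputs: the true rate never changes.
TRUE_RATE = 50.0  # cases per 100,000 per year

# Hypothetical share of clinicians using the new test, by years since its introduction.
DISSEMINATION = {0: 0.0, 10: 0.25, 20: 0.5, 30: 0.75}

rates = {
    year: diagnosed_rate(TRUE_RATE, share,
                         detection_with_test=0.9,
                         detection_without_test=0.5)
    for year, share in DISSEMINATION.items()
}

for year, rate in rates.items():
    print(f"year {year:2d}: diagnosed rate {rate:.1f} per 100,000")
```

In this sketch the whole of the apparent increase comes from the changing share of clinicians who use the test, which is the quantity that a comparison of rates across eras would need to document.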
Although temporal alterations in standards can greatly affect the way that clinicians "vote" when identifying "disease," no major surveys have been done to study the changing incidence and prevalence of the intellectual fashions, paraclinical tests, and diagnostic criteria that so greatly influence which "diseases" will be designated, when, and where. Even when epidemiologists acknowledge that the availability of new diagnostic tests may change the occurrence of certain "diseases," the temporal dissemination of the tests from academia to community is not considered.
The Surgeon General's report 27 provides another useful example of the limited awareness of some of these hazards. According to that report, the rising incidence of lung cancer between 1947 and 1960 could not be due to any new identification tests because "there were no significant advances in diagnostic methods" during that period. 28 The presumptive reason for this statement is that exfoliative cytology techniques were first described in 1945. What is ignored in the statement, however, is not the mere existence of a test, but its dissemination. Exfoliative cytologic studies of sputum were not in use at many hospitals until at least a decade after Papanicolaou introduced the test. As the sputum "pap" smear became disseminated to practicing doctors, it would help identify many cases of lung cancer that had previously escaped detection. The dissemination of a new test, rather than its mere existence, could thus lead to a rising annual incidence of a disease. This possibility, and many other aspects of changing diagnostic rates for disease, are regularly overlooked in the type of reasoning used in the Surgeon General's report.
The hazards of tilted targets are not unknown to people who work with clinical rather than general populations. In a celebrated account of the error that is now known as "Berkson's fallacy", Berkson 4 showed that differential bias in the rates of hospital admission could cause spurious associations in the concurrence of diseases. Thus, if gallstones are found more often in hospitalized diabetic patients than in a "control" hospital population, we cannot conclude that gallstones and diabetes are associated. We would first have to demonstrate equality of target detection: did members of the general (or the "control") population have the same prevalence of opportunities and tests for detecting gallstones as members of the hospitalized diabetic population?
Another classic error, based on neglect of Berkson's fallacy and on failure to check for a tilted target, occurred during a major statistical study of a necropsy population. 24 Patients who died with and without cancer at necropsy were carefully matched for sex, race, and date of death. Because active tuberculosis was recorded in 6.6 per cent of the cancer patients but in 16.3 per cent of the noncancer group, the investigators concluded that tuberculosis was antagonistic to cancer. Extensive efforts were then instituted to use tuberculin in treating neoplastic disease. This old "blooper" seems to have been forgotten by the many epidemiologic investigators today who engage in similar acts of retrospective "matching." As Mainland 21 describes the denouement of the situation, 23
Then a serious flaw was thought of: the possibility that cancer killed its victims quicker than did many of the other diseases, and therefore gave a person less time to develop florid tuberculosis such as was found in members of the noncancerous group. The concept of a cancer-tuberculosis antagonism was quickly dropped; but we ought not to forget the story, because the biostatistician responsible for the research method and the conclusions was one of the most penetrating thinkers in his field.
The fallacies created by tilted targets have thus become painfully evident to political pollsters, to pathologists, and to clinicians. Although epidemiologists may privately acknowledge the hazard, it is not given major attention in epidemiologic textbooks, in the literature of epidemiologic research, and in such analytic reviews as those contained in the Surgeon General's report.
A century ago, convinced of the therapeutic value of blood-letting, many clinicians argued that therapeutic trials or other proofs of the procedure were not necessary. Epidemiologists today seem equally convinced that no tests or precautions need be taken to ensure against the distortions introduced by a tilted target. In clinical medicine, the value (or lack of value) of the old therapeutic doctrine
was demonstrated as soon as satisfactory scientific studies were done. In epidemiology, some of the current doctrines may also become altered when chronic diseases are investigated with satisfactory scientific methods that arrange for the effects of bias to be assessed in the numerators as well as in the denominators of the compared statistical ratios.
3. Reliability of data
An additional defect of the polling techniques used in the Dewey-Truman presidential election of 1948 was the pollsters' assumption that the "undecided" vote would be split for each candidate in the same proportion as the "decided" ballots. Since Dewey held the lead among "decided" voters in the poll, his share of the "undecided" group would presumably maintain a margin of victory. On Election Day, however, Truman was the choice of most of the "undecided" voters, who had apparently been reluctant earlier to express what seemed to be an unpopular belief. As a result of this problem, the interviewers in political polls now use many additional questions to assess the reliability of the replies they receive.
The issue of reliability in data is the last of the major scientific defects that beset contemporary epidemiologic statistics. Although most scientific investigators will ordinarily go to great lengths to check the reliability of the data with which they work, a corresponding concern has not been evident in epidemiologic surveys. When the data are obtained from mailed questionnaires, the investigators seldom determine whether the questions were understood or correctly answered by the recipients. A second set of identical questionnaires is rarely mailed to determine whether the repeat responses are consistent with those received in the first set. When the data are obtained from death certificates, such target variables as the diagnoses of disease are not thoroughly checked for accuracy. From time to time, an effort is made to verify certain diagnoses recorded on death certificates, but the efforts have always been biased toward the detection of false positive diagnoses only. 10 Thus, when lung cancer appears on a death certificate, investigators may seek substantiating evidence to ensure against a false positive diagnosis, but the investigators do not explore the problem of false negative diagnoses. As noted earlier, about 20% of lung cancers found at necropsy have not been diagnosed during life. Nevertheless, epidemiologists have not performed careful tests to note the way that their data may be distorted by the occurrence of lung cancer in people for whom the diagnosis was not recorded on the death certificate.
These problems in reliability of data are major scientific hazards for the many epidemiologists and statisticians whose analytic work depends on the mortality data assembled by the Bureau of Vital Statistics. An epidemiologist who performs direct populational surveys may not always be as scientifically careful as Deming 6 recommends, but at least he does active research. He must design a questionnaire, find a population, get the questions to the people, receive their responses, and evaluate the responses. By contrast, an analyst of mortality data need not make any plans for collecting information and need not even leave his desk. All he has to do is wait for the Bureau of Vital Statistics to receive and tabulate death certificates, whose design is standardized and whose preparation is a traditional obligation of clinicians. Passively receiving the Bureau's tabulations about rates of death, the epidemiologist becomes a poll-bearer, rather than a poll-taker.
One obvious source of unreliable data in these poll-bearing activities is the unequal skills of the diverse clinicians who fill out the certificates. Instead of carefully checking this source of unreliability, epidemiologic statisticians often regard it as being unimportant. The belief is based on the assumption that the vicissitudes of the clinical data-recorders will create random errors, with cancelling effects in all directions. This assumption might be acceptable
if the data were free from any systematic sources of general bias. At least three main sources of systematic bias, however, destroy the blithe assumption that the only problem to be considered (or ignored) is the diverse diagnostic skills of clinicians. One such source has been discussed earlier in this essay and elsewhere 10: the changing standards of technology, criteria, and concepts of disease from one era to the next. Another source of bias is the bizarre oscillations in disease categories imposed by changes in the hierarchical ratings and coding rubrics assigned to the diagnosed diseases each decade by statisticians. 7, 10, 16, 29 The third major source of systematic bias is the excessive rigidity and general inadequacy of the death certificate as a "questionnaire."
William Farr, who is generally regarded as having founded modern "vital statistics" a century ago, was vigorously opposed to inflexible classification procedures that did not allow for adequate description of doubt. Said Farr, 9 "The refusal to recognize terms that express imperfect knowledge has an obvious tendency to encourage reckless conjecture." Farr's advice seems to have been unheeded by his successors.
In its current format, the death certificate appears to have been deliberately designed to produce a specious specificity. It makes no request for evidence or criteria for the cited diagnoses; it allows no room for honest doubt; and it demands a statement not only of the disease regarded as the cause of death, but also a list of the contributing diseases, arranged in their sequential order of lethality. Every clinician who has ever filled out a death certificate is thus familiar with the problems of the "undecided voter." In many instances, a definite disease has not been identified, and, in many other instances, a single cause of death cannot be selected from the array of diagnoses.
Even when necropsy is performed, the issue of "cause of death" is seldom easily resolved. Although newspapers and magazines often describe isolated cases that were brilliantly "solved" in the morgue, no exciting articles are published about the much more frequent situations in which a specific single cause of death cannot be identified. Sometimes no cause is evident; sometimes there are too many candidates; sometimes death was due to a condition, such as cardiac arrhythmia or diabetic acidosis, requiring detection with special functional or chemical tests that cannot be applied in the routine morphologic examinations at necropsy; sometimes a leading suspect is present, but the actual mechanism of death is uncertain.
The selection of the cause of death is an extraordinarily difficult task, particularly for patients with chronic diseases. A good clinician or pathologist can often identify most of the diseases that were present, but he can seldom say exactly what was the precise cause of death. He often does not know; nor does anyone else. A clinician (or a pathologist) may thus be unable to decide which disease caused death in a patient who had widespread cancer of the colon, advanced cerebral arteriosclerosis, multiple myocardial infarctions, and diffuse bronchopneumonia. Reluctant to make a choice, the clinician is not permitted by the death certificate to be an undecided voter. He is obligated to choose a cause of death, so he picks one, and, depending on his arbitrary choice, the patient enters the statistical lists as a case of either heart disease, stroke, cancer, or something else.
Difficult as the problem may be for a pathologist who has the open body before him, the situation is much worse when the death certificate must be filled out, as most of them are, in circumstances distant from the diagnostic paraphernalia of a major medical center. The clinician is often not exactly certain of any diagnosis, let alone an exact cause of death. Yet he is not permitted to state, when necessary, that the patient died of uncertain but apparently natural causes, nor can the clinician make any other allowances for doubt in listing the sequence of lethal diseases. The
form of the death certificate is there, with all its structured spaces, and each space must be filled in as demanded, with a particular disease cited as the cause of death, and with an order of succession for the "contributory" diseases. If the clinician fails to fill in all the demands or does not precisely state the exact cause of death, the surviving family may think him stupid, the insurance company may delay payment, and the funeral director may even refuse to accept the body. So the clinician often makes a guess. Generally an intelligent, educated, thoughtful guess; a guess that seems as reasonable and as consistent as possible with the diagnostic concepts of the era; a guess that may often be right or often wrong, but a guess. Anyone who has ever experienced the realities of clinical medicine knows about these guesses, and about the way they may be biased by the fashions of the doctor's training and his environment. 1, 10 The results of this guesswork continue, however, to be assiduously collected, industriously tabulated, and avidly analyzed, and the product is called "vital statistics."
A death certificate today is a documentation of the name, age, and sex of a person, and the time and place of an event; the certificate provides a civilized passport to the disposal of a dead body, but not a scientific identification of disease. Until major improvements are made in the collection, organization, and standardization of the data, the information about individual diseases cannot be used in serious scientific activities. Nevertheless, without careful assessment of the multiple human and technologic fashions that can distort the cited diagnoses and "causes" of death, the certificates have been confidently used as the basis for gauging the changing frequency of diseases, for major statistical studies of various "causal associations," and for new epidemiologic surveys seeking etiologic explanations for different geographic rates of chronic disease. As the mortality rates for various diseases rise and fall in different countries at different years, it is impossible to decide how much of the change is due to nature, to technologic changes, to preventive measures, to therapeutic advances, to fashions of opinion, or to inconsistent guesses and conclusions by that undecided voter: the clinician.
Not all modern epidemiologists perform "research" in the manner described here, and many new surveys have been carried out with methods that improve or remove the cited defects. Some of the new "active" investigators have even openly acknowledged the scientific disparity between preaching and practice in the older, "passive" epidemiologic research. According to Wright and Acheson, 11
If the field epidemiologist is to maintain his integrity, he must be prepared to admit that, as often as not, the circumstances underlying the judgments made by him and his colleagues do not quite come up to the standards which he himself may indicate to be desirable in his teaching.
Regardless of the efforts being made in new investigations, the vast bulk of existing epidemiologic statistics about incidence, prevalence and causes of chronic disease has been assembled with the apparent philosophy of "better bad data than none." For many major activities in medical science, epidemiologic statisticians have preferred to work with quantitative data rather than with unquantified recollections alone, and have hoped that "good judgment" would be used in recognizing and evaluating the defects of the data.
The hope that good judgment will remedy poor science has been a wishful dream throughout all of medical history, and the annals of medical history are filled with the ruins of the dream. In the evaluation of clinical therapy, clinicians for centuries used a fallacious philosophy of post hoc ergo propter hoc to justify all the blood-letting, blistering, purging, and puking that were once regarded as the basic staples of treatment. In concepts of the etiology of disease, clinicians used the same post hoc or other defective philosophies to
create beliefs in the angry gods, unfriendly demons, deranged humors, contagious miasmas, visceral inflammations, dystonic blood vessels, defective teeth, and toxic organs that have incorrectly (from ancient to modern times) been held responsible for causing disease.
During the past few decades, the fallacies of post hoc reasoning in therapy have come under scientific reappraisal. Clinicians have recognized the need for appropriate comparison and quantification, and have begun to apply scientific methods in both surveys and trials of therapy. The methods still require many clinical and statistical improvements, but the defects of the old approaches have been amply recognized, and the need for better scientific methods is well accepted.
The fallacies of post hoc etiologic reasoning have received neither the same recognition nor commensurate efforts at improvement. Although Koch's postulates could be used several generations ago to provide experimental evidence for causes of infectious disease, a different methodology has been needed for the chronic diseases that are prime targets of contemporary research. This methodologic vacuum has been filled by statistical procedures.
What is at stake here is not such specific issues as whether coronary disease is caused by physical inactivity, lung cancer by cigarette smoking, or thrombophlebitis by the "pill." The main issue is whether clinicians, mindful of their long history of erroneous therapeutic and etiologic reasoning, are to be led uncritically into a new philosophy in which diverse characterizations of disease and causal "proofs" are based on elaborate numerical analyses of bad scientific data. In diagnostic activities, clinicians have now begun to use extraordinary precision for identifying diseases. In therapeutic activities, clinicians have begun to use scientific procedures for assembling suitable evidence, and have begun to recognize that in treatment, as in guerrilla warfare, success cannot be gauged from body counts alone. But in epidemiologic activities, clinicians have accepted the results of body counts assembled without scientific verification of either the procedures or the data.
To obtain satisfactory medical data from a valid random sampling of a human population is an extraordinarily difficult, almost heroic epidemiologic task. Because the task is so formidable, it has been approached with convenient but defective techniques of sampling and data collecting. Although the results have been widely accepted, thoughtful medical scientists will be uneasy about the wisdom of allowing the heroic diagnostic and therapeutic advances of the past few decades to be accompanied by less than heroic epidemiologic research.
The acquisition of a truly random sample is almost impossible for a large general population. With appropriate ingenuity and effort, however, satisfactory samples can be attained from random choices within suitably selected strata and clusters of the population. 6
The problems of death certificates cannot be solved merely by statistical manipulations of the data and by decennial shufflings of the terms used in coding. The certificate itself must be changed to correspond to clinical realities, and connoisseurs of modern clinical and pathologic diagnosis should be enlisted to help explain those realities to the coders and other statistical personnel. 10 Perhaps the most easily removed of the existing difficulties are the defects due to target tilting and unverified data. These problems can be solved as soon as epidemiologic investigators recall that valid evidence (rather than statistical theory, magisterial computation, or imaginative conjecture) is the primary principle of scientific research.
The obliteration of science during the collection of numbers was deplored a half century ago by Major Greenwood, 14 one of the founders of modern epidemiology:
Because the results of their labour are useful, the compilers and analysers of these statistics are no more entitled to rank as scientific investigators
who manu-
than are the equally useful artisans facture our laboratory apparatus.
Modern epidemiologic investigators who regard scientific method as a fundamental necessity in epidemiology will also note the recommendation of Bradford Hill5:

One must . . . go seek more facts, paying less attention to technique of handling the data and far more to the development and perfection of methods of obtaining them. . . . One need not accept as final what some third party can give or chooses to give.

For such investigators, the main challenges of scientific epidemiology will be found not in the speculator's armchair, the questionnaire-collector's mail room, the tabulator's annual volumes, the statistician's desk calculator, or the computer aficionado's terminal. The challenges will be to develop suitable strategies for choosing a proper sampling of population, persuading and examining the selected people, validating the tests and other procedures for gathering data, maintaining liaison with the population during the follow-up period, and performing satisfactory examinations of the subsequent state. Additional incentive can come from the classical exhortation of Francis Galton13:

It is the triumph of scientific men to rise superior to . . . superstitions, to desire tests by which the value of beliefs may be ascertained, and to feel sufficiently masters of themselves to discard . . . whatever may be found untrue.

When Galton made that remark many years ago, he was urging the use of statistical methods in epidemiologic science. His remark still pertains today; but the need is for scientific methods in epidemiologic statistics.

References

1. Anderson, D. O.: Geographic variations in deaths due to emphysema and bronchitis in Canada, Canad. Med. Ass. J. 98:231-241, 1968.
2. Barrett-Connor, E.: The etiology of pellagra and its significance for modern medicine, Amer. J. Med. 42:859-867, 1967. (Edit.)
3. Beadenkopf, W. C., Abrams, M., Daoud, A., and Marks, R. V.: An assessment of certain medical aspects of death certificate data for epidemiologic study of arteriosclerotic heart disease, J. Chron. Dis. 16:249, 1963.
4. Berkson, J.: Limitations of the application of fourfold table analysis to hospital data, Biometrics Bull. 2:47-53, 1946.
5. Bradford Hill, A.: Observation and experiment, New Eng. J. Med. 248:995, 1953.
6. Deming, W. E.: Some theory of sampling, New York, 1966, Dover Publications. (Paperback.)
7. Dorn, H. F.: Mortality, in Lilienfeld, A. M., and Gifford, A. J., editors: Chronic diseases and public health, Baltimore, 1966, Johns Hopkins Press, pp. 23-54.
8. Farr, W.: Influence of elevation on the fatality of cholera, J. Stat. Soc. 15:155-183, 1852.
9. Farr, W.: Quoted in Greenwood, M.: The medical dictator, and other biographical studies, London, 1936, Williams and Norgate, Ltd.
10. Feinstein, A. R.: Clinical epidemiology. II. The identification rates of disease, Ann. Intern. Med. 69:1037-1061, 1968.
11. Feinstein, A. R.: Clinical biostatistics. II. Statistics versus science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.
12. Feinstein, A. R.: Clinical biostatistics. IV. The architecture of clinical research (continued), Clin. Pharmacol. Ther. 11:595-610, 1970.
13. Galton, F.: Quoted in: Inside cover of each issue of Ann. Hum. Genet.
14. Greenwood, M.: Is statistical method of any value in medical research? Lancet 2:153-158, 1924.
15. Heasman, M. A.: Accuracy of death certification, Proc. Roy. Soc. Med. 55:733, 1962.
16. Israel, R. A., and Klebba, A. J.: A preliminary report on the effect of eighth revision ICDA on cause of death statistics, Amer. J. Public Health 59:1651-1660, 1969.
17. James, G., Patton, R. E., and Heslin, A. S.: Accuracy of cause-of-death statements on death certificates, Public Health Rep. 70:39, 1955.
18. Kendall, M. G., and Buckland, W. R.: A dictionary of statistical terms, ed. 2, New York, 1960, Hafner Publishing Co.
19. Lasagna, L., and von Felsinger, J. M.: The volunteer subject in research, Science 120:359-361, 1954.
20. Likert, R.: Public opinion polls, Scientific American 179:7-11, 1948.
21. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1963, W. B. Saunders Company.
22. National Center for Health Statistics, Division of Vital Statistics, United States Department of Health, Education, and Welfare: Vital Statistics of the United States, vol. I: Mortality, Washington, D. C., 1963, United States Government Printing Office.
23. National Center for Health Statistics: Cycle I of the Health Examination Survey; sample and response. Vital and Health Statistics. PHS Pub. No. 1000, Series 11, No. 1. Public Health Service, Washington, April, 1964, United States Government Printing Office.
24. Pearl, R.: Cancer and tuberculosis, Amer. J. Hyg. 9:97-159, 1929.
25. Pearl, R.: A note on the association of diseases, Science 70:191-192 (August 23), 1929.
26. Rogers, L.: The pollsters, New York, 1949, Alfred A. Knopf.
27. Surgeon General's Advisory Committee on Smoking and Health. Smoking and Health 1964. United States Department of Health, Education and Welfare, Public Health Service Publication No. 1103.
28. Surgeon General's Advisory Committee on Smoking and Health. Smoking and Health 1964. United States Department of Health, Education and Welfare, Public Health Service Publication No. 1103. p. 140.
29. Van Buren, G. H.: Some things you can't prove by mortality statistics, Vital Statistics-Special Reports, Department of Commerce, Bureau of the Census, 12:191, 1940.
30. Wallis, W. A., and Roberts, H. V.: The nature of statistics, New York, 1965, The Free Press. (Paperback edition.)
31. Wright, E. C., and Acheson, R. M.: New Haven survey of joint diseases. XI. Observer variability in the assessment of x-rays for osteoarthrosis of the hands, Amer. J. Epidem. 91:378-392, 1970.
32. Yerushalmy, J.: On inferring causality from observed associations, in Ingelfinger, F. J., Relman, A. S., and Finland, M., editors: Controversy in Internal Medicine, Philadelphia, 1966, W. B. Saunders Company, pp. 659-668.
CHAPTER 13

Ambiguity and abuse in the twelve different concepts of 'control'
Under the same name, this chapter originally appeared as "Clinical biostatistics. XIX." in Clin. Pharmacol. Ther. 14:112, 1973.

Sir Ronald Fisher began his classic book, The Design of Experiments, by pointing out that a critic who refuses to accept the conclusions of an experiment can take "two lines of attack." In one approach, the critic believes that the results have received a faulty interpretation. In the second approach, the experiment itself is regarded as "ill designed." After deciding that the first approach (criticism of interpretation) belonged to "the domain of statistics," Fisher seemed to regret that the second approach (criticism of design) was often employed by people who were not "professed statisticians" and whose main qualification was only "prolonged experience or at least the long possession of a scientific reputation." Fisher complained that "technical details are seldom in evidence" when a "heavyweight authority" attempts to discredit the design of a research project by making assertions such as "his controls are totally inadequate."

Fisher may have been quite justified in disparaging an expert who makes disparaging remarks without citing the justification of "technical details," but where can either an expert or a novice find an account of the details that define an adequate control? Such details are obviously crucial for scientific plans and interpretations, but the details are not presented in the writings of "professed statisticians." The concept and choice of a control receive almost no discussion in Fisher's work, and the word control does not appear in either the index or the table of contents for any of his three epochal books7-9 on statistics and research.

The challenge has also not been accepted by Fisher's successors. Despite his warning that "the statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference," and despite the many subsequent statistical publications devoted to the "design of experiments," a clearheaded account of the principles of control has not yet appeared in the statistical literature. The word control constantly occurs in diverse statistical discussions, but the concept of control is still defined ambiguously, and the concept is often applied imprecisely or erroneously.

According to Kendall and Buckland's dictionary,10 which usually serves as an arbiter of statistical terms, control can be employed in at least two different statistical senses. One of these ideas is stated as follows: "If a process produces a set of data under what are essentially the same conditions and the internal variations are found to be random, then the process is said to be statistically under control. The separate observations are, in fact, equivalent to random drawings from a population distributed according to some fixed probability law." This statistical idea refers to the regulation of a process, and is applied most often as part of the reasoning (to be discussed later) that laboratory workers use for an entity usually called Quality Control.

The second definition of control concerns a ". . . method of testing the new method, process, or factor against an accepted standard. That part of the test which involves the standard of comparison is known as the control." This statistical idea refers to comparison, and is the concept that most research workers would use in describing the role of a control when different investigative maneuvers are contrasted.

In addition to these two definitions, Kendall and Buckland describe three other usages of control. Two of these occur in the terms control chart and control limits, both of which are applied during procedures used for testing quality control. The third additional idea, called control of substrata, is ". . . a term used in sampling inquiries to denote the employment of factors which are being used in a scheme of multiple stratification." According to this idea, we would control the variables, when we divide survey data that have been cited for two baseline (or "initial state") variables, if we create substrata to express the results according to groups demarcated with both variables. For example, if the baseline variables were age and sex, our controlled substrata might be expressed as young men, young women, old men, and old women.

The statistical dictionary thus provides at least three distinctly different ideas for control: the quality with which a process is performed; the maneuver that is compared in an experiment; and the characteristics of the objects compared in a survey. With these three different ideas to create ambiguity in the use of control, it is not surprising that the term is applied with so much confusion. Unfortunately, however, Kendall and Buckland have omitted many other diverse uses of the term control, and the word is actually applied in at least twelve different ways rather than three. In the rest of this discussion, I shall describe the other ways in which the idea of control is used in clinical science and biostatistical analysis. I shall cite some of the points that weighty authorities might make about its abuse. A few of the points are statistical, but most of them rest on straightforward concepts in biology and clinical science.

Perhaps the most direct way to demonstrate these ideas is in reference to the "architectural" model that was proposed earlier3 in this series to describe a research project, and to develop certain scientific principles in the choice of "controls." In that model, the initial state of an entity is exposed to a maneuver and undergoes a response observed in the subsequent state. With this model in mind, we can proceed to the different ways in which the idea of control appears in the architecture of clinical research.

A. The idea of regulation

Four of the twelve different uses of control occur in reference to the idea of regulating, governing, or choosing a particular entity. The concept of comparison is not included in these applications of the term.

1. Control of the maneuver. This type of control in experimentation is what distinguishes an experiment from a survey. In an experiment, the investigator "controls" (i.e., governs, decides, assigns, or chooses) the maneuver to which each investigated entity is exposed. In a survey, the maneuver is chosen by nature or by man, but not by the investigator. Thus, a therapeutic trial is an experiment, because the clinical investigator assigns the treatment to each patient. A survey of therapy is not an experiment, because the treatments were chosen ad hoc by other doctors or by patients, or by both. Almost all of the contemporary epidemiologic investigations of alleged causes of disease (such as diet, urban living, smoking, and oral contraceptive pills) have
been performed not as experiments, but as surveys in which the investigated maneuver was usually self-selected.

If the investigator controls the assignment of the maneuver, he can make the assignment by randomization, thereby allowing subsequent application of statistical inference based on random allocation. If the investigator does not control the maneuver, the allocation occurs with all of the possible bias that can enter its human decisions. The differences found thereafter may be caused by the initial bias, rather than by the effects of the compared maneuvers. The differences found in these circumstances cannot be properly assessed with tests based on probabilistic statistical inference, and one of the scientist's main jobs in analyzing survey data is to seek out the bias and to make suitable adjustments for it. This potential bias in allocation of the maneuver is not removed by selecting a "random sample" of population, since this aspect of randomness refers to choice of the population, rather than assignment of the maneuvers. Thus, if tense people are likely to die young and also selectively move to urban regions, we shall find a higher death rate in urban rather than rural locations, regardless of whether or not the urban-rural sample of population was chosen randomly.

Many of the fallacious statistics contained in so many epidemiologic and clinical surveys today are due to confusion about this type of control. The statistician who learned about research from courses in experimental design may not be familiar with the many potential sources of major bias that can occur in a non-experimental survey, in which the investigator did not control the assignment of the maneuvers. Instead of urging the investigator to search for bias and to correct the data accordingly, the statistician may accept the distorted results as if they were experimental data. He may then massage the data with tests of statistical significance, relative risk rates, logistic function analyses, and other tactics that are statistically inappropriate and scientifically misleading if the basic information has been distorted by unrecognized bias.

2. Quality control. The regulation of a process evokes three different uses of the word control. In the procedures of "quality control" (which refers to the first of the definitions cited by Kendall and Buckland), the process is used to produce a particular entity, such as a barrel of beer or a measurement of serum cholesterol. To determine how well the process is being performed, the investigator uses certain well-defined statistical tactics.

For a tangible product, such as beer, the assessment depends on measuring the size, shape, or some other dimensional property of the finished product. If the individual products have similar values for this measurement, the results are regarded as consistent, and the process is regarded as having high quality. An analogous procedure is used if the "product" is a measurement itself. Such measurements are the principal event that occurs in modern laboratory medicine, where quality control of the numerous chemical, microbiologic, and other technologic tests has become a paramount scientific challenge. The approaches used in meeting this challenge are heavily dependent on statistical principles, and a thoughtful account of the procedures has recently been published by Roy Barnett. From a repeated measurement of the same specimen of material, the laboratory personnel can determine the mean and standard deviation, and can prepare control charts that show the upper and lower control limits for the measurement. The personnel can also develop procedures for determining "external quality control" when the same specimen is measured at different laboratories.

This is the type of activity for which statistical principles are applied magnificently, since the process of measurement was the stimulus for the original ideas that led to development of theories about the dispersion of "errors" in measurement, and that produced the shape of the frequency
distribution now known as a "normal" or Gaussian curve. Although developed for the variations found in the process of measuring objects, these statistical principles were later applied to describe the variability of the objects themselves. The new application of the principles has now become traditional, but has also become a source of major problems in clinical epidemiologic research. We can safely assume that multiple measurements of a single specimen of serum glucose will have a "normal distribution" and we can safely accept the mean of those values as the "correct" measurement. We cannot safely assume, however, that single measurements of multiple specimens of serum glucose from a group of people will be "normally distributed," nor can we safely assume that a "normal range" for the people who provided those specimens will be determined by Gaussian statistics. These problems, which are beyond the scope of this discussion, create major difficulties in clinical activities. In trying to interpret a single measurement for a patient, the clinician must contemplate the variability of the laboratory's quality control, as well as the variability encountered in human populations. To cope with the latter forms of variability, the clinician must consider some of the many other types of control that are not expressible in statistical formulations.

3. Control of the disease process. This type of control refers to the adequacy with which a program of therapy produces certain desired effects in the "activity" of a patient's disease. In hypoglycemic treatment of diabetes mellitus, for example, the purpose is to keep the blood sugar or urine sugar regulated within certain boundaries. When such regulation occurs, the sugar (or the patient) is said to be in "good control." A similar phrase is used in referring to the maintenance of a "remission" in a patient with acute leukemia, or to the lowering of serum cholesterol in a patient with hyperlipidemia.

The description of this type of control often occupies an unusual position in the architecture of a research project. The control is expressed as a variable having such categories as good or bad, rise or fall, and in remission or active, but the variable is not a baseline characteristic noted in the initial state, and it is often not the principal target event noted in the subsequent state. Thus, in a study of hypoglycemic agents for patients with diabetes mellitus, the main target events might be vascular complications, and the concomitant regulation of blood sugar would be an ancillary variable in the subsequent state. In a previous discussion, I suggested the term synchronous variable for the citation of this type of regulation. It refers to an entity that is noted after onset of the main maneuver (such as therapy), and before or during the occurrence (or nonoccurrence) of the events that are the main target variables.

In other situations, the regulation of the disease process may be the principal target of the maneuver. Thus, the production (rather than maintenance) of a remission is often the main goal of treatment for acute leukemia, and the reduction of "disease activity" may be the prime target in patients with rheumatic fever or rheumatoid arthritis.

4. Control of the environment. This topic refers to the investigator's ability to govern the environment in which the research maneuvers are administered. The experimenter in the laboratory can "control" the cages in which the animals live, can ascertain that they are given and ingest the assigned medication, and can make them appear for all of the planned examinations and procedures. The investigator who deals with human populations has no such options, and must contend with the possible bias caused by migration or loss of patients, and by noncompliance with either the prescribed maneuver or with the schedule of follow-up examinations.
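The control limits mentioned under quality control (item 2 of this section) can be sketched in a few lines. This is only an illustrative sketch: it assumes limits set at the mean plus or minus k standard deviations of repeated measurements of one specimen, and the multiplier k = 2, the function names, and the glucose readings are all hypothetical choices that do not come from the text.

```python
from statistics import mean, stdev


def control_limits(values, k=2.0):
    """Return (lower, center, upper) limits from repeated measurements.

    Assumption: limits are mean +/- k sample standard deviations;
    k = 2 is an illustrative choice, not a value given in the text.
    """
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return center - k * spread, center, center + k * spread


def out_of_control(values, lower, upper):
    """List (index, value) pairs that fall outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical repeated glucose readings (mg/dl) on one control specimen.
baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
lower, center, upper = control_limits(baseline)

# Later runs of the same specimen are checked against those limits.
new_runs = [101, 99, 108, 100]
flags = out_of_control(new_runs, lower, upper)
print(f"limits: {lower:.2f} to {upper:.2f}; flagged runs: {flags}")
```

The sketch applies only to repeated measurements of a single specimen, which is the situation in which the text regards Gaussian assumptions as safe; it says nothing about a "normal range" for single measurements across a group of people.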
The difficulties that can occur between the inception of the maneuver and the observation of the subsequent state have
been discussed elsewhere as problems in "intrusion." Among the most striking of these problems are those that arise when an effective oral medication is spuriously deemed ineffective because patients failed to take it; when compliance requires an unusual psychic resolution that may distort the characteristics and outcome of the self-selected group of people who are able to comply; and when the detection of the target event is distorted by inequalities in the frequency and intensity of the diagnostic procedures used to detect the target in compared groups of people.

The performance of compliance and of target detectability can also often be expressed as synchronous variables. Although control of these environmental features of research is another aspect of "regulation," any of the synchronous variables cited here or described in the previous section can later become a member of the "control strata" that are used for comparing results. Many crucial analyses in clinical or epidemiologic research may depend on appropriate division of the population according to compliance with the maneuver; changes in sugar, cholesterol, or white blood count; or the frequency and intensity of diagnostic tests. Nevertheless, as discussed previously, these synchronous variables are generally ignored in most statistical "models" of experimental design for clinical research. The many gradations in this type of "control" may be completely neglected in statistical reports, or the synchronous results may be analyzed in an unsatisfactory retrograde manner.

B. The idea of comparison

The next five types of control all depend on the concept of comparison. The confusions in usage arise because the comparisons are based on different elements in the research architecture. Some of the comparisons refer to the maneuver; others, to the initial state of the objects under investigation; others, to the subsequent state; and yet others to the transitions between initial and subsequent state.

1. Main control maneuver. This control refers to the maneuver that is compared with the principal maneuver: the saline injection vs. ACTH; the placebo vs. the active drug; the high dose vs. the low; non-smoking vs. smoking; urban vs. rural life. It is the sense in which most scientific investigators think about control.

The choice of appropriate control maneuvers is a subtle, sophisticated act of scientific reasoning. As described previously, a proper choice requires suitable attention to a complex array of "technical details." These details include: the potency of dosage or other procedures with which the maneuvers are administered; the relativity with which comparative maneuvers are chosen to demonstrate efficacy or efficiency; the internal accompaniment provided by the ancillary ingredients or excipients of a pharmaceutical agent, or the anesthesia and other concomitants of surgery; the external accompaniment provided by postoperative recovery rooms and solicitous medical personnel; and the concurrency with which comparative maneuvers precede, parallel, or follow the principal maneuver. In view of the scientific complexity, it is not surprising that investigators sometimes choose control maneuvers that are "totally inadequate" and that statisticians may not recognize the defects.

A classic example of an unsatisfactory control maneuver occurred in a highly publicized recent experiment1 designed by a pathologist collaborating with a statistician. Warm air containing cigarette smoke was pumped into the tracheostomy of beagle dogs, who were then examined for pulmonary lesions. Every intelligent layman to whom I have described this experiment has promptly selected the appropriate control maneuver. Warm air (devoid of cigarette smoke) should have been pumped into the tracheostomy of the control dogs. Nevertheless, in the actual experiment, the comparative maneuver consisted of insertion of a tracheostomy tube alone, without anything else.

The decision that a particular comparative maneuver is inadequate cannot be made with any statistical reasoning and is sometimes apparently not evident to
experienced investigators. By considering the stated principles4 of potency, relativity, accompaniment, and concurrency, a reviewer can often quickly discern what may be wrong, but the application and interpretation of those principles will usually rest on a mixture of judgment, wisdom, and common sense.

2. Size of the control group. The "control group" consists of the entities who receive the comparative (or control) maneuver. These entities might be the rats who get the saline injection, the patients who receive placebo, or the dogs who get (or failed to get) warm air blown into a tracheostomy.

An important scientific principle of research is that the groups receiving the control maneuvers and the investigative maneuvers be qualitatively similar before the maneuvers are instigated. This qualitative aspect of similarity depends on the control of strata, as described in the next section. The quantitative similarity of groups is a different issue, however. Ordinarily, we would like the compared groups to have the same number of members. This equal-size principle is intuitively appealing and has the numerical virtue of being most likely, ceteris paribus, to produce "statistical significance" in the results.

On certain occasions, however, particularly when multiple maneuvers are being compared in the same experiment, some of the groups may be made proportionately larger or smaller than others. For example, if a single placebo group acts as the "control" for four treatment groups, we might make the placebo group substantially larger than any of the others. If a single medication is being given at two dosage levels, we might make each dosage group half the size of the other groups so that the results for the two dosages can be combined into a single full-size group for that medication.

The planning of deliberate inequalities in the size of compared groups is one of the statistical subtleties of experimental design. Regardless of why the inequalities are chosen, the investigators who make such plans are scientifically obligated to describe their reasons and justifications. Without such descriptions, the readers of a published report cannot discern whether a gross imbalance in the size of compared groups is due to perceptive planning, capricious plodding, or an unforeseen disaster that the investigators hope will pass unnoticed.

An interesting example of an unexplained gross imbalance in sample sizes occurred in the aforementioned investigation10 of smoking dogs. The group that "smoked" through a tracheostomy contained 86 dogs. The "control" group that had a tracheostomy alone contained 8 dogs. The investigators provided no statement of how and why this bizarre disparity occurred.

3. Control of the strata. This type of control refers to the last meaning cited by Kendall and Buckland. If we were comparing treatment X vs. treatment Y in patients with diabetes mellitus, we would "control" for sex if we divided the patients into groups of men and women, and analyzed the results of the two treatments separately in each sex group. Readers of previous papers in this series should immediately recognize that this type of "control" is achieved by stratifying the patients according to characteristics that are present in the baseline initial state, before the maneuver is imposed. The purpose of prognostically predictive or other forms of stratification is to achieve qualitative similarity in the compared groups. This type of stratification is used to avoid or reduce the bias that may occur when maneuvers are not assigned randomly or when the randomization does not produce equal proportions of important strata in the people assigned to the compared maneuvers.

Attention to this type of "control" has been strikingly absent from most general statistical textbooks and works devoted to "experimental design." One reason is that a statistician who thinks only about experiments, but not surveys, will obviously have no incentive to contemplate the sources of bias that enter a survey. A second reason is the basic fallacy of the currently accepted statistical "model" of experiments.
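The value of "controlling" for sex in the comparison of treatment X vs. treatment Y can be illustrated with a small sketch. All counts below are hypothetical and were chosen only to show how an unequal mixture of strata can distort a crude comparison; the function names are likewise illustrative, and the sketch is not a reconstruction of any analysis cited in the text.

```python
from collections import defaultdict


def rates_by_stratum(counts):
    """Success rate for each (stratum, treatment) cell."""
    return {cell: successes / total for cell, (successes, total) in counts.items()}


def crude_rates(counts):
    """Success rate for each treatment when the strata are ignored."""
    pooled = defaultdict(lambda: [0, 0])
    for (_stratum, treatment), (successes, total) in counts.items():
        pooled[treatment][0] += successes
        pooled[treatment][1] += total
    return {t: s / n for t, (s, n) in pooled.items()}


# Hypothetical (successes, patients) counts. Men are assumed to have the
# better baseline prognosis, and treatment X was given mainly to men.
counts = {
    ("men", "X"): (24, 30),
    ("men", "Y"): (9, 10),
    ("women", "X"): (3, 10),
    ("women", "Y"): (12, 30),
}

stratified = rates_by_stratum(counts)
crude = crude_rates(counts)

for cell in sorted(stratified):
    print(cell, round(stratified[cell], 3))
print("crude:", {t: round(r, 3) for t, r in sorted(crude.items())})
```

With these invented counts, Y has the higher success rate within each sex, yet X appears better in the crude totals because X was given mostly to the stratum with the better prognosis. The stratified comparison is the one that "controls" the baseline state.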
This model, which depends on the effects of only a single maneuver, is inadequate for the common clinical and epidemiologic situations in which a maneuver of human intervention or pathogenesis is imposed on an underlying maneuver of nature (or of man). The purpose of the stratification is to identify what was done by the underlying maneuver to separate the people according to their different degrees of baseline susceptibility for the target event that later follows the imposed maneuver. If the stress, parental longevity, severity of clinical condition, or other features that create this susceptibility are not appropriately identified and separated, the investigator courts an intellectual disaster. He may attribute, to the imposed maneuver, results that are really due to the unrecognized underlying maneuver. This error is the data analyst's equivalent of the post hoc ergo propter hoc folly into which clinical reasoning has so often been seduced throughout the centuries.

Because the current statistical "model" of "experimental design" is concerned with one maneuver, rather than a dual maneuver, statisticians have had no major intellectual incentive to consider the need for two types of "control": a control that compares the overt imposed maneuvers, and a control that creates strata to "equalize" the previous effects produced by the antecedent underlying maneuvers. The literature of modern "science" thus becomes filled with large amounts of clinical and epidemiologic data that are misleading or worthless, because no attempt was made to perform the control stratifications that could help remove the diverse forms of selection bias, susceptibility bias, detection bias, compliance bias, chronology bias, and other forms of bias that permeate contemporary statistics.

An excellent example of this problem appeared in a recent survey11 published in the British Medical Journal. A stellar array of investigators reported that "the children known to have been treated by physicians specializing in the treatment of leukaemia in childhood have survived considerably longer than others" who were treated "in the country as a whole." For this type of conclusion, a reader would surely want to know about the bias introduced by pediatric referral patterns. Were the compared patients equal in their clinical severity of illness? Were the children who were "healthier" and more likely to tolerate both the travel and the chemotherapy generally referred to the specialist physicians, whereas the sicker or moribund children were kept at home to be treated "in the country as a whole"?

A stratification according to clinical severity of illness would therefore be expected in this type of survey. In other forms of cancer, such stratifications are regularly attempted with the "staging systems" that are used in an effort to distinguish severity of illness or other major prognostic differences among patients with the same age and the same neoplastic cell type. The leukemia "working party" stated that they "had taken care to eliminate bias so far as possible from the . . . comparisons." This "care" consisted of stratifying the children with acute leukemia according to age and cell type. The patients were not staged according to the clinical severity of illness or according to any of the prognostic features that might militate for or against referral to a specialist.

Unfettered by the absence of randomization and by neglect of a major source of selection bias, the investigators nevertheless concluded that "there is good reason to believe that the improvements in survival" were due to "the availability of special facilities and expertize." The investigators also recommended that "it would seem desirable that children with acute leukaemia should be referred, where this is feasible, . . . to a centre specializing in the treatment of the disease."

The chairman of the "working party," in a previous publication,12 made the remark that "practising clinicians have not always taken kindly to the statistical approach to medical problems in which patients are considered as units in a more or less homogeneous whole." The absence of suitable stratifications is one of the main reasons for clinical [...] the conclusions that emerge [...] is scientifically unacceptable when the prognostic heterogeneity of patients is left ignored and uncontrolled.

4. The epidemiologic "case-control." In all of the comparisons discussed so far, the scientific reasoning proceeded in a forward direction, from cause toward effect. The comparisons dealt with the initial condition, the maneuvers, and synchronous changes that accompanied the maneuvers, but the populations were all being followed from their initial state toward their subsequent state.
A
in
among chonic
disease
epidemiologists calls for a total reversal of the direction of scientific logic. Instead of
pursuing a cohort group in the customary sequence of cause —* effect, the epidemio-
may
logic investigator tion
whom
in
By
curred.
begin with a popula-
then called
methods of
then
epidemiologist
follows
backward toward the putative Thus, to determine whether con-
The
5.
lead to thrombophlebitis,
would assemble a population of pill-takers and non-pill-takers, follow them forward, and determine the rate with which the people in each group develop thrombophlebitis.
Many
epidemiol-
however, would start with people who already have thrombophlebitis, and would then determine the proportion who had taken the pill. ogists,
In
the epidemiologist
situation,
this
is
confronted with the problem of getting a
group of people for comparison. The people cannot be chosen in the customary scientific manner, on the basis of receiving
control value.
We now
turn to a
form a transition variable. For exwe can measure the baseline level of serum cholesterol and call it the "control value." We then treat the patient with a lipid-lowering agent, and, after a
period
we measure
time,
of
The change between
again.
becomes the
of cholesterol then
that
is
The change
analyzed.
an increment
as
cholesterol
the two values transition
expressed
is
decrement)
(or
if
subtract the control value from the
we new
value; or as a proportion or percentage
we
divide the
not used for comparison in a scientific
sense. It serves merely as a numerical in-
gredient in the subtraction or division by
which we calculated the change that
because
the value of the transition variable.
ducted
the a
in
investigation
backward
is
direction.
The
chosen according to the
pa-
if
value by the control
In this situation, the control value
value. is
new
or not receiving the investigated
maneuver, being con-
is
The
value of a before-and-after pair of values
ample,
scientists
result
group.
different type of "comparison": the before
that
pills
The
"case-control"
greater detail in a later installment of this
cause.
most
a
series.
the "cases"
traceptive
were conveniently
that
tortuous reasoning used for this unique form of "control," and the many biases it ignores or creates, will be discussed in
the effect has already oc-
historical or other
the
inquiry,
and other data
sex,
available to the investigator.
state.
quaint tradition
193
the twelve different concepts of 'control'
Many
investigators
seem
to believe,
is
how-
"effect,"
ever, that this type of control value has a
but the control groups from the "initial state," not from the "subsequent state." Nevertheless, with a majestic dis-
stronger role in comparison, and can sub-
play of antipodal scientific logic, epidemiol-
professor
tients are
not
the
used in
ogists
"cause,"
scientific research are selected
will
choose a contrast group from
"subsequent-state" people in fect has
whom
the ef-
not occurred. For example, in a
study of the allegedly evil effects of oral contraceptive for
this
pills,
the eligible contenders
contrast group
would be people
who do not have thrombophlebitis. To add a semblance of scientific veneer to this retroverted scientific
demiologists
usually forms of "matching."
procedure, epi-
engage
Members
in
various
of the con-
group are usually matched to the case group according to features of age. trast
stitute
for
the
provided by a remember one in-
"control"
comparative maneuver.
I
when an august
stance several years ago
wanted
to
test
the
effect
a
of
certain surgical procedure on a particular
chemical in blood.
He planned
to
measure
the level of the chemical before surgery
and
after
surgery,
and he intended
hold
the
surgery
responsible
changes.
When members
for
to
any
of the institution's
research committee insisted that he examine
he claimed he did not need one because the pre-surgical values were alone satisfactory. The patients would a "control group,"
act as their
The
"own
failure of
controls."
prominent
clinical
micians to understand what
is
acade
meant
1
a controlled comparison might be excused by the fact that many clinicians have had so little training in research methods. It is more difficult to find a suitable excuse, however, when exactly the same confusion appears among statisticians, and when this confusion becomes the instructions offered in statistical textbooks devoted to the design and analysis of research data.
The quotation below is taken directly from a prominent, internationally known textbook of biostatistics. I have found an analogous example in another renowned textbook of statistics. I shall not cite either textbook by name here because my investigation of the literature is incomplete. Had I done the research more thoroughly, I might have found similar examples in many other leading statistical books. The exact text in that book for the "example" of how to perform a paired t-test is as follows:
A certain stimulus is to be tested for its effect on blood pressure. Twelve men have their blood pressure measured before and after the stimulus. The results are shown in Table 8-3. Is there reason to believe that the stimulus would, on the average, raise the mean systolic blood pressure?
From the data contained in the cited table, the authors calculate the difference in each before-and-after comparison, perform a paired t-test, and find that t is only 1.09. Because this value is below the level of "statistical significance," the authors state that "we do not have sufficient evidence to conclude that the stimulus increases blood pressure." The implication is that if the t values were only high enough to be "significant," we would have satisfactory scientific evidence for that conclusion.
This is precisely the type of defective reasoning that clinicians are urged to avoid by getting consultative help from statisticians. The paired t-test might allow the investigator to conclude that a "significant" change had or had not occurred in the two pairs of readings, but no scientist would conclude that the imposed stimulus was responsible for the change. To reach the latter conclusion, we would require a "control" maneuver, not merely a "control" value. The change observed by the statisticians may have been due to anxiety, the experimental setting, a flaw in the equipment, or many other causes other than the observed "stimulus." Nevertheless, in a renowned textbook of statistics, this blatant error in research design has been supplied to students as a fundamental example.
C. The control period
The last three types of control refer to regulation of neither the patient nor the data. They are produced by exactly the same phrase: control period.
1. Stabilization. In many experimental situations, a time interval of observation is used before the research maneuver is imposed. The main purpose of this interval is to allow the observed values of data to "stabilize" into the result that will be called the baseline or "control" value. For example, in the blood pressure experiment just cited, the baseline value may have been a single reading taken just before the stimulus was imposed. Alternatively, the patient may have been observed during a control period of several days or weeks. From the many blood pressure values obtained during that interval, the investigators then chose (or calculated) what was used as the baseline value. The calculation may have produced a mean, median, mode, or some other derivative of the data observed during the entire control period, during a period of apparent equilibrium, or during the few readings that preceded the imposition of the stimulus.
This sound scientific procedure is marred only by the frequent failure of investigators to report the details of how the "baseline" value was chosen from the values observed during the control period.
2. Qualification. A second usage of control period is in reference to a time interval during which the investigator tests a person's eligibility or qualification for admission to a research study. During this period the patient may be called upon to demonstrate such features as compliance with proposed protocol, ability to remain free of diabetic ketosis on diet alone, etc.
This type of "qualification trial" is a reasonable procedure in many research projects. Its main hazard is that it can create substantial bias in the ultimate results. The people who fit the qualifications for compliance or other standards may not be a representative sample of the general population of people who have the condition that is under investigation.
3. Washout. The last usage of control period is in reference to the "washout" interval that is often necessary between the successive therapeutic agents administered in a cross-over trial. The patient may receive placebo or no treatment during this interval. This type of control period is somewhat analogous to the first type in that its purpose is to restore stabilization. The main hazard of the washout period is that it is sometimes too short, or omitted when necessary. A frequent problem, particularly when subjective symptoms are an important variable, is the decision about whether the patient should receive placebo or no treatment during the washout.
The main purpose of this essay has been to point out the confusion that exists about the term control, and to specify some of the "technical details" whose absence was so distressing to R. A. Fisher. The lack of specific information about what constitutes an "adequate control," however, does not seem to have inhibited statistical dissertations devoted to the "design of experiments." Nor has the absence restrained "heavyweight authorities" in the domain of statistics from delivering oracular conclusions about the interpretation of epidemiologic surveys, clinical trials, or laboratory experiments in which the crucial controls were either omitted, malconceived, or otherwise scientifically unsatisfactory.
Of the twelve different types of control cited here, eleven are best discerned, noted, chosen, and evaluated with principles that are inherently scientific and that require no knowledge of statistical theory. Even in the twelfth type, the statistical principles used in quality control require no mathematical sagacity more profound than familiarity with a mean, a standard deviation, and a Gaussian distribution. Because statisticians are obviously confused about the topic of control, clinical and epidemiologic investigators are obligated to clarify the issues and to educate our colleagues about these arcane aspects of scientific research. The statistician who was "raised" to think only about experiments will not appreciate the bias that can occur in surveys where the populations were self-referred and the maneuvers were self-selected. The statistician who regards experimental design as an exercise in analysis of variance and Latin squares cannot appreciate the issues of potency, relativity, accompaniment, and concurrency that are needed to choose a control maneuver. The statistician whose research model makes provision only for a single maneuver cannot appreciate the need for controlling strata in the common research situations that involve a double maneuver. The statistician who mistakes a control value for a control group may have difficulty supplying helpful advice to the people who consult him for guidance in designing research.
Sir Ronald Fisher 8 said that "the statistician cannot evade the responsibility for understanding the processes he applies or recommends." In exchange for the many useful tactics that statisticians have brought to science, the least that scientific investigators can do is to help statisticians understand the processes and meet the responsibility.
References
1. Auerbach, O., Hammond, E. C., Kirman, D., and Garfinkel, L.: Effects of cigarette smoking on dogs. I. Design of experiment, mortality, and findings in lung parenchyma, Arch. Environ. Health 21:740-753, 1970.
2. Barnett, R. N.: Clinical laboratory statistics, Boston, 1971, Little, Brown & Co.
3. Feinstein, A. R.: Clinical biostatistics. II. Statistics vs. science in the design of experiments, Clin. Pharmacol. Ther. 11:282-292, 1970.
4. Feinstein, A. R.: Clinical biostatistics. III-V. The architecture of clinical research, Clin. Pharmacol. Ther. 11:432-441, 595-610, and 755-771, 1970.
5. Feinstein, A. R.: Clinical biostatistics. XII. On exorcising the ghost of Gauss and the curse of Kelvin, Clin. Pharmacol. Ther. 12:1003-1016, 1971.
6. Feinstein, A. R.: Clinical biostatistics. XVII. Synchronous partition and bivariate evaluation in predictive stratification, Clin. Pharmacol. Ther. 13(Part 1):755-768, 1972.
7. Fisher, R. A.: Statistical methods and scientific inference, ed. 2, Edinburgh, 1959, Oliver & Boyd, Ltd.
8. Fisher, R. A.: The design of experiments, ed. 8, New York, 1966, Hafner Publishing Co.
9. Fisher, R. A.: Statistical methods for research workers, ed. 14, Edinburgh, 1970, Oliver & Boyd, Ltd.
10. Kendall, M. G., and Buckland, W. R.: A dictionary of statistical terms, ed. 3, Edinburgh, 1971, Oliver & Boyd, Ltd.
11. Report to the Medical Research Council from the Committee on Leukaemia and the Working Party on Leukaemia in Childhood. Duration of survival of children with acute leukaemia, Br. Med. J. 4:7-9, 1971.
12. Witts, L. J., editor: Medical surveys and clinical trials, ed. 2, London, 1964, Oxford University Press, p. 5.
CHAPTER 14
The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research
This chapter originally appeared as "Clinical biostatistics, XX." In Clin. Pharmacol. Ther. 14:291, 1973.
About two years ago, there appeared 11 a remarkable public rebuke to the epidemiologic fraternity. Dr. John P. Fox, who is co-author of a new textbook of epidemiology, 14 wrote a note "to call attention to and correct a surprising error in a report . . . (that) presumably was produced by highly distinguished persons . . . in a journal that may seldom be seen by epidemiologists." Dr. Fox pointed out that the authors of the cited report had made incorrect use of the terms prospective and retrospective.
Like every domain of science, epidemiology contains certain basic conceptual ideas, and the terms that express these ideas become the fundamental vocabulary of the domain. Investigators trained in other forms of research may not fully understand those ideas when applying them in another domain, and may misuse either the corresponding words or the associated concepts. For example, in reporting the results of therapy, clinicians regularly confuse the epidemiologic nuances that distinguish incidence from prevalence, and mortality rates from fatality rates. Conversely, epidemiologists analyzing general mortality rates for different causes of "disease" regularly confuse the clinical nuances that distinguish a diagnosis listed on a death certificate from the true occurrence of a disease.
The misuse of prospective and retrospective by epidemiologists themselves is unexpected, however, because these terms are fundamental to any scientific reasoning about populations, a subject on which epidemiologists are regarded as experts. Furthermore, the ideas of "looking forward" and "looking backward" are expressed directly in those words, in contrast to the arbitrary meanings that have been created to distinguish incidence from prevalence, and mortality from fatality. The confusion about prospective and retrospective seems to arise not from any arbitrarily created distinctions, but from the unfortunate custom of using the terms to describe, under the same name, two different and sometimes conflicting concepts of research. One concept refers to the directional pursuit of a population; the other refers to the method used for collecting research data.
The basic unit of epidemiologic research is a person who has a particular condition. When groups of persons are under investigation, the prospective-retrospective
terminology has been applied for two different research activities: (1) the chronologic direction in which each person is followed in relation to that condition, and (2) the way that the investigator gets the data that describe each person's observed and other conditions. In an earlier paper in this series, 1 I discussed the conflicting ambiguity and scientific confusion caused by the two sets of concepts, and I suggested that the best way to remove the difficulties was to discard the words prospective and retrospective. We could then use two new sets of terms to describe the two different concepts.
A. The two sets of terms
1. The collection of data. An investigator can assemble research data in two basic ways. In the first way, the person under study was originally observed by people who were not performing a specific investigation, and who reported the observations in routine records. Afterward, to get the research data, the investigator extracts the information available in those records. In the second way, the investigator makes special plans beforehand for the techniques with which each person is to be examined and the data recorded. The first way of collecting research data can be called retrolective, and the second way, prolective.
The ability to use special procedures for acquiring primary data will depend on whether the research was planned before or after the persons under surveillance reached the locale at which their condition was to be observed. For example, suppose we want to study the growth achieved in the first year of life for premature babies born in our hospital during 1965. If we decide in 1964 to begin this research the following year, we can make advance plans for performing suitable examinations to collect the data from birth onward for all appropriate children. If, however, we decide in 1973 to do this same research project, we cannot make any advance plans, because the conditions we want to study were noted at our hospital eight years earlier in 1965. Therefore, to do the research, we would have to rely on the information entered in the hospital's medical records. From the hospital's diagnostic registry, we would find the names of all premature babies born in 1965. From each baby's medical record and from supplemental ad hoc communications, we would collect the necessary data about birth weight and subsequent growth.
Regardless of whether the research was done prolectively or retrolectively, we would attempt to assemble the same group of patients and to follow that group in the same chronologic direction from birth onward. The main tactical difference would be in the methods used for collecting the research data.
2. The direction of populational pursuit. The other set of terms would refer to the temporal direction in which a group of people is followed. The direction can be either forward or backward. For example, suppose we want to know whether oral contraceptive pills predispose to thromboembolism. In forward research, we would assemble a group of women who use the pill and another group who use other forms of contraception. We would then follow the two groups forward and note the rate at which they develop thrombophlebitis. In backward research, we would assemble a group of women who have developed thrombophlebitis and another group who do not have thrombophlebitis. We would then note the proportionate frequency with which the two groups had previously used oral contraceptives.
Both types of research could be done prolectively or retrolectively, according to the methods used for assembling data. The distinguishing feature of the two cited projects would be the direction of populational pursuit, rather than the way in which the research data were collected. In the first project, the population is followed forward, from "cause" toward "effect." In the second project, the population is followed backward, from "effect" toward cause.
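The two sets of terms can be kept apart by recording them as two separate attributes of a project. The sketch below is only an illustration in Python and is not part of the original essay; the `Study` record, its field names, and the rule of comparing calendar years are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    """Illustrative record of a research project (all field names are assumptions)."""
    planned_year: int    # year in which the research was planned
    observed_year: int   # year in which the persons' conditions were observed
    assembled_by: str    # "cause" or "effect": the state used to assemble the groups


def collection(study: Study) -> str:
    """Prolective if planned before the observations were made, else retrolective."""
    return "prolective" if study.observed_year > study.planned_year else "retrolective"


def direction(study: Study) -> str:
    """Forward if groups are assembled by 'cause', backward if assembled by 'effect'."""
    if study.assembled_by == "cause":
        return "forward"
    if study.assembled_by == "effect":
        return "backward"
    raise ValueError("assembled_by must be 'cause' or 'effect'")


# The premature-baby project, planned in 1964 or in 1973 for babies born in 1965.
# Prematurity at birth is treated here as the initial state ("cause").
babies_1964 = Study(planned_year=1964, observed_year=1965, assembled_by="cause")
babies_1973 = Study(planned_year=1973, observed_year=1965, assembled_by="cause")

for label, study in [("planned in 1964", babies_1964), ("planned in 1973", babies_1973)]:
    print(label, collection(study), direction(study))
```

Under these assumptions the two versions of the premature-baby project differ only in the collection of data, whereas the two contraceptive-pill projects would differ only in `assembled_by`, and hence only in direction.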
The word cohort has been established to describe a group of people who are pursued in the forward (or "prospective") direction that characterizes scientific research. Thus, to assess the growth rate of premature babies in the projects described earlier, our first project would have involved a prolective cohort, and the second, a retrolective cohort. In studying the development of thrombophlebitis in pill-takers or non-pill-takers, we would perform cohort research, regardless of whether the data are obtained prolectively or retrolectively. In studying the antecedent use of contraceptive pills in people with or without thrombophlebitis, however, we would not be examining a cohort.
Regardless of how the data are collected, a cohort population is usually divided into two or more groups. The main division is done to compare the principal maneuvers under investigation. If the principal maneuver is an alleged cause of disease, the cohort may be divided into an "exposed" and "non-exposed" group, and the exposed group may be further divided into degrees of exposure. Thus, if cigarette smoking is the putative pathogen under study, the cohort may be divided into non-smokers, light smokers, moderate smokers, and heavy smokers. If the investigated maneuver is a therapeutic intervention in the course of a disease, the cohort is divided into a treated and a "control" group, or into groups who have received different forms of treatment. We might then refer to an exposed cohort and a non-exposed cohort, or to a treated cohort and a control cohort. A second form of division for a cohort is the separation into strata that may be different in prognosis, compliance, ancillary response, or detectability of target event. Regardless of how the first or subsequent divisions are created, the key feature of cohort research is that the groups are followed forward, in a scientific direction, from "cause" toward "effect."
There is currently no word to describe the population that is followed backward in the opposite of cohort research. The term "retrospective" may describe the direction of the research, but it does not provide a name for the different groups under study. The term cases or case-group is sometimes used for the people in whom the "effect" has already occurred, and the term case-control is applied for the people who are in the "non-effect" group. These terms are not particularly specific, since the word case can imply almost anything, rather than the occurrence of an "effect." A good substitute for the cases or "effect" group is probands, a term that has been employed by Taube in several excellent analyses.
The only difficulty with probands (or its alternative, propositi) is that the word has been almost wholly preempted by geneticists. Almost 40 years ago, Sir Ronald Fisher 11 was using proband during statistical analyses of genetic phenomena, and the word has often been applied in a genetic sense since that time. Despite the genetic connotation, however, probands refers to a group of people with an identified disease, and the term could be quite satisfactory for our purposes here. It might still be chosen by epidemiologists who prefer it to the alternatives that follow.
A new word that might be created to describe a group of diseased people is morbery, formed from the Latin morb- (disease) and -ery (collection). This word has no genetic associations, but it has the disadvantage of implying the existence of a disease, and this type of research may not always begin with an "effect" that is a distinctive disease. (We might be studying the "causes" of unemployment by backward pursuit of antecedent characteristics in a group of unemployed people.)
Regardless of how we label the people who do or do not have the "effect," none of the terms just cited conveys the idea of a backward direction, analogous to the forward direction connoted by cohort. In quest of a word for a group of people followed in a chronologically backward direction, I have searched unsuccessfully through several dictionaries, a few varieties of Roget's Thesaurus, and some
Latin lexicons. The best I can do at the moment, until some philologic connoisseur comes up with a better idea, is to reverse the word cohort, and to propose trohoc as the name for a group of people who are followed backward from "effect" or "non-effect" toward "cause." The "effect" group would form a case, proband, or morbery trohoc, and the "non-effect" group would be the control or contrast trohoc.
Thus, in the timing of data collection for a research project, the words "prospective" and "retrospective" would be replaced by prolective and retrolective. In the directional pursuit of a population, the words "prospective" and "retrospective" would be replaced by cohort and trohoc.
B. The problems of trohoc research
Clinicians often have difficulty in understanding what may be wrong with the scientific direction of trohoc research. From the main conceptual errors that have prevailed at different times in medical history, we know the problems doctors have had merely in distinguishing a cause from an effect, without the added burden of distinguishing the direction of reasoning. A more substantive source of difficulty, however, is that clinicians are unfamiliar with trohoc research because it seldom occurs as part of clinical investigation. The clinical investigator will usually either do experiments confined to the laboratory or will perform surveys and trials of clinical therapy. With either of these two forms of research, the clinician follows a cohort forward from an imposed "cause" to an observed "effect." Accustomed to this standard direction of scientific thinking, the clinician may become uncomfortable, uncertain, or logically confused when he encounters a complete reversal of that direction. If the retroversion is accompanied by a barrage of mathematical formulas and statistical tabulations, the clinician may promptly withdraw from the epidemiologic melee, returning to the security of more familiar forms of science while hoping that the epidemiologic experts will know what they are doing and will do it right.
And yet, as Fox pointed out, the experts can be wrong. In the instance about which Fox complained, the term retrospective was erroneously applied by epidemiologists to a "prospective" study of a cohort of people whose data were obtained retrolectively from medical records. This type of error in scientific thinking can be eliminated as a younger generation begins to outgrow or avoid the confusion transmitted by its elders. A much greater source of real or potential error in scientific research, however, is the trohoc investigations to which the term retrospective might be correctly applied.
These potential errors seldom receive sufficient emphasis when students are taught about the trohocs investigated in "analytic epidemiologic" studies. The difficulties of a backward direction may be mentioned in passing, but the instructor or the textbook usually dwells mainly on such things as relative risk ratios and other statistical manipulations of the data. The possibility that the data might be immensely distorted or wholly erroneous does not receive intensive discussion, perhaps because retroverted trohoc research has been the standard practice of leaders in the field.
At a time when all other basic aspects of medical education have come into question, however, a reconsideration of scientific validity in epidemiologic trohocs would not be out of place. A prime challenge is to arrive at a way of illustrating what might be wrong with the way that the backward procedure works, and with the data it provides. For this purpose, I shall draw an analogy from the game of baseball, with apologies to readers in countries where the game is unfamiliar.
Suppose we suspected that right-handed batters among professional baseball players were worse hitters than left-handed batters, and we wanted to get data to test this suspicion. In the prospective cohort approach that has been used for more than a century in the science of baseball statistics, we would determine the number of times that each type of batter came to bat, and the number of times that were
The epidemiologic
We
followed by a hit or an out."
would
then calculate each batter's "batting aver-
age" as a rate of
make our
To
decision about left-handed versus
right-handed the
per times at bat.
hits
overall
we would compare
batters,
averages
batting
the
in
two
groups.
The trohoc, the ablative risk ratio, and 'retrospective' research

To construct a table showing the results, we would put the batters in the rows, as the independent variable used for "sampling," and the hits and outs in the columns, as the subsequently observed phenomena. The table would contain the following numbers:

                        Number of hits    Number of outs
R.-handed batters             a                 b
L.-handed batters             c                 d

The "batting average" percentages to be compared would then be a/(a+b) vs. c/(c+d).

In the trohoc approach, we would not consider the times at bat or even what occurred during the game. Instead, we might consider a particular location in the ball park, such as center field. Whenever a batted ball came into center field, we would determine whether it was a hit or an out. We would then ask the center fielder to inquire and let us know whether the batter had been right- or left-handed. We would then construct the following table:

                                R.-handed batters    L.-handed batters
Number of center field hits            a'                   c'
Number of center field outs            b'                   d'

The rows of this table contain the center field hits and outs that were the basis for our "sampling"; the columns contain the subsequently observed "handedness" of the batters; and the interior letters are chosen to correspond with their presumptive counterparts in the preceding table.

Because we derived our data from center field hits and outs, rather than from all the batting events, we cannot use a'+b' and c'+d' to represent times at bat. We can contemplate only the proportions of hits and outs that were associated with the two types of batters. Thus, we could calculate the proportion of R.-handed hits as a'/(a'+c') and compare the result with b'/(b'+d'), which is the proportion of R.-handed outs. If the first proportion was substantially lower than the second, we would conclude that R.-handed batters are worse than L.-handed batters.

Anyone who understands the game of baseball should immediately recognize what is wrong with these trohoc statistics. We have not compared true batting averages (or rates); we have looked only at the proportion of hits and outs in balls batted to the center field region. More importantly, our station in center field gave us a highly restricted view of what was going on; we have no idea about the number of times at bat that culminated as strikeouts, walks, bunts, infield singles, line-drive outs, infield pop-ups, foul-outs, or batted balls that become either hits or outs in left or right field. Our view was entirely limited to events in center field. If we are sure that the events occurring in R.-handed and L.-handed batters are equally represented by what takes place in center field, our restricted observational focus may be valid, but the only way we can decide about such equal representation and validity would be to get results with a cohort approach, which we have not used.

To translate this baseball analogy into trohoc tactics for studying cause of disease, we need merely substitute a medical center for center field; a person with disease D as a "hit" and a person without disease D as an "out"; exposure to the putative cause of disease as a R.-handed batter and non-exposure as a L.-handed batter.
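The contrast between the two baseball tables can be sketched numerically. The counts below are hypothetical (they do not come from the text) and are chosen so that both kinds of batters have the same true batting average while the center-field proportions differ; the function names are likewise illustrative assumptions.

```python
def cohort_batting_averages(a, b, c, d):
    """Cohort view: rows are batters, so a+b and c+d are times at bat.

    a, b = hits and outs of R.-handed batters
    c, d = hits and outs of L.-handed batters
    """
    return a / (a + b), c / (c + d)


def trohoc_proportions(a_prime, b_prime, c_prime, d_prime):
    """Trohoc view: only center-field events were sampled.

    a_prime, c_prime = center-field hits by R.- and L.-handed batters
    b_prime, d_prime = center-field outs by R.- and L.-handed batters
    Returns (proportion of hits that were R.-handed,
             proportion of outs that were R.-handed).
    """
    return (a_prime / (a_prime + c_prime),
            b_prime / (b_prime + d_prime))


# Hypothetical season: both groups bat .300 over 100 times at bat.
cohort = cohort_batting_averages(a=30, b=70, c=30, d=70)

# Hypothetical center-field subset of those same batted balls.
trohoc = trohoc_proportions(a_prime=5, b_prime=20, c_prime=15, d_prime=20)

print("cohort batting averages (R, L):", cohort)
print("trohoc proportions (R share of hits, R share of outs):", trohoc)
```

With these made-up numbers the cohort comparison shows no difference between the batters, yet the center-field comparison (0.25 vs. 0.5) would lead to the conclusion that R.-handed batters are worse.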
With these translations, an epidemiologist using medical-center data, assembling a diseased trohoc and a "case-control" group, and noting exposure or non-exposure to the alleged "cause" in both groups,

*In accordance with statistical custom in these matters, a walk or a sacrifice hit would not count as a time at bat.
Other architectural problems
performs the same type of analysis just cited.
dI
ever
statistician
a
If
tried
to
analyze
baseball batting in this manner, the sports of the nation's newspapers would be with howls ot derision. Suppose the
•<
rilled
stm^
statistician,
cides
mains
ol
some
leave
to
also
or right-handedness
who threw to each batter. age and height are not known to
Since
have particularly strong ting ability of
player, the choice of age
ball
matching
variable's, rather
important features the
tray
on the bat-
and height
than the more served to be-
just cited,
unfamiliarity
statistician's
but important nuance's
subtleof
effects
an active professional base-
his
oi
left-
of the pitcher
as
His "im-
field.
and
non-pitchers);
re-
keeping center field data, but he will tr\ to
provement" consists the source
center
in
but he
analysis,
unable
unwilling or
observation posl
as
public laughter, de-
1>\
improve the
to
from
separated
the corresponding
for baseball.
in the
with
game
baseball.
the problems b\ "match-
Consequently, the statistician's proposed matching would be regarded as a futile
henever a hit or out appears in center he any attention to the (a d) terms that were meaningless the trohoc. These terms do not appear the odds ratio, which depends only on
be
of non-exposed people- will Let us further assume that the rate
1-e.
of disease in the exposed people
is
p,,
and
the rate in non-exposed people
is
p..
We
then have the following numbers of
longer paj
will
people
in
diseased
the individual values noted for
d
m
the trohoc
used
population. risk"
"relative
for
a. b, c,
and
The odds
ratio
trohoc
thus
a
in
—
eased
p non-diseased
non-exposed and disand non-exposed and l-p 2 )( 1-e). Suppose now,
however,
l-e);
i
==
(
that
the
individual risk rates, which were also not
higher than the rate. d 2
determined.
disease
since
it
quite pleasing statistically,
number
provides a single
be subjected
to diverse
mathematical
about
theories
variance, and "significance."
show is
The
result
it
fails to
The
the individual risk rates.
is
result
quite unpleasing scientifically, however,
because
To of
estimation.
pleasing medically because
less
that can
manipulations with
\alidit\
its
is
so
open
general population to the trohoc group.
we had
make
to
For
this
are both exposed and diseased, detection, di,
detected
in
general
the
of
assumption
to
who
are non-
of people with de-
exposed and and non-exposed and
therefore be as follows:
dis-
eased
dis-
pid,e;
=
For non-diseased peop _d.. 1-e numbers will remain: exposed and non-diseased = (l-pje; and non-exposed and non-diseased = (l-p 2 )(l-e).
eased
=
:
(
)
.
ple, the
If
a trohoc population
relative risk
(or odds)
is
assembled, the
ratio
be
will
Pid,e
(1-pQ d-e)
p= and spec-
tions
the
can now
specificity
data.
similar circumstances, however, such calcula-
at
We
3). re-
result.
The investigator laudably made no to calculate a
+
3
the palpation method gave a falsel)
both directions
in
so that the results were arranged in a 4
even larger pattern of for disagreement
cells
to
original table had
—
x 4
oi
the opportunities
would be even greater when
I
the data
were doubly dichotomized.
On the sensitivity, specificity, and discrimination of diagnostic tests

                           ACTUAL CONDITION
RESULT OF PALPATION     Major fever    Not major fever
Major fever                 15               6
Not major fever             22            1106

A third arrangement depends on a previous decision about clinical tactics. Let us decide that we will always use a thermometer to take the patient's temperature if palpation indicates the intermediate condition of minor fever. Furthermore, for purposes of using palpation as a "screening test", let us assume that we are not really interested in circumstances where the thermometer shows only minor fever. What we really want to know is the reliability of palpation in "screening" for major fever. With these assumptions, five cells are removed from the original nine-fold table, and it reduces to the following four-fold table:

                           ACTUAL CONDITION
RESULT OF PALPATION     Major fever    No fever
Major fever                 15             3
No fever                     3           993

In addition to this difficulty, a separate problem that arises in the construction of any "two-way" contingency table is the assumption that, no matter how many cells it contains, the entity being evaluated is the univariate result of a single test. Many medical diagnoses depend on an aggregate of the results found in several different variables, not just in one. For example, in acute myocardial infarction, the clinical diagnosis would depend on certain combinations of symptoms, electrocardiographic data, and laboratory tests. In acute rheumatic fever, the Jones diagnostic criteria call for an enumerated collection of entries from certain "major" and "minor" manifestations. A test procedure based on input from just one variable is obviously inadequate for determining the sensitivity and specificity of these complex diagnoses.

We would need to use an expression that contains multivariate constituents. An example of such a variable would be fulfillment of composite criteria for diagnosis of acute myocardial infarction. The categories of this variable could be expressed in terms such as yes or no (or uncertain). This method of citing the result of a multivariate diagnostic procedure would allow us to use a 2-way table for comparing the enumerated data of whatever method was employed to confirm the patients' correct diagnoses.
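As a brief aside on the four-fold palpation table (cells 15, 3, 3, 993), the customary indexes can be computed as in the minimal sketch below. The function and variable names are illustrative assumptions. Because both off-diagonal cells equal 3, the result does not depend on which of them is treated as the false positive count.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of actually diseased cases that the test calls positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Proportion of actually non-diseased cases that the test calls negative."""
    return true_neg / (true_neg + false_pos)


# Cells of the four-fold table: palpation result vs. actual condition.
true_pos, false_pos = 15, 3    # palpation says "major fever"
false_neg, true_neg = 3, 993   # palpation says "no fever"

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)

print(f"sensitivity = {sens:.3f}")  # 0.833
print(f"specificity = {spec:.3f}")  # 0.997
```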
On the other hand, because the constituent multivariate elements are lost in a single expression such as yes or no, we would have no direct way of determining the causes of erroneous results when they occur. To track down the sources of false positive and false negative diagnostic errors, we would have to go back and start with each of the multivariate constituents.

Both of the statistical difficulties that have just been mentioned could be overcome (or at least reduced) with a more sophisticated set of mathematical indexes for expressing the relationships. Instead of using Youden's J, or the "index of validity", or any other indexes that depend on doubly dichotomous data in a four-fold table, we could use indexes of association that allow the variables to have polytomous (more than two) categories. Such indexes would include Kendall's tau, Goodman and Kruskal's G, Cicchetti's statistic⁴, and some of the various "kappa" statistics described by Fleiss⁹ or the "lambda" statistics described by Hartwig¹². (If worst came to worst, or perhaps to best, we could simply enumerate the results according to the proportions that were too high, correct, and too low). To consider the correlation between multivariate constituents of data and the patient's confirmed condition, we could use some of the diverse correlation coefficients that can be derived from multiple linear regression or discriminant function analysis.

These statistical improvements in managing multicategory or multivariate data, however, will not solve a more fundamental problem in describing the effectiveness of a test. What seems to be almost wholly overlooked in clinicostatistical strategies for calculating a test's effectiveness is the purpose for which the test is used.

C. Relationship of index and purpose

1. The three types of diagnostic test. Diagnostic tests are generally employed for at least three different purposes: discovery, confirmation, and exclusion. During various types of "screening" procedures, we use a discovery test. Examining people who seem healthy, with no clinical complaints to suggest the presence of a particular disease, we often search for that disease in a clinically "silent" form. Examples of such discovery tests in lanthanic patients are the uses of a serum calcium for hyperparathyroidism, a fasting blood sugar for diabetes mellitus, or a rectal examination for rectal cancer.

A confirmation test is employed in situations where we have strong suspicions that the disease is present. The purpose of the test is to verify this suspicion. The performance of bronchoscopy with microscopic examination of biopsy tissue is a confirmation test for lung cancer; and a glucose tolerance test provides confirmation for diabetes mellitus.

An exclusion test is usually employed to "rule out" the presence of a disease when it is suspected. Such a test is usually too expensive or inconvenient to be employed merely for discovery purposes during routine "screening". For example, a stool guaiac examination might be used for the screening discovery of colonic cancer, but a more elaborate roentgenographic or colonoscopic examination would be needed to "rule out" the disease if its presence is suspected. Certain exclusion tests are cheap enough and convenient enough to be used for screening purposes. Thus, when an appropriate skin test for tuberculosis is negative, the presence of active disease can usually be excluded, although a positive test will neither discover nor confirm active tuberculosis.

Some tests are good for only one of these three purposes. Some are good for two. Some can be used for all three. For example, the performance of sigmoidoscopy, together with biopsy and histologic examination when appropriate, can generally be used to discover, confirm, and exclude cancer of the rectum. A glucose tolerance test can be used to confirm or exclude the presence of diabetes mellitus, but is generally too inconvenient for purposes of screening discovery. The histologic examination of tissue from a bronchoscopic biopsy is an excellent way to confirm lung cancer, but cannot be used to exclude the disease or to discover it during routine screening.

Since diagnostic tests are employed for these different purposes, the statistical indexes of efficiency should be arranged accordingly.

2. Requirements of detection and confirmation. In a discovery test, we want reasonably high sensitivity. If the disease is present, it should be found, even at the risk of getting a high rate of false positive results.
[We are willing to take this risk because a discovery test, when positive, is usually followed by a confirmation test]. In an exclusion test, we want the sensitivity to be even higher than in a discovery test. Unless the sensitivity is close to 1, the risk of a false negative result would keep us from being confident that a negative test has excluded the disease.

The discovery and exclusion tests are thus both intended to have a high sensitivity for detecting the disease when it is present. To get the particularly high sensitivity that is sought in an exclusion test, we must be willing to pay the appropriate clinical price. Thus, to test urine for sugar is a good, cheap, convenient way of "screening" for the discovery of diabetes mellitus, but the urine test will regularly give some false negative results. To measure fasting blood sugar is a more expensive and less convenient discovery procedure, but it is more diagnostically effective because it has a lower false-negative rate than the urine test. If we want to rule out diabetes mellitus with certainty, however, we cannot rely on either of these procedures. We would have to use the much more expensive and cumbersome mechanism of the glucose tolerance test, which, in this instance, would be both an exclusion and a confirmation test.

By contrast, in a confirmation test, we want extremely high specificity, with few or no false positive results. If the test shows that the disease is present, we want to be sure that it is present. We would have no real objection to occasional false negative results, since the confirmation test will probably be ordered after an exclusion test was used to find any cases that might otherwise be missed as false negatives.

3. Combinations of tests. A single test can seldom be excellent for the goals of both detection and confirmation. With rare exception, the same procedure cannot be sensitive enough to find all cases of the disease while simultaneously being specific enough to avoid false positive identifications. For example, the chest X-ray is a quite sensitive but non-specific way of finding lung cancer. Almost all patients with lung cancer have abnormal roentgenograms, but not all people with positive roentgenograms turn out to have lung cancer. Conversely, a positive bronchoscopic biopsy is a quite specific but non-sensitive way of identifying lung cancer. The bronchoscopic biopsy almost never gives false positive results, but it will regularly miss lung cancers that are located at inaccessible sites.

For these reasons, many diagnostic tests are regularly used in tandem. A high sensitivity test is used to find the disease; and a positive result is followed by a high specificity test that will confirm the diagnosis by "excluding" its possible falsehood. Because of these tandem arrangements, the best statistical appraisal of the results will depend on a suitable arrangement of the paired tests. In such an arrangement, the result of the pair might be called negative if the detection test is negative; and the paired result would be called positive only if both the detection test and the confirmation test are positive. The positive and negative results of this kind of paired arrangement would have both high specificity and high sensitivity.

D. Choice of the tested populations

There are important clinical reasons for trying to solve some of the problems that have just been discussed. Perhaps the most important reason is that this form of correlation between the result of a test and the patient's actual condition is the best way of making clinical sense out of the statistical chaos that now exists in demarcating the "range of normal"⁸. If "normality" is determined purely on a univariate basis, according to arbitrary statistical boundaries for a distribution of data, the demarcation will indicate the zone of customary values for the test, but not their clinical connotations in health or disease. If the demarcated zone is to have these clinical connotations, the demarcations must be established in direct correlation with an actual condition of health or disease. This type of correlation can be achieved and evaluated only through the type of bivariate arrangements we have been discussing.

The discussion so far has been concerned, however, only with the defects of existing clinico-statistical strategies and with ways of improving the defects. Unfortunately, these mathematical improvements will not solve the really fundamental biostatistical problems of diagnostic tests. Like so many other sophisticated statistical procedures, the complex indexes of association produce elegant but superficial algebra. The indexes can provide useful methods of quantitative expression for what has been observed, but the calculations are totally dependent on what is submitted as the observed data. And the fundamental biostatistical problem lies in the choice of the populations that
are the sources of the data.

1. The role of clinical suspicion. If we are going to use a test for different diagnostic purposes, it must be evaluated in groups of people who suitably represent the different diagnostic challenges. These people cannot be chosen merely according to whether or not they were demonstrated to have the disease in question. Since clinical suspicions will affect the choice of a test and the evaluation of the test's performance, the tested population must at least be divided according to the existence of clinical suspicions. We would thus choose one group of people who constitute the ordinarily healthy population for whom the test would be used, during "screening", as a detection test. The second group of people would have medical conditions that aroused our suspicion of the disease and that made us want to confirm or exclude it.

The customary fourfold diagnostic table would thus be converted into the following "eightfold" table:

                             ACTUAL CONDITION
RESULTS OF TEST            Positive    Negative
Screened population:
  Positive                    a'          b'
  Negative                    c'          d'
Suspected population:
  Positive                    a"          b"
  Negative                    c"          d"

If these populations are going to approximate reality, we would want P, the prevalence of the actual disease, to be low in the screened population and high in the suspected population. When the test results are correlated with the patients' actual condition, we would calculate at least two sets of values for sensitivity and specificity: one set for the screened population and another set for the suspected population. Thus, instead of a single value for sensitivity [which would be (a' + a")/(a' + a" + c' + c")], we would determine two separate values: a'/(a' + c') for the screened population and a"/(a" + c") in the suspected population. Two analogous calculations would be done for specificity, using the respective b and d values in the screened and suspected populations.

2. The role of pathologic derangement. By inspecting this eightfold arrangement of data, we can begin to see why a particular test might have not one set of values for sensitivity and specificity, but several different sets. Suppose a positive result in the test depends on the disease having produced a certain level of pathologic derangement. When this level of derangement occurs, the diseased persons almost always develop symptoms that arouse suspicions of the disease. In such suspected patients, the test will therefore have high sensitivity. On the other hand, if the disease is present without having reached the prerequisite level of pathologic derangement, the patient may be asymptomatic and part of a screened population. In such a population, the diagnostic test may have low sensitivity.

Once we begin to contemplate a pathologic derangement⁷, rather than the particular entity that is called a "disease", we can also recognize many of the causes of false positive or disproportionately positive results that can destroy the value of a diagnostic test. For example, suppose the positive result of a particular diagnostic test really depends on a derangement in the patient's nutritional status, but suppose we want to employ this test for the diagnosis of cancer. For the evaluated population, we choose the diseased group from hospitalized patients with cancer, and the non-diseased group from healthy technicians, secretaries, and other staff personnel. Since patients whose cancer is severe enough to require hospitalization are often malnourished, the results of their test are usually positive. Since the staff personnel are well nourished, their test results are negative. We emerge from the evaluation process with the belief that we have found an excellent new diagnostic test for cancer: the sensitivity and specificity values are quite high.
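The separate calculations described for the "eightfold" table can be sketched as follows. This is only an illustration: the class name, the helper function, and all counts are hypothetical assumptions (none come from the text), with prevalence chosen to be low in the screened population and high in the suspected one.

```python
from dataclasses import dataclass


@dataclass
class Fourfold:
    """One stratum of the eightfold table (rows = test, columns = actual condition)."""
    a: int  # test positive, disease present
    b: int  # test positive, disease absent
    c: int  # test negative, disease present
    d: int  # test negative, disease absent

    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    def specificity(self) -> float:
        return self.d / (self.b + self.d)


def pooled(x: Fourfold, y: Fourfold) -> Fourfold:
    """Collapse two strata into the customary single fourfold table."""
    return Fourfold(x.a + y.a, x.b + y.b, x.c + y.c, x.d + y.d)


# Hypothetical counts: prevalence is 5% when screened and 50% when suspected.
screened = Fourfold(a=10, b=50, c=40, d=900)
suspected = Fourfold(a=90, b=20, c=10, d=80)
combined = pooled(screened, suspected)

for name, table in [("screened", screened),
                    ("suspected", suspected),
                    ("pooled", combined)]:
    print(f"{name:9s} sensitivity={table.sensitivity():.3f} "
          f"specificity={table.specificity():.3f}")
```

In this made-up example the single pooled sensitivity (about 0.67) describes neither population well: the test finds most of the suspected cases but only a fifth of the screened ones.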
After the test begins to be applied, we will be chagrined to discover that it really has low sensitivity and low specificity. The test fails to detect the neoplasms of asymptomatic well-nourished patients with cancer; and it gives false positive diagnoses of cancer for malnourished patients with stroke, chronic cardiopulmonary disease, or certain enteropathies. Because we failed to include such patients in the original test population, we did not discover the inefficiency of the test until after it became clinically popular.

3. Surrogate and pathognomonic tests. The term pathognomonic is usually applied to a clinical manifestation that uniquely indicates a particular condition. For example, the palpation of spontaneous movement within a suitably sized suprapubic mass in a woman would be pathognomonic of pregnancy. This term can also be used for paraclinical procedures that either delineate, demonstrate, or otherwise identify a particular disease. For example, the histologic findings in an appropriate tissue specimen will be pathognomonic of cancer or hepatitis; a specified set of values in a glucose tolerance test will be pathognomonic of diabetes mellitus.

In a surrogate test, we examine an entity that will be used to represent or approximate the disease we want to identify. Examples of surrogate tests are pap smears for cancer, serum glutamic oxalic transaminase (SGOT) for hepatitis, chest X-ray for tuberculosis, electrocardiogram for myocardial infarction, or urine sugar for diabetes mellitus.

A pathognomonic test is seldom evaluated for sensitivity and specificity. We may worry about observer variability when a pathologist interprets a tissue specimen; or about the standards of glucose ingestion, specimen timing, and chemical measurement when a laboratory performs a glucose tolerance test; but we are not concerned that the test itself may be misleading. It is the surrogate tests that create the main problems of sensitivity and specificity. A surrogate test does not identify the disease; it identifies something else that we hope will denote the disease. We often use surrogate tests because they are simpler, cheaper, and more convenient than the corresponding pathognomonic test. A surrogate test may also be more sensitive. For example, a measurement of serum alkaline phosphatase may detect metastatic cancer that has been missed by a liver biopsy; and a positive chest X-ray can detect tuberculosis that has not shown tubercle bacilli in the microscopic examination of sputum.

To compensate for these advantages, surrogate tests often produce false results and the problem of evaluating sensitivity and specificity. Because the procedure is a surrogate test, it depends on a pathologic entity that is different from the one we are trying to diagnose. To contemplate sources of false positive and false negative results, we must therefore contemplate the mechanisms that might "trigger" a test into errors of omission or commission. These mechanisms will consist of alternative pathologic derangements or clinical conditions. Thus, inflammation may create a false positive pap smear for cancer; and inaccessibility of the desquamated cells may make the pap smear falsely negative. The electrocardiogram may fail to show a myocardial infarction if taken too early after the acute attack and may give false positive results because of some other myocardopathy. Many chemical tests give falsely high results in response to alternative diseases and drugs; and the results can be falsely lowered under other appropriate clinical conditions.

4. The process of discrimination. For all these reasons, a proper evaluation of the surrogate procedures that are called diagnostic tests would require them to receive several different challenges in discrimination. The test must be able to discriminate among a variety of pathologic derangements that might simulate either the target disease or an entity in the clinical and paraclinical spectrum of that disease. The various groups of patients who enter the evaluated population must be selected according to their suitability for providing these challenges. If patients are chosen merely because they do or do not have the target disease, the discrimination of the test will not be adequately evaluated.

The choice of patients to provide appropriate challenges will depend on both the medical spectrum of the disease and its diagnostic co-morbidity. The medical spectrum⁵ of the disease refers to the array of clinical and paraclinical laboratory abnormalities that it can produce. The diagnostic co-morbidity of the disease consists of other diseases that might
be mistaken for it. Diagnostically co-morbid diseases are usually entities occurring in the same topographic location of the body or producing somewhat similar morphologic or other paraclinical abnormalities. For example, the medical spectrum of primary lung cancer would include patients with hemoptysis, with major weight loss, and with abnormal chest roentgenograms. The spectrum of diagnostic co-morbidity for lung cancer would include patients with non-neoplastic pulmonary diseases (such as tuberculosis and chronic bronchitis) and with metastatically neoplastic pulmonary lesions. To evaluate the discrimination of a proposed new test for lung cancer, we would therefore want to challenge the test with patients who represent different parts of the medical and co-morbid spectrum.

The investigated population might thus include the following groups of people: asymptomatic patients with lung cancer; patients with lung cancer and only primary symptoms, such as hemoptysis; patients whose lung cancer symptoms include such systemic effects as major weight loss; patients whose lung cancer manifestations include such metastatic effects as hepatomegaly or bone pain; asymptomatic patients with other causes of pulmonary disease; hemoptytic patients with other pulmonary disease; patients with major weight loss due to other diseases; and patients with hepatomegaly or bone pain due to other diseases.

In a more general statement of principles, the populations used to evaluate the discrimination of a diagnostic test for Disease X should consist of representatives from the following groups of people:

1. Patients with Disease X who are asymptomatic.
2. Patients with Disease X who are symptomatic with a diverse collection of manifestations that cover the medical spectrum of the disease.
3. Patients without Disease X who have other diseases that have produced overt manifestations similar to those in the medical spectrum noted in Group 2.
4. Patients without Disease X who have other diseases that can mimic Disease X's pathologic derangement by occurring in a similar location or by having similar paraclinical dysfunctions.

The sensitivity of a test used for discovery purposes in "screening" will depend on its capacity to identify patients in Group 1. The test's sensitivity for exclusion purposes will depend on its performance in identifying members of groups 1 and 2. The specificity of the test results will depend on its avoidance of false positive results in groups 3 and 4. These four groups would seem to be a minimum demarcation of the necessary comparisons, but additional sub-groupings would be needed in appropriate circumstances.

The complexity of these arrangements may seem distressing, but they are ultimately less distressing than the continued proliferation of diagnostic tests whose inadequacies escape initial evaluations because the initial evaluations did not contain suitable challenges. The over-simplification of the existing tactics for getting "control" groups and calculating statistical indexes has led to the spawning of many tests that are grossly unsatisfactory for clinical purposes. To deal with clinical reality requires a confrontation with clinical complexity. The new arrangements proposed here are both feasible and analyzable after the appropriate data have been assembled. The performance of such complex analyses is not at all a novel idea. It has been, in fact, performed for many years during a generally unquantified procedure called clinical judgment⁵. With increasing advances in technology, clinicians will increasingly have to evaluate the costs, risks, and diagnostic discrimination of new diagnostic tests. If these evaluations are to provide sensible science, the clinical subtleties and complexities of clinical judgment must be acknowledged, adapted, and incorporated into the plans for choosing the patients who are tested and for quantitatively expressing the results.

References

1. Bennett, B. M.: On comparisons of sensitivity, specificity and predictive value of a number of diagnostic procedures, Biometrics 28:793-800, 1972.
2. Bergeson, P. S., and Steinfeld, H. J.: How dependable is palpation as a screening method for fever? Clin. Pediatr. (Phila.) 13:350-351, 1974.
3. Berkson, J.: "Cost-utility" as a measure of the efficiency of a test, J. Amer. Stat. Assn. 47:246-255, 1947.
4. Cicchetti, D. V.: A new measure of agreement between rank ordered variables, Proc. Am. Psychol. Assoc. 7:17-18, 1972.
5. Feinstein, A. R.: Clinical judgment (reprinted edition), Huntington, N. Y., 1974, Robert E. Krieger Publishing Co.
6. Feinstein, A. R.: The pre-therapeutic classification of co-morbidity in chronic disease, J. Chronic Dis. 23:455-469, 1970.
7. Feinstein, A. R.: An analysis of diagnostic reasoning. I. The domains and disorders of clinical macrobiology, Yale J. Biol. Med. 46:212-232, 1973.
8. Feinstein, A. R.: Clinical biostatistics. XXVII. The derangements of the range of normal, Clin. Pharmacol. Ther. 15:528-540, 1974.
9. Fleiss, J. L.: Statistical methods for rates and proportions, New York, 1973, John Wiley & Sons, Inc.
10. Freeman, L. C.: Elementary applied statistics: For students in behavioral science, New York, 1965, John Wiley & Sons, Inc.
11. Greenhouse, S. W., and Mantel, N.: The evaluation of diagnostic tests, Biometrics 6:399-412, 1950.
12. Hartwig, F.: Statistical significance of the Lambda coefficients, Behav. Sci. 18:307-310, 1973.
13. Mantel, N.: Evaluation of a class of diagnostic tests, Biometrics 7:240-246, 1951.
14. Muic, V., Petres, J., and Telisman, Z.: Validity of a diagnostic test designated by a single function, Methods Inf. Med. 12:244-248, 1973.
15. Nissen-Meyer, S.: Evaluation of screening tests in medical diagnosis, Biometrics 20:730-755, 1964.
16. Sunderman, F. W., and Van Soestbergen, A. A.: Laboratory suggestions: Probability computations for clinical interpretations of screening tests, Am. J. Clin. Pathol. 55:105-111, 1971.
17. Vecchio, T. J.: Predictive value of a single diagnostic test in unselected populations, N. Engl. J. Med. 274:1171-1173, 1966.
18. Yerushalmy, J.: Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques, Pub. Health Rep. 62:1432-1449, 1947.
19. Youden, W. J.: Index for rating diagnostic tests, Cancer 3:32-35, 1950.
SECTION THREE
PROBLEMS IN MEASUREMENT

In discussing the methods used for assembling and analyzing data, we begin with the assumption that the basic information is (or will be) there, awaiting the statistical manipulation. For many activities in medical research, however, the main challenges are in the data, not in the analysis. This aspect of research architecture requires decisions about what data to get and what methods to use for the process of measurement that converts an observed entity into an item of data. Statistical strategies are seldom pertinent for these challenges, because the mathematical models are concerned with numerical methods of analysis, not with scientific methods of mensuration.

The four papers in this section deal with a few of the many important problems in clinical measurement that require clinical rather than statistical solutions. For one set of solutions, clinical investigators need an intellectual liberation from the entrenched but erroneous doctrine that data can be "hard" only if expressed in dimensional numbers. The persistence of this fallacious doctrine has been abetted by an unbalanced emphasis on parametric forms of statistical analysis and by the scientific aberration that occurred many years ago when Gaussian ideas about the variance of different measurements for a single entity became applied to the distribution of single measurements for different entities. A separate set of solutions will involve remedial action for the inappropriate use of the word "normal" to describe the shape of a statistical curve, and for misguided attempts to establish medical normality solely according to statistical locations.

Statistical proposals for the assessment of safety and efficacy have been scientifically unsatisfactory because of the failure to delineate what is meant by "safety" or by "efficacy." The many published reports of adverse drug reactions have not been accompanied by a reproducible operational identification of adverse drug reactions; and efficacy is constantly appraised with techniques that do not encompass the wide spectrum of phenomena requiring consideration. To accomplish a large-scale surveillance of the effects of therapeutic agents, careful attention will be needed for many complex issues in clinical specifications for the populations to be observed, the methods of observation, and the clinical analyses for the differentiation of effects. These issues are usually disregarded or oversimplified when the investigators hope that the problems will be solved mainly with mathematical inspirations or computerized extravaganzas.
CHAPTER 16

On exorcizing the ghost of Gauss and the curse of Kelvin

Under the same name, this chapter originally appeared as "Clinical biostatistics. XII." In Clin. Pharmacol. Ther. 12:1003, 1971.

At any era in the history of science, the further advance of science has been retarded by certain fundamental concepts that were enlightening when introduced, but that later, after persisting too long, became stultified barriers to future progress. In his perceptive book, The Structure of Scientific Revolutions, Thomas S. Kuhn 27 has used the term paradigms for these fundamental concepts, which he defines as "universally recognized scientific achievements that for a time provide model problems and solutions to a community of practitioners (of science)." As examples of change in such paradigms, Kuhn cites the transitions from Ptolemaic to Copernican to Keplerian theories of astronomy; and from Aristotelian to Newtonian to Einsteinian concepts of dynamics. An example in the history of biomedical science would be the successive basic alterations in ideas of cardiac physiology from pre-Galenic times to Galen to Harvey to modern beliefs.

Kuhn notes that contemporary paradigms have important roles as stimuli to the growth and development of the "normal science" of an era, but he also points out the way that creative research becomes stultified when adherence to an outdated paradigm produces "a strenuous and devoted attempt to force nature into the conceptual boxes supplied by professional education." 29 When nature refuses to remain in those boxes, "anomalies, or violations of expectations, attract the increasing attention of (the) scientific community . . . (to) . . . the crises that may be induced by repeated failure to make an anomaly conform." 28 The initial response to a crisis is resistance, as the defenders of the old paradigm devise "numerous articulations and ad hoc modifications of their theory in order to eliminate any apparent conflict." 32 Eventually, however, after "normal science . . . repeatedly goes astray . . . the profession can no longer evade anomalies that subvert the existing tradition of scientific practice." 32 The crisis becomes intolerable and a "scientific revolution" occurs, producing a new paradigm, "a new set of commitments, a new basis for the practice of science." 30

My object in this essay is to call attention to two outdated paradigmatic concepts that have been stifling the intellectual growth of clinical biostatistics. One of the concepts, an extension of ideas often attributed to Carl Gauss, is the belief that
the observed data of clinical medicine can usually be expressed with "normal" distributions for "continuous variables" having a "variance" that can be calculated from the observations. The second concept, an extension of beliefs stated by Lord Kelvin, is the idea that scientific data must be expressed objectively in the form of dimensional measurements. Both of these concepts provided major enlightenment when they first became accepted as paradigms; both have now led to major intellectual crises that remain unsolved by various ad hoc modifications of the basic paradigms; and both are now being used to substitute for enlightened thought or to thwart it.

1. The ghost of Gauss

As technologic devices of measurement began to proliferate during the 19th century, scientists became confronted by a phenomenon that we now call "observer variability." Repeated measurements of the same object did not yield identical results. The disagreements immediately provoked the question of how to choose a correct value from among the diverse measurements. The obvious answer to this question was to designate the mean of the measurements as correct. With this assumption, the amounts by which values differed from the mean would be regarded as errors. The magnitude of individual deviation from the mean could be calculated for each of the n measurements; the individual deviations could be squared (to remove negative signs); the squared deviations could be added together; and the sum could be divided by n to provide a "mean" of the squared deviations that was called the error variance. The square root of this error variance was designated as the standard deviation.

One of the great statistical contributions of C. Friedrich Gauss was to notice that these errors in measurement were usually distributed in a specific pattern of symmetry. The most commonly repeated single value was usually the mean itself; small errors on either side of the mean were next most frequent; large errors were uncommon, although present in about equal frequency on both sides of the mean; and extremely large errors were rare. When the frequency of the individual measurements was graphically plotted against their values, the result was a symmetrical "bell-shaped" or "cocked-hat" curve, with its apex at the value for the mean. The name "normal" was given to the pattern of frequencies shown in this curve, and it has become the familiar "normal distribution" on which so much statistical reasoning has depended during the past century. The important intellectual advances that have come from this reasoning are too well known to need recapitulation here. What is less apparent, however, is the enormous obstacle that this reasoning now imposes to progress in clinical biostatistical research.

A. The ambiguity of 'normal.' The decision to call this type of curvilinear symmetry a "normal distribution" was a reasonable act of mathematical nomenclature. The word "normal" has often been applied in various aspects of mathematics and the natural sciences. In clinical medicine, however, the word normal had already been established, long before its statistical usage, in reference to a quite different connotation: the distinction between health and disease. The definition of this distinction is an issue too fundamental and complex for further discussion here. Suffice it to say that the word normal has two entirely different meanings in its two types of usage. Statisticians refer to the shape of a curve; clinicians refer to a state of well-being.

The extensive medical problems created by the confusion between these two ideas about normal will be reserved for a later paper in this series. For the moment, however, we can note that clarity of thought in clinical biostatistics would best be served if the bell-shaped curve had another name. The adjective campaniform, which has been proposed 40 for the curve's configuration, might be a satisfactory "generic" designation, but a commemorative eponym seems more appealing. The phrase Gaussian curve is readily understood and has already been used by many writers, although it perpetuates the same type of historical "injustice" created by the many medical eponyms in which the commemorated person is the man who first popularized a phenomenon, rather than the man who first described it. The earliest account of the bell-shaped curve was by Abraham DeMoivre, not by Gauss. 45

B. The choice of a 'normal range.' Another major problem in normality is the choice of a "normal range." The problem has at least three components: (1) should the idea of normality depend on the characteristics of a state of health or on the value of a numerical measurement; (2) what is the appropriate population whose individual medical characteristics or numerical measurements (or both) will be used to determine normality; and (3) by what method should boundaries be demarcated to create a range? These questions have been increasingly debated in recent years, particularly as a "range of normal" has been sought for the abundance of new paraclinical data produced by the increasing medical use of chemical technology. My own contributions to this debate will be deferred for a later paper, but I have borrowed part of my title here from the title of an excellent recent discussion ("Health, Normality, and the Ghost of Gauss") by Elveback, Guillier, and Keating. 10

C. The logical alteration of 'variance.' In the original development of Gaussian statistics, the term variance referred to inconsistencies in observation. When the same object had received a series of nonidentical measurements, the deviations from the "correct" mean would be the source of error variance. The term error was thus attached to variance in the 19th century in order to connote disparities in the act of mensuration, and the term error variance referred to the size of the discrepancies. Thus, in the sequence of OBJECT → ACT OF MEASUREMENT → VALUE, the mean and the variance referred to the values per act of measurement. Since only a single object had been measured, the variance clearly did not refer to the object itself.

During the 20th century, however, the original logic of variance has been altered, so that variance may now refer to the objects themselves, as well as to the measured values. We can therefore talk about the variance of the data regardless of whether we get 30 repeated measurements of the same object, or a single measurement for 30 different objects. In the first kind of variance, however, we refer to the process of measurement, whereas in the second, we usually ignore the mensurational activity and refer to a characteristic of the measured object. From variance per measured value, the term has been altered into variance per measured object. For example, when we repeatedly measure the value of sodium in a single specimen of serum, we get the variance of the laboratory procedure; but when we look at the single values of serum sodium for a group (or "sample") of people, we get the variance of the people.

One obvious problem of ambiguity in this double usage of the same term is caused by the persistence of the phrase error variance in some statistical textbooks. The phrase may be appropriate in reference to a process of measurement, but the term error is improper and confusing when it prefixes a variance that refers to a group of people. The reader may be misled into believing either that the measured values are themselves incorrect (a point that is seldom at issue), or that something was wrong about the people who deviated from the mean. Aside from this semantic difficulty, however, the logical alteration of variance has had profound effects on the use and abuse of modern statistics.

(1) The distinctions of quality control and populational variance. When the many chemical tests performed in modern clinical laboratories are checked for "quality control," a traditional Gaussian calculation is
applied to values obtained in repeated measurements of the same specimen. The variance of these multiple values helps indicate the reproducibility and precision of the test. When an array of results is assembled for specimens from a group of patients or for successive specimens from the same patient, however, the calculated variance contains an admixture of variances arising not only from the difference in the specimens but also from the vicissitudes of the test. Nevertheless, when the data for two such specimens are compared, most clinicians usually decide that one value is higher or lower than the other without considering the amount of variation attributable to the test. The necessary adaptations, which are beyond the scope of this essay, are particularly well discussed in an intriguing new book by Roy Barnett. The main point to be noted here is that the variations in people, as discerned from a test, cannot be properly interpreted without considering the variations in the test itself.

(2) Homogeneity of sample. In the days when variance depended on the measurements of a single object, statisticians did not have to worry about the homogeneity of the objects being measured. The sampled object was always "homogeneous" unto itself. When measurements are performed for a group of objects, however, the issue of homogeneity becomes fundamental to the scientific validity of the group.

In order to apply any statistical calculations that depend on a group of objects, we must first decide that the objects can be considered collectively as a group. If the objects are sufficiently homogeneous, they can readily be combined for the calculation of a collective mean and variance. If the objects are too heterogeneous, however, the collective values of mean and variance may produce statistical gibberish. For example, we might have no objection to calculating the mean and variance for the height of a group of people, but we might greatly demur from such calculations if the group contained people, mice, elephants, and gerbils, or if the group was confined to people but comprised a mixture of newborn infants, adult pygmies, and professional basketball players.

The complex problems of defining and determining homogeneity will be reserved for a later paper in this series. The only point to be noted now is that a "chicken-and-egg" type of problem has been created by the frequent use of variance (in its newer, logically altered meaning) as a measure of homogeneity. Do we first decide that a group is homogeneous and do we then determine the variance, or do we first measure the variance and then decide that the group is homogeneous?

The classical statistical approach to this question is to determine homogeneity from the calculated variance. If the co-efficient of variation, which is the ratio of the standard deviation divided by the mean, is below 10%, the group is often regarded as reasonably homogeneous. Statistical strategies and tactics are permeated with the idea that homogeneity is determined from post hoc calculation rather than from a priori classification. In most textbooks and other statistical literature, the term "homogeneity" rarely appears alone, and is generally used in the context of homogeneity of variance. Diverse "corrections" and other adaptations have been developed for statistical inferences and tests in circumstances where the variance is not "homogeneous."

The classical scientific approach to this question, however, is to regard homogeneity as an issue in taxonomy, not in statistics. Scientists usually make an a priori judgment about homogeneity before we measure, not afterward. We decide, as an act of taxonomic classification and diagnostic identification, whether our observed objects are mice or elephants, newborn babies or professional basketball players. We do not rely on measurements and calculations for these issues in identification. For example, if a group of objects had a mean weight of 15.1 pounds with a standard deviation of 0.2 pounds, a classical statistician might
conclude that the objects were quite homogeneous because their co-efficient of variation is only 1.3%.* This conclusion might be entirely justified with respect to the weight of the objects, but a classical scientist would demand more description of the objects before drawing any conclusions about their homogeneity. He might then discover that the group consisted of littermates chosen from kennels of small dogs, large cats, and huge birds.
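The arithmetic behind this example can be made explicit. The sketch below, in Python, follows the recipe given earlier for the error variance (squared deviations from the mean, summed and divided by n), takes its square root as the standard deviation, and expresses the co-efficient of variation as a percentage. The function names and the five individual weights are hypothetical illustrations; only the summary figures of 15.1 and 0.2 pounds and the 10% threshold come from the text.

```python
import math


def error_variance(values):
    """Mean of the squared deviations from the mean (divisor n, as in the text)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def standard_deviation(values):
    """Square root of the error variance."""
    return math.sqrt(error_variance(values))


def coefficient_of_variation(mean, sd):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return 100.0 * sd / mean


# Summary figures quoted in the text: mean 15.1 pounds, standard deviation 0.2 pounds.
cv_from_text = coefficient_of_variation(15.1, 0.2)
print(f"co-efficient of variation: {cv_from_text:.1f}%")

# Hypothetical individual weights (assumed for illustration) with a mean of 15.1 pounds.
weights = [14.9, 15.3, 15.1, 14.9, 15.3]
mean_weight = sum(weights) / len(weights)
sd_weight = standard_deviation(weights)
cv_weights = coefficient_of_variation(mean_weight, sd_weight)

# The "below 10%" rule of thumb described in the text.
looks_homogeneous = cv_weights < 10.0
print(f"mean={mean_weight:.2f}, sd={sd_weight:.3f}, cv={cv_weights:.2f}%, "
      f"homogeneous by the 10% rule: {looks_homogeneous}")
```

The calculation gives the same answer whether the weights belong to dogs, cats, or birds; nothing in it identifies what was weighed.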
*1.3% = (0.2 ÷ 15.1) × 100.

In certain forms of scientific classification, the available categories of taxonomy may not be satisfactory for decisions about homogeneity. Thus, the classical taxonomy of biology is adequate for distinguishing different four-legged animals as mice, dogs, cats, cows, horses, or elephants, and for distinguishing different species among these animals. The available taxonomy of human chronology and occupations is adequate for distinguishing newborn babies from professional basketball players. But classical taxonomy has not been equally satisfactory for distinguishing different species of such entities as bacteria or worms. A form of numerical taxonomy 6, 38 has now been developed to make such distinctions by calculations based on the measured values of different physical and chemical properties of these entities.

In the classification of clinical phenomena, many important problems of heterogeneity have been overlooked during the statistical attention to variance as an exclusive index of homogeneity. Among such problems are the inadequacies of the current taxonomy of disease, which cannot be remedied merely by calculations or by reshufflings of categories of coding for the diseases. 11 The prognostic inadequacy of diagnostic nomenclature also cannot be improved by statistical tactics alone, and the improvements will require careful consideration of important variables that have been omitted from current classifications. 11 These improvements will not occur, however, if the intellectual problems of medical taxonomy remain neglected, and if Gaussian variance continues to be regarded not only as a description of a group, but also as a primary index of homogeneity.

D. The idea of continuous variables. Perhaps the most intellectually pernicious current residue of Gaussian statistics is the abstract mathematical concept of continuous variables. One of the basic tenets of the mathematical expression for the Gaussian curve and of the statistical reasoning based on this expression is that the variables are continuous. A variable is regarded as continuous if it can take on additional values that lie within any defined interval of values, no matter how small the interval becomes. For example, serum sodium can be regarded as a continuous variable. If we could measure it as finely as 136.75 or 136.76 meq./L., it can still assume the values of 136.752 or 136.753 within that interval. If our measuring device were precise enough to identify the latter two values, serum sodium could still be 136.7528 or 136.7529, and so on.

All the mathematical concepts of analytic geometry and the calculus depend on continuous variables, as do most of the concepts of mathematical statistics. As soon as statistics leaves its theoretical origins and enters the world of reality, however, a new symbol is necessary to indicate that the real world does not permit such niceties in measurement. This symbol is the Σ sign that is the traditional emblem of statistical activities. If the measurements used in statistics were minute enough to be truly continuous, we would not need the Σ sign. To calculate the sum of values for a continuous variable, x, that had different frequencies at each value, we would say that x had a "density function," f(x), which represented the frequency of x at each point as it "moves" continuously along its curve. The sum of the values of x would then be expressed as ∫ x f(x) dx, using the integral sign (∫) for sums of continuous variables.
In both biologic and statistical reality, however, we cannot determine f(x) for the idealized continuous values of x. Instead, we note the actual metric values of x at a series of points, and we count the corresponding frequencies. We express those metric values as x_i, the frequencies as f_i, and the sum of the values for x as Σ f_i x_i. Furthermore, when we draw a graph to show the pattern of these frequencies, we do not draw a continuous curve. Instead, we draw a "frequency polygon" that connects the observed points with straight lines. Alternatively, we do not draw a line at all, and we construct a bar-graph called a "histogram."
metric values of a continuous variable
would be merely pedantic were
it not for another crucial feature that differentiates
clinical biology
have been
from the statistical models proposed for it. Many im-
paradigm that depends on
continuous metric variables will frequently
produce biologically peculiar the paradigm
is
results
when
applied to variables that
are either metrically discrete or non-metric.
Some
of the peculiarities have occurred so
often that the classical
may be abandoned
mode
of calculation
in favor of
an alterna-
expression. Thus, rather than saying
tive
This distinction between the ideal and real
definitely absent.
as
x
to
Alternatively,
rheumatic fever. An existential scale can be converted to ordinal rankings with such gradations as definitely present, probably present, uncertain, probably absent, and
that the average
mean
American family has
we would
children,
2.3
avoid calculating a
that creates the strange 0.3 child,
and we would state, instead, that American families have a median of 2 children. When a variable
is
expressed nominally, neither a
portant clinical variables arc discrete rather
mean nor
a
median can be used
They
culating a
summary
that
than
continuous.
specific
categorical
numerical
or
are
terms
verbal.
expressed
may be
that
Some
in
of
these
of the values,
for cal-
and the
would be stated as a mode or proporwe might note that chest pain the most common (i.e., the mode) pre-
result tion.
Thus,
discrete variables are also metric, in that
is
they can be expressed on a ranked scale with equal intervals between adjacent
senting complaint among patients in a coronary care unit, or that chest pain appeared in 76% of such patients.
ranks. Illustrations of discrete metric vari-
number
ables are
of children or
number
In
of
many
other instances, however, a suit-
previous myocardial infarctions.
able alternative expression has not yet been
Many other important clinical variables, however, are not even metric. They are expressed as discrete categories in "scales"
developed, and ad hoc modifications are still being proposed for the anomalies pro-
of values that are either ordinal, nominal, or existential.
values
as
An
ordinal scale contains such
none, mild, moderate, and ex-
duced by an unsatisfactory statistical paradigm. For example, to predict survival from the
many
variables that describe a patient's
initial state,
modern
the main paradigm offered by
treme for the variable, severity of chest pain, or such values as 0, 1+, 2+, 3+, and 4+
tiple linear regression.
for the variable, briskness of patellar re-
ing, this
Although an ordinal scale is demarcated into ranked values, the interval between any two adjacent ranks is not measurably equal. A nominal scale contains ranked values such as male and female
have been described elsewhere. 25 I shall here cite only two of the statistical dif-
flex.
fc
die
me patio
variable, it,
sex;
and other
or doctor,
lawyer,
for the variable, occu-
\n existential scale contains such
unranL values as yes and no, or present and abst for the variable, presence of chest pain, or for the variable, presence of
statistics is a
technique called mul-
For
technique has
clinical reason-
many
defects that
Since the regression technique depends on numerical data, it cannot accommodate existential or nominal variables, which are not expressed in numerical values. Accordingly, the ad hoc modification ficulties.
is
to assign arbitrary
for
these
variables,
numbers so
that
as "values"
absent
may
become and present, 1; or male may become 2 and female, 3. An additional prob-
On
lem of the multiple regression procedure is that its array of numerical co-efficients,
when applied
to the data for a particular
may sometimes
patient,
yield a negative
value for the predicted survival. modification for this anomaly
The ad hoc to apply a
is
makes the
"logistic" correction that
235
exorcizing the ghost of Gauss and the curse of Kelvin
regres-
sion procedure always yield a positive value.
As long as the tactics are otherwise suca gerrymandered statistical procedure need not create any major biologic difficulties. The main scientific hazard of the continuous-variable paradigm occurs when a devotion to such variables becomes cessful,
The curse
2.
Kelvin
of
About 90 years ago, a time when the work of Charles Darwin was drastically altering some fundamental paradigms of
human
biology,
when
the
concepts
of
Rudolf Virchow had led to the new paradigms of medical histopathology, and when Francis Galton was performing the magnificent analyses that would culminate in the development of biometry as a distinctive
new
intellectual discipline, a recurrent
exhortation was being offered to physicists.
The exact words of Lord Kelvin's theme, as Kuhn 26 has noted, appear in diverse
"When
the basis for rejecting or disdaining data
forms, but the basic sentiment
that are discrete, rather than continuous.
you cannot express it in numbers, your knowledge is meagre and unsatisfactory." This theme, which was later repeated by many other eminent researchers, including Galton, has become one of the paradigms
Such attitudes can
from either
arise
in-
tellectual inertia, intellectual prejudice, or
both. In the customary educational back-
ground of a
the contemplated
statistician,
are usually continuous, so that
variables
he becomes most comfortable intellectually when dealing with continuous data. To avoid the "discomfort" of discrete data, the statistician
may then
sultees
eliminate discrete variables in
to
advise his clinical con-
favor of continuous metric data.
With these
substitutions of variables, the research
may
become altered from its true easily measured continuous
objectives.
The
variables
may
and numerianswer the wrong
yield statistically comfortable cally precise results that
questions.
The
intellectual
that
inertia
leads
to
of
modern
"natural
biologic
would
persistently dis-
card
or fail to analyze important data merely for the convenience of statistical calculations.
The
attitude that encourages
a displacement of discrete data
by continu-
ous variables has required intellectual rein-
forcement not just from
statistical ideology,
but from the paradigms of science
The prejudice against
discrete
been incorporated into
scientific
for almost a century,
and
the doctrine of William
knighted as Baron Kelvin.
is
itself.
data has thinking
epitomized in
Thompson,
later
exhortation
Kelvin's
entirely appropriate.
struments to study similar phenomena in
have been sustained for so long a time, however, without an additional source of scientist
science,"
During that era, a burgeoning technology had begun to produce many new instruments with which to measure physical and chemical phenomena. Scientists were avidly working to create those instruments and to apply them in measurements. Kelvin's exhortation was also promptly accepted and implemented by biologists who could apply the new in-
seemed
the
No
science.
For physicists, chemists, astronomers, and other workers in the contemporary
these inappropriate substitutions could not
support.
is:
fluids,
excreta,
organisms.
tissues,
The
and
cells
exhortation
of
was
also happily received by biometricians who were seeking numerical measurements with which to develop the principles and data
of their infant discipline.
An
important feature of Kelvin's doctrine
demanded numbers, but did not among three different ways in which numbers could be obtained. The two was that
it
discriminate
numbers are mensuraand enumeration. In mensuration, the number is observed as a dimension on an principal sources of tion
established continuous scale of values.
enumeration, the number
is
obtained
counting a group of identified
enti'
T
In mensuration, the number is "continuous," and in enumeration, "discrete." A third way of getting numbers is to divide one of the principal types of numbers by another. Thus, a mean is usually created by dividing a mensuration by an enumeration; and a proportion, by dividing one enumeration by another enumeration. For example, we can use numbers to say that a particular man weighs 70.21 kg., has 4 children, and belongs to a club of which 20% of the members are board-eligible doctors. Each of these citations contains a numerical statement, but the first number is a dimensional mensuration, the second is a counted enumeration, and the third is a proportionate ratio or percentage.
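The three ways of obtaining numbers can be illustrated with a brief Python sketch. Apart from the man's weight of 70.21 kg., his 4 children, and the 20% figure, which come from the example above, every value here (the other weights, the children's labels, the club roster) is a hypothetical assumption made only to show how a mean arises from a mensuration divided by an enumeration, and a proportion from one enumeration divided by another.

```python
# Mensuration: numbers read as dimensions on a continuous scale (kg.).
# The first weight is the one quoted in the text; the others are hypothetical.
weights_kg = [70.21, 68.40, 81.95]

# Enumeration: a number obtained by counting verbally identified entities.
children = ["first child", "second child", "third child", "fourth child"]
number_of_children = len(children)

# A mean: a sum of mensurations divided by an enumeration.
mean_weight_kg = sum(weights_kg) / len(weights_kg)

# A proportion: one enumeration divided by another enumeration.
# Hypothetical club roster in which 1 member of 5 is a board-eligible doctor.
club_members = ["doctor", "lawyer", "teacher", "engineer", "farmer"]
doctors = sum(1 for member in club_members if member == "doctor")
proportion_doctors = doctors / len(club_members)

print(f"mean weight: {mean_weight_kg:.2f} kg.")
print(f"children: {number_of_children}")
print(f"doctors in club: {proportion_doctors:.0%}")
```

The count and the proportion both begin with a verbal description ("child," "doctor"); only the weight begins as a number.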
All three of these numbers provide quantification in whatever setting they are used, but the numbers arise from distinctively different basic forms of description. In one form of citation, the basic observation is itself a number on a dimensional scale. In the other form of citation, the basic observation is a verbal description whose "units" are nouns, adjectives, verbs, or adverbs. We use such words to stipulate that an entity is a child or a board-eligible doctor, and we then quantify that stipulation by counting a group of verbally described entities, or by expressing the counted group in proportionate percentages of some other counted group.

An individual object can therefore be described in dimensions or in words. A group of objects can be quantified by calculating a mean for the dimensions, or by determining a count or ratio for what is represented by the descriptive words. Statistical data could thus be achieved from either a route of dimensional mensurations and calculated means, or a route of verbal descriptions, counted enumerations, and fractional proportions.

In the general interpretation given to the Kelvin doctrine, however, the latter route was rejected. A populational quantification by counts and ratios of verbal data would have provided numbers that were just as "numerical" as dimensional means and standard deviations, but the desideratum chosen for "science" was to achieve a dimension at the basic level of observation. The concept of quantification was limited to the concept of metrification.

With this constricted interpretation of Kelvin's "numbers," the rush toward dimensional measurement began and has continued. During the stampede, many clinical biologists and biometricians apparently forgot that Darwin's and Virchow's major scientific advances had not required the use of numbers, and that Galton's biometry was based mainly on describing phenomena and counting them, not on dimensional mensuration. Furthermore, during Kelvin's lifetime, bacteriology and radiography* were introduced as new disciplines of observation, producing data that were scientifically fundamental, precise and verbal. Although these triumphs of verbal description might have been expected to reduce the alacrity with which mensurational pursuits became a prime goal of biomedical research, the allure of measured dimensions was too great. Technology was available to provide the measurements, and Gaussian-type "continuous" statistics were available to manipulate the numbers. The current interpretation of Kelvin's doctrine has now become so well established as a scientific paradigm that its fundamental basis is almost never questioned, despite the major problems and anomalies it has caused.

*The arrival of x-rays was greeted with surprise and shock by many members of the scientific community. Lord Kelvin at first regarded them as an elaborate hoax. 31, 42

For clinicians and for social scientists, the consequences of Kelvin's exhortation have been more of a curse than a comfort. The variables that are most important to practicing clinicians and sociologists are usually expressed in nominal, existential, or (at best) ordinal form and cannot readily be converted to dimensional metric measurements. There is no technologic device or natural scale of dimensions 22, 40 with which a physician can metrically measure such entities as chest pain, abdominal cramps, back pain, dyspepsia, dyspnea, dysuria, or
On
other discomforts encountered in the daily
work of
technologic
device
natural
or
is
no
scale
of
There
medicine.
clinical
dimensions that can be used by a psychiato metrically
trist
measure
love, fear, an-
or by a sociolmeasure the diverse at-
sured what it purported to measure? the previous verbal descriptions were
adequate so that intelligence and anxiety themselves ambiguously specified, with what could the investigator correlate his scale to validate it?
ogist to metrically
current
and exchanges that occur
interactions.
To
numbers
approval by
for
achieve
scientific
creativity?
col-
with variables that could be cited in metric effect of these
replacements has
been wryly summarized by Frank Knight in the remark, "If you cannot measure, measure anyhow." 24
The new metric
The
to use laboratory technology,
first
when
ticular scale for "anxiety" or for "personality
inventory"
better or
more accurate than
The
of validating
difficulty
scales
phenomena has been
for
a major
scientific handicap for workers in psychology and sociology. Some of the contrived scales have been highly effective (or at least widely accepted), but most of them have never received a primary validation,
measome other
pulmonary function might replace a verbal account of the severity of dyspnea.
many new
The second method was
dated)
test of
to contrive a numerical values that would provide a seemingly metric expression for a "scale" of
non-physical quality, such as intelligence,
The
is
another scale?
pos-
cited in verbal descriptions. Thus, a
had no counterpart
assessed
is
framework of multiple-choice answers to a question, do we neglect the importance of an ability in logical synthesis that might be demonstrated only with an "essay" type of response? How do we decide that one par-
non-physical
dimensional appraisal of entities that might correspond to those previously
that
intelligence
was
sible, for
vital capacity or
When
exclusively in the analytic
variables could be ob-
tained in two different ways.
surement of
place too
mensurational
would have to abandon his precision in the use of words, and would have to replace the discrete data of verbal description
The
For example, do
"intelligence"
of
tests
high a premium on vocabulary and arithmetic, at the expense of imagination and
in social
leagues, a clinical or sociologic investigator
units.
If
in-
were
xiety, hostility, or depression,
titudes
237
exorcizing llw ghost of Gauss and the curse of Kelvin
in technologic tests.
are "validated" only in
scales
correlation with an accepted (but unvali-
old scale, and many others have been non-productive exercises of the "quantophrenia" described by Sorokin. 39 (The passion for contrived but unvalidated measurements was tellingly satirized by R. E. Dickinson, 9
who
psychology and sociology. Since laboratory technology offered almost no dimensional variables that corresponded to the non-physical phenomena
"The unit proposed is the milli-helen, the quantity of beauty required to launch exactly one
of behavioral or sociologic research, psy-
ship.")
A.
'metrics' of
and
chologists
sociologists
new numerical
"scales"
began for
Even
to create
"measuring"
suggested a scale for
assessing feminine beauty:
if
all
metrification
the problems of contrived
had been
such entities as intelligence, anxiety, and family inter-relations. By assigning arbitrary
scientific progress in
grades to the answers received in question-
still
naires
or
interviews,
and by combining
and clinical psychology would be handicapped by almost insurmount-
able difficulties in selection of the population to
could express the results in ranked numbers. 3 5 "• 41 43
ceding
-
-
>
The main difficulty with these scales has been their validation. How could the investigator determine that a scale actually mea-
however,
sociology
these grades in diverse manners, the investigators
solved,
the quantification of
be measured. As indicated in prepapers 15
17 '
'
18
of
this
series,
the
an investigated population may be destroyed if it consists of a non-randc validity of
1
or chronologically diffuse cross-sectioif
it
is
examined
in a
backward
r
direction instead of being pursued forward as a cohort. Although clinicians can constantly observe cohorts of patients who are offered therapy and followed thereafter, sociologists do not ordinarily "treat" their subjects, and psychologists (or psychiatrists) have not studied therapy with the same fervor that has been devoted to the etiology of psychic ailments. Consequently, many sociologic populations have consisted of "cross-sections" that could not be chosen randomly, and psychologic populations have often consisted of psychically ill people who were followed in a backward direction toward etiology and pathogenesis, the "psychodynamics." When psychologists have attempted to perform forward studies, the members of the cohort have usually consisted of "healthy" volunteers, "chunk groups," or other "rancid" samples. 15 These epidemiologic difficulties have outweighed any major advances obtained from the creation of metric scales by psychologists and sociologists. The full scientific potential of the scales cannot be exploited until satisfactory methods have been developed for getting suitable cohorts and other appropriate populations for investigation.

Problems in measurement

B. The 'metrics' of clinical medicine. In ordinary medical activities, a clinical investigator is spared many of the problems that his psycho-social colleagues encounter in non-physical phenomena and in cohort availability. The symptoms and signs observed as clinical data can generally be related to "physical" entities, and clinicians constantly treat patients who become cohorts for studies of the course of disease and therapy. With satisfactory techniques of classification and investigation, the treated cohorts could be analyzed in a manner that would remove or adjust the bias accrued during their non-random collection. 17, 18 Given the opportunity to study the clinical and therapeutic cohorts of "organic" disease, physicians could have been expected to make major intellectual improvements in the scientific assessment of therapy. The major therapeutic advances of the past few decades, however, have been technologic, not intellectual. The advances have occurred in the surgical tactics and pharmaceutical agents of treatment, not in the scientific design and analysis of therapy. Clinicians have finally learned that therapy must be "controlled" and quantified, but have not yet developed satisfactory methods for choosing suitable "controls" and for quantifying the appropriate variables. Clinicians have begun to accept and utilize the statistical concept that planned therapeutic experiments require a randomized allocation of treatment, but equal attention has not been given to the scientific concept that the treated patients and their clinical responses must be adequately identified. The consequences of the unsatisfactory methods of investigation are evident in the therapeutic controversies that exist today at every level of treatment from such minor ailments as the sprained back and common cold to major ailments such as diabetes mellitus, myocardial infarction, fracture of the hip, and peptic ulcer, to catastrophic ailments such as cancer. The abundance of statistical data and even randomized trials has produced many numerical "confidence intervals," but few therapeutic numbers about which thoughtful physicians could feel clinically confident.

Many of the problems have been due to the clinician's abdication of his own responsibilities by delegating the design of research to statisticians whose abstract "models" have been inadequate for the realities of clinical biology. 14 But an equally important source of problems has been the clinician's failure to emulate the metric creativity of his psycho-social colleagues. A discipline of clinimetrics has not been established to correspond to psychometrics and sociometry.

Clinicians have become metrically oriented, but not while observing patients. The metric data have come from paraclinical tests of patients' blood, urine, and other substances observed in the laboratory. If clinicians had respected their clinical
observations, a collection of suitable "scales" would have been created, tested, and validated for identifying the existence or assessing the severity of such phenomena as pain, digestive distress, dyspnea, and all the other "physical" symptoms encountered in medical practice. The scales, by now, could have been standardized in direct clinical activities, and would be available today for therapeutic surveys and trials. Instead, however, clinicians have evaded the scientific challenges of their own observational data.

Rather than confronting and solving the problems of clinical data, clinicians have substituted the available plethora of metric data from paraclinical technology. An array of dimensional numbers far beyond the fondest expectations of any Kelvinistic dream has been provided by the technologic ability to measure the constituents of diverse fluids; to assess the physiologic magnitudes of structural volume, mechanical pressure, electrical conduction, and muscular contraction; and to count the uptake of radioactive substances. All of these phenomena are generally observed, however, not in the patient, but in substances derived from the patient and examined by methods other than clinical skills. To achieve the phantasmagoria of metric data in modern medicine, clinicians have shifted their medium of observation from the patient in the bed to the substance in the lab.

This deliberate dehumanization of the observed variables was inspired by the quest for "better science" and it has produced results that are often scientifically excellent. Nevertheless, the absence of suitable attention to patients has produced major defects in both therapeutic science and therapeutic care. Some of the most crucial aspects of clinical therapy can be discerned only in the patient and cannot be appropriately measured with paraclinical data. Neither the initial state nor the subsequent state of a treated patient is properly identified if the analyzed variables do not include such clinical features as iatrotropy, co-morbidity, and the cluster, sequence, and chronometry of symptoms. 11 A laboratory test of a patient's performance in an isolated exercise does not indicate his dyspnea or angina pectoris in the circumstances of daily life. The assessment of roentgenographic tumor size, white blood count, and survival time does not indicate whether a patient with cancer is alive and vibrant, or miserable and vegetating. The assessment of electrocardiographic changes, angiographic anatomy, or digitalis intake does not indicate a patient's cardiac function.

In these examples and in numerous other instances of the data needed for evaluating modern therapy, the necessary verbal variables of clinical observation have been replaced by numerical data from paraclinical variables that provide the right measurement for the wrong entity. The consequences of this "substitution game" have been inadequate clinical science, because neither the pre-therapeutic nor post-therapeutic state of the patient is sufficiently specified; and dehumanized clinical care, because the patient himself is deliberately ignored as an important source of information for analysis.

The basic reason for avoiding the verbal descriptions of clinical data has been that they are subjective, "soft," and non-reproducible. Much of this difficulty could be promptly removed, however, if clinicians gave serious attention to improving the methods of observation and classification for the data. By shunning the intellectual seduction of inadequate metric substitution, by establishing suitable taxonoric criteria 12 for interpretation of the "soft" data, by identifying the observational details necessary for the interpretations, and by analyzing and reducing the inconsistencies with which these details are observed, clinicians could preserve the important "soft" data while "hardening" their scientific quality. The various nominal, existential, or ordinal categories of classification for these data would then produce the necessary scales of clinical medicine.

Many such scales have been created in recent years. Some of them are used
for diagnostic (or "screening") identification; and others deal with grading the severity of an entity either for prognostic or therapeutic purposes. Among such scales are: the various ratings of mental and physical impairment issued by committees of the American Medical Association; 7 the "Apgar score" of the condition of newborn babies; 1 and the "Katz index" of independence in activities of daily living. 23 Other recent scales have dealt with the severity of such entities as tetanus, 34 asthma, 4 the neonatal respiratory distress syndrome, 20 osteoarthritis of the knee, 21 and alcoholism. 37 After many previous "statistical" investigations of clinical course and treatment, scales (or "criteria") have finally been proposed for such scientific fundamentals as the diagnosis of lupus erythematosus, 8 the severity of renal disease, 35 and the quality of survival in patients with breast cancer. 36

These and other new scales are all based on categorical arrangements of information that is primarily verbal. Although the proposed scales represent important scientific progress, the progress is scant and overdue. Almost none of the scales has been directly validated, either in construction or in application, because no satisfactory scientific strategies have been developed for the validations. For example, no penetrating analyses have been given to the issue of whether a composite scale is best created by "additive weights" or by "Boolean clusters." In addition, almost all the scales refer to the patient's condition at a single point in time. Satisfactory scales have not been established for grading transitions in severity from one state to another.

The scope of the scales is still greatly restricted. Many important clinical phenomena, including the assessment of what is really accomplished in the care of patients, 19 have not yet received appropriate investigation and scales for evaluation. A further significant problem is that many major therapeutic trials have been (or are now being) performed with the necessary scales either omitted or abjured. For example, in the celebrated UGDP study, 44 the conventional clinical scale for diabetic neuropathy was replaced by a metric measurement performed with a biothesiometer. In exchange for the statistical satisfaction of these measured dimensions, however, the investigators had to accept the scientific chagrin that later came when the biothesiometric baseline data could not be obtained for more than half the cohort, and when the measured results in the toes were found to be non-reproducible. 16

All of the necessary work in "clinimetrics" will require intensive attention and thought from clinicians who respect the basic scientific importance of the subject and who have the trained observational skills to study it. The work cannot be done by statisticians or other personnel who lack familiarity with the clinical nuances of both the observational procedures and the data. The statistical process of analyzing and validating the results can be greatly aided, however, by statisticians who are enlightened enough to welcome the intellectual challenges of categorical data, perceptive enough to encourage the clinical investigators to grapple with the problems, and wise enough to avoid misleading the investigators into the specious allure of metric substitutions.

In using verbal data to yield discrete categories that can be enumerated and quantified with proportionate ratios, clinicians can fulfill the larger meaning of Kelvin's doctrine about the need for numerical expressions. But a sublime paradox of modern technology can enable the demand for numbers to be satisfied even at the fundamental level of description. At about the same time that Kelvin's exhortation was achieving wide popularity, Herman Hollerith introduced a punch card system with which data could be coded for mechanical processing. First used for the United States census of 1890, the coding system has now evolved into the familiar Hollerith (IBM) cards that are used today for managing data with a digital computer. The tactics of expressing discrete data with numerical "co-ordinates" and coding digits have been described elsewhere, 13 and will
not be further discussed here. The main point to be noted now is that such verbal data as "substernal chest pain, provoked by exertion, relieved by rest" can be faithfully and precisely represented when a coding number like 137406 is entered appropriately into the columns of a Hollerith card.

The sublime paradox of computer automation, therefore, is that its system for coding data offers clinicians the opportunity to re-humanize clinical science instead of using the inanimate technology to potentiate further dehumanization. By restoring crucial attention and standardization to clinical data that have hitherto been neglected because they could not be expressed numerically, the Hollerith coding system can permit the patient rather than the laboratory to gain supremacy as the center of attention in clinical science. As a group of coding digits, the number 137406 has no more dimensional connotation than a telephone number or a zip code, but it provides a numerical expression for a precise verbal description, it has six meaningful digits (in contrast to the three "significant figures" that usually suffice for laboratory measurements), and it can serve as a prophylactic agent for avoiding the transmogrified data of "metric madness." Since the remedy for this chronic biostatistical malady is now available, the only real problems that remain are the entrenched intellectual inertia and constricted paradigmatic allegiance that prevent clinical investigators from recognizing the opportunity, grasping its challenge, and creating the necessary changes.

• • •

The curse of Kelvin can be exorcised, therefore, by recognizing the paramount importance of clinical observations, by augmenting the precision with which the observations are made and verbally described, by developing operational criteria to convert the descriptions into precise categories, and by using computer coding techniques to express those categories in numerical digits. The ghost of Gauss can be eliminated by spiritual transfer to a new home in statistics, the Gaussian curve, and the term normal can then be liberated and returned to its primordial medical meaning. This transfer, however, will not remove the poltergeist of neo-Gaussian variance, an ambiguity that has confounded the fundamental differences between an intellectual process of classification and an observational process of measurement. The scientific distinctions that separate homogeneity and homogeneity of variance will be discussed later in this series.

The computer's capacity for calculation also offers the opportunity to remove another neo-Gaussian eidolon: the appraisal of "significance" according to abstract inferences based on observed variance. An entirely new approach to statistical decisions has been made possible by the computer's ability to perform randomizations of data in a Monte Carlo procedure. The new statistical paradigms provided by Monte Carlo concepts will also be reserved for discussion at a later date.

References

1. Apgar, V.: Proposal for new method of evaluation of newborn infant, Anesthes. Analg. 32:260-267, 1953. (Further details reported in J. A. M. A. 168:1985, 1958.)
2. Barnett, R. N.: Clinical laboratory statistics, Boston, 1971, Little, Brown & Company.
3. Buros, O. K., editor: Personality tests and reviews, Highland Park, N. J., 1970, The Gryphon Press.
4. Chai, H., Purcell, K., Brady, K., and Falliers, C. J.: Therapeutic and investigational evaluation of asthmatic children, J. Allergy 41:23-36, 1968.
5. Churchman, C. W., and Ratoosh, P., editors: Measurement: Definitions and theories, New York, 1959, John Wiley & Sons, Inc.
6. Cole, A. J., editor: Numerical taxonomy. Proceedings of a colloquium held in the University of St. Andrew's, September, 1968, London, 1969, Academic Press, Inc.
7. Committee on Rating of Mental and Physical Impairment (American Medical Association). The committee has issued 13 publications dealing with different body systems. A list appears in the most recent publication in: J. A. M. A. 213:1314-1324, 1970.
8. Diagnostic and Therapeutic Criteria Committee of the American Rheumatism Association
Section of the Arthritis Foundation. Preliminary criteria for the classification of systemic lupus erythematosus, Bull. Rheum. Dis. 21:643-648, 1971.
9. Dickinson, R. E.: A letter in The Observer, Feb. 23, 1958. Quoted in: Atkins, H. J. B.: The three pillars of clinical research, Br. Med. J. 2:1547-1553, 1958.
10. Elveback, L. R., Guillier, C. L., and Keating, F. R., Jr.: Health, normality, and the ghost of Gauss, J. A. M. A. 211:69-75, 1970.
11. Feinstein, A. R.: Clinical judgment, Baltimore, 1967, The Williams & Wilkins Company.
12. Feinstein, A. R.: Taxonorics. I. Formulation of criteria, Arch. Intern. Med. 126:679-693, 1970.
13. Feinstein, A. R.: Taxonorics. II. Formats and coding systems for data processing, Arch. Intern. Med. 126:1053-1067, 1970.
14. Feinstein, A. R.: Clinical biostatistics. VI. Statistical "malpractice" and the responsibility of a consultant, Clin. Pharmacol. Ther. 11:898-914, 1970.
15. Feinstein, A. R.: Clinical biostatistics. VII. The rancid sample, the tilted target, and the medical poll-bearer, Clin. Pharmacol. Ther. 12:134-150, 1971.
16. Feinstein, A. R.: Clinical biostatistics. VIII. An analytic appraisal of the University Group Diabetes Program (UGDP) study, Clin. Pharmacol. Ther. 12:167-191, 1971.
17. Feinstein, A. R.: Clinical biostatistics. X. Sources of 'transition bias' in cohort statistics, Clin. Pharmacol. Ther. 12:704-721, 1971.
18. Feinstein, A. R.: Clinical biostatistics. XI. Sources of 'chronology bias' in cohort statistics, Clin. Pharmacol. Ther. 12:864-879, 1971.
19. Gilson, J. S.: Urgently needed: A way to measure how much we really help patients, Resident Staff Physician 17:139-145, 1971.
20. Gomez, P. C. W., Noakes, M., and Barrie, H.: A prognostic score for use in the respiratory-distress syndrome, Lancet 1:808-810, 1969.
21. Gresham, G. E.: A method for the evaluation and classification of symptomatic and functional status in osteoarthritis of the knees, Arthritis Rheum. 13:320, 1970.
22. Hamilton, M.: Measurement for what? Proc. Roy. Soc. Med. 63:1315-1319, 1970.
23. Katz, S., Ford, A. B., Moskowitz, R. W., Jackson, B. A., and Jaffe, M. W.: Studies of illness in the aged. The index of ADL: A standardized measure of biological and psychosocial function, J. A. M. A. 185:914-919, 1963.
24. Knight, F.: Quoted in footnote, p. 34, ref. 26.
25. Koss, N., and Feinstein, A. R.: Computer-aided prognosis: II. Development of a prognostic algorithm, Arch. Intern. Med. 127:448-459, 1971.
26. Kuhn, T. S.: The function of measurement in modern physical science, in Woolf, H., editor: Quantification, Indianapolis, 1961, Bobbs-Merrill Co., pp. 31-63.
27. Kuhn, T. S.: The structure of scientific revolutions, ed. 2, Chicago, 1970, University of Chicago Press, p. viii.
28. Ibid. p. ix.
29. Ibid. p. 5.
30. Ibid. p. 6.
31. Ibid. p. 59.
32. Ibid. p. 78.
33. Nunnally, J.: Tests and measurements: Assessment and prediction, New York, 1959, McGraw-Hill Book Co., Inc.
34. Phillips, L. A.: A classification of tetanus, Lancet 1:1216-1217, 1967.
35. Report of the Council on the Kidney in Cardiovascular Disease (American Heart Association). Criteria for the evaluation of the severity of established renal disease, Circulation 44:306-307, 1971.
36. Schottenfeld, D., and Robbins, G. F.: Quality of survival among patients who have had radical mastectomy, Cancer 26:650-655, 1970.
37. Shelton, J., Hollister, L. E., and Gocka, E. F.: Quantifying alcoholic impairment, Mod. Med. 37:188-189, 1969.
38. Sokal, R. R., and Sneath, P. H. A.: Principles of numerical taxonomy, San Francisco, 1963, W. H. Freeman & Co.
39. Sorokin, P.: Fads and foibles in modern sociology, Chicago, 1956, Henry Regnery.
40. Stevens, S. S.: Mathematics, measurement, and psychophysics, in Stevens, S. S., editor: Handbook of experimental psychology, New York, 1951, John Wiley & Sons, Inc., pp. 1-49.
41. Stouffer, S. A., Guttman, L., Suchman, E. A., Lazarsfeld, P. F., Star, S. A., and Clausen, J. A.: Measurement and prediction, Princeton, N. J., 1950, Princeton University Press.
42. Thompson, S. P.: The Life of Sir William Thomson Baron Kelvin of Largs, London, 1910, The Macmillan Company. (Cited on p. 59 of ref. 27.)
43. Torgerson, W. S.: Theory and methods of scaling, New York, 1958, John Wiley & Sons, Inc.
44. University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. Part I: Design, methods, and baseline characteristics. Part II: Mortality results, Diabetes 19:(Suppl. 2) 747-830, 1970.
45. Walker, H. M.: Studies in the history of the statistical method, Baltimore, 1929, The Williams & Wilkins Company.
46. Zilversmit, D. B.: The impact of rigid definitions on scientific thinking, Perspect. Biol. Med. 7:227-247, 1964.
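The essay's closing point, that a verbal clinical description can be carried by coding digits and then quantified by counts and proportions rather than by means, can be illustrated with a minimal sketch. The codebook below is entirely hypothetical: the field names, column widths, and digit assignments are invented for illustration and are not the coding system described in reference 13. They are merely arranged so that "substernal chest pain, provoked by exertion, relieved by rest" happens to yield the digits 137406 used as the example in the text.

```python
from collections import Counter

# Hypothetical codebook: each field occupies a fixed number of "columns".
# All digit assignments are invented for illustration only.
CODEBOOK = {
    "symptom":     {"chest pain": "13", "abdominal cramps": "21"},
    "location":    {"substernal": "7", "left lateral": "3"},
    "provoked_by": {"exertion": "4", "meals": "5", "nothing evident": "0"},
    "relieved_by": {"rest": "06", "antacid": "11", "nothing evident": "00"},
}
FIELDS = ["symptom", "location", "provoked_by", "relieved_by"]


def encode(description):
    """Turn a dict of verbal categories into a string of coding digits."""
    return "".join(CODEBOOK[field][description[field]] for field in FIELDS)


def decode(code):
    """Recover the verbal categories from a string of coding digits."""
    description, pos = {}, 0
    for field in FIELDS:
        # Every code within a field has the same width, so read that many digits.
        width = len(next(iter(CODEBOOK[field].values())))
        inverse = {digits: word for word, digits in CODEBOOK[field].items()}
        description[field] = inverse[code[pos:pos + width]]
        pos += width
    return description


def proportions(codes):
    """Quantify a group by counts and proportions instead of by a mean."""
    counts = Counter(codes)
    total = len(codes)
    return {code: (n, n / total) for code, n in counts.items()}
```

In this sketch the digits act only as labels: `decode` returns exactly the words that went into `encode`, and the only arithmetic applied to a group of coded patients is enumeration, which is the "route of verbal descriptions, counted enumerations, and fractional proportions" that the essay contrasts with dimensional means.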
CHAPTER 17

The derangements of the 'range of normal'*

According to certain contemporary standards, the basketball player Wilt Chamberlain is grossly abnormal. At more than seven feet, his height is at least five standard deviations away from the mean. Yet no one, except possibly a basketball opponent, has proposed that Chamberlain's normality be restored by amputating his legs.

This outrageous vignette helps illustrate the confusion that can arise from the current chaos of medical publications dealing with the "range of normal". Despite all the proposed new units of measurement and all the calculations spawned by computers, the concept of a "range" has not been suitably defined in either data or population, and the medical meaning of "normal" has been lost in the shuffle of statistics.

A. The meaning of 'normal'

In a series of masterly papers, E. A. Murphy has demonstrated the ambiguity and inconsistency with which the concept of "normal" is applied in modern medicine. Murphy has noted at least seven different meanings for the word normal. For most practical purposes, these meanings can be divided into two groups: the isolated and the correlated. In the isolated meanings, the idea of normal is univariate, emerging from the direct partition of an array of numbers that are the values of a single variable, such as height, age, or serum cholesterol. In the correlated meanings, the idea of normal is bivariate. The values for the single variable (height, age, or serum cholesterol) are regarded as normal or abnormal according to the way they relate to some other variable, such as state of health, genetic fitness, or prognostic expectations.

In the isolated approach, the zone of normality is delineated according to what is found to be conventional, customary, average, or habitual in the array of numerical values. With this approach, we might use such expressions as "he normally begins work at 9 am" or "families these days normally have two children". When the isolated approach receives rigorous quantification, the demarcation of normal becomes an act of pure statistics. After the array of numbers is assembled, a statistical principle is used to choose the boundaries of the groups that will be included or excluded as normal.

*Under the same name, this chapter originally appeared as "Clinical biostatistics. XXVII." In Clin. Pharmacol. Ther. 15:528, 1974.

In the correlated approach, the idea of normal is medically referred to some innocuous, harmless, or ideal situation of health. If a particular number implies the current state of ill health, it is called abnormal; if it implies the
of some future ailment, the number may be called a risk factor. For example, the current standards of normality in blood pressure were established
gross or microscopic examination of anatomic structures, not by dimensional measurements of chemical substances. The
correlated
clinician gets prime help in these diagnoses
univariate
manner. The decision that a diastolic blood pressure of more than 90 or
rhythmias or methotrexate in psoriasis, that were not discovered during the pre-marketing investigations? In this essay, I should like to note some of the fundamental difficulties and to discuss some of the biostatistical

*This chapter originally appeared as "Clinical biostatistics. XXVIII. The biostatistical problems of pharmaceutical surveillance." In Clin. Pharmacol. Ther. 16:110, 1974.

To conclude that an "adverse drug reaction" has occurred requires a careful dissection of the intricate mixture of constituents contained in each of these three entities.

1. The event. The events that are noted after use of a drug can be classified as desirable, negligible (or innocuous), or undesirable. Certain undesirable events can be intended or anticipated, such as the pain associated with an injection. Other undesirable events, which are not anticipated, are the contenders for receiving the designation of "adverse drug reaction." Before any other planning can begin, a standardized, consistent mechanism must
Problems
be established
measurement
in
For
designating the desir-
observed events and for ability deciding which ones are the adverse-reaction candidates. The current absence ol such a mechanism, as noted later, is a the
of
any clinical or aimed at designing an
obstacle
Fundamental
biostatistical efforts
to
effective process of surveillan<
over-
simplified aspect ol the surveillance process
the idea that the events occurring alter
is
the
use
a
i
drug can
regularlx.
phenomena drug
that
OCCUT while or alter the
taken can produce effects whose
is
— max-
— co-morbidity
to the associated diseases that
addition
in
with
patient
myocardial
Demography tures
refers
age.
as
to
underlying
These-
distinguish
who
son
the
another drug or the supei imposition ol a separate
the
takes
ol
be
included
exposure
arrhythmia
procedures or non-
pharmaceutical therapy that may be lowed by untoward consequences.
To evaluate
the concomitant
fol-
phenomena
and therapeutic reasoning. As noted later, no standardized, consistent mechanisms exist for performing these judgments.
3. The person. The human use of drugs also does not take place in standardized situations. Even if people have the same diagnosed disease, they may differ in their basic clinical states. Thus, two patients may both have acute myocardial infarction, but one may have chest pain alone whereas the other may have chest pain, respiratory distress, and a major cardiac arrhythmia. Even if two people have identical diag-
heart
such personal
fea-
occupation,
and help per-
the-
They
drug.
max' also
an
receiving
congestive
of
cause
possible
a
worsening of
a
is
an
of
when
in a patient
treatment
the
lor
sources
Thus,
reaction.
must
given, that
is
possible
as
failure,
the
of
the
initial
congestive heart failure rather than a action
max cause an adverse reaction, intrijudgments are needed in diagnostic
that
cate
has
also
characteristics
arrhythmia develops digitalis
that
to diagnostic
a
alone
be prognostic harbingers of future events,
was not present when the "index" drug was begun, and the patient's ailment
who
baseline state of
undesirable
drugs,
disease;
family status.
nomena
use
re-
are present
infarction
sex.
race-,
occurring after the drug
the
demog-
people
principal
from a patient
different
is
the
to
cause must be differentiated from the effects ascribed to the drug. Among those pheare
and
the
differentiate
ceiving the same drug. Co-morbidity refers
be asso
Man) concomitant
ciated with that drug.
variables
raphy
hypertension, diabetes mellitus, and gout.
e.
The drug. Another frequently
2.
tant
to
the
noses, such as acute myocardial infarction, and identical clinical states, such as chest pain and respiratory distress, the drugs may be given for different clinical indications. One patient may be given the drug for chest pain; the other patient may receive the same drug for respiratory distress. In addition to diseases, clinical states, and clinical indications, two other impor-
re-
digitalis.
Similarly, when dementia develops in an elderly patient receiving digitalis for cardiac disease, a possible cause of the dementia is cerebral arterial insufficiency rather than digitalis toxicity.

B. The statistical rate

After these different aspects of the data have been disentangled, we can contemplate a statistical rate as

(ADVERSE REACTIONS) / (USE OF DRUG)

The numerator of this rate must be carefully determined since it represents the identification, association, and causal evaluation of the events just cited. The denominator of the rate must also be carefully determined since it represents the association of the numerator events with the risk taken in the exposed patients. Another reason for precise identification of the denominator is to permit valid extrapolation of the results. Unless we know the particular types of patients in whom adverse reactions occurred, we cannot draw clinically useful conclusions about what to do in future prescription of the drug.

When we consider the process of surveillance, however, we are not talking about either the numerator or denominator of this rate. Surveillance is what happens in the virgule (the slash mark) that separates the numerator and denominator. The virgule represents the time interval, between use of a drug and occurrence of subsequent events, during which we can impose the process of observing, detecting, evaluating, and recording those events.

Many discussions of surveillance are devoted mainly to these "virgule activities," with emphasis on the various investigators, administrative mechanisms, and computer systems used for the acquisition and storage of data. These activities in the process of surveillance constitute its technology: the arrangements of medical setting, personnel, and data-gathering tactics that allow the desired information to be acquired. The more basic scientific strategies of surveillance, however, depend on decisions about the numerators and denominators: what kind of data to gather, what kind of patients to observe, and what principles to use for evaluation.

C. The technology of surveillance

A major accomplishment in pharmaceutical research of the past decade has been the development of "surveillance systems" at several different medical institutions. These systems, which are intended to provide a continuous scrutiny of the use and effects of pharmaceutical preparations, have obtained quantitative data that can be tabulated and interpreted. In addition, of course, the results provide information that was hitherto unavailable about the frequency of institutional prescription of different drugs and the frequency of associated adverse reactions. In some instances, a surveillance system has also been employed as an analog of Phase III clinical trials;
a mechanism for noting demographic factors that may influence the action of drugs.
used
performing
for
to
have
surveillance
the
warnings of possible toxicity and appraisals of possible efficacy for a variety of different drugs and drug-interactions.

Despite some major shortcomings that will be noted later, these systems represent an enormous achievement in the technology of pharmaceutical surveillance. One valuable demonstration has been the contributions that can be made by nurses, pharmacists, 17 and other non-physician personnel in collecting data. Another noteworthy feature has been the documented importance of making planned, continuous observations 28 rather than relying on ad hoc, sporadic reports. A third substantial contribution has been the creation of suitable data formats for coding and storing
and
much
the development of computerized or other analytic methods
when
obtained served
to
during
necessary information,
the
of
to

Regardless of the ultimate role of these institution-based surveillance systems, they have led to the development of useful technologic tactics that will be invaluable background for any future work in surveillance activities.

An institution-based surveillance system, however, is institution-based. The investigators can observe only the kinds of patients who are referred or self-referred to the institution and can study only the kinds of pharmaceutical preparations that are authorized or ordered at the institution. For the surveillance systems that depend on a population of hospital in-patients, the duration of observation usually lasts only as long as the patient is in the hospital. These drawbacks would not affect a system's ability to draw conclusions about what is happening in the institutional setting, but they would create serious limitations in answering more general questions about the surveillance of a wide variety of people using a wide variety of drugs for a wide variety of durations.

The fundamental problem of contemporary surveillance, however, is not the restricted scope of the patients, drugs, and durations observed in institution-based surveillance systems. The fundamental problem is the absence of a standardized, consistent, reproducible procedure for making the decision that a particular event is an "adverse drug reaction."
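
As a rough illustration of the statistical rate described in section B (adverse reactions divided by use of drug), the following Python sketch computes the rate overall and within patient subgroups, since the text stresses that the types of patients in the denominator must be known before results can be extrapolated. The field names (`took_drug`, `had_reaction`, `subgroup`) and the sample records are hypothetical, invented only for this example. The sketch also assumes that each event has already been judged to be, or not to be, an adverse reaction, which is exactly the judgment the text says has no standardized procedure.

```python
from collections import defaultdict


def reaction_rates(records):
    """Return the overall reaction rate and the rate within each subgroup.

    Numerator: exposed patients judged to have had an adverse reaction.
    Denominator: all patients who used the drug.
    """
    exposed = [r for r in records if r["took_drug"]]
    if not exposed:
        raise ValueError("no exposed patients, so the rate has no denominator")

    overall = sum(1 for r in exposed if r["had_reaction"]) / len(exposed)

    tallies = defaultdict(lambda: [0, 0])  # subgroup -> [reactions, exposed]
    for r in exposed:
        tallies[r["subgroup"]][1] += 1
        if r["had_reaction"]:
            tallies[r["subgroup"]][0] += 1
    by_subgroup = {g: n / d for g, (n, d) in tallies.items()}
    return overall, by_subgroup


# Hypothetical records, invented for illustration only.
records = [
    {"took_drug": True, "subgroup": "chest pain", "had_reaction": True},
    {"took_drug": True, "subgroup": "chest pain", "had_reaction": False},
    {"took_drug": True, "subgroup": "respiratory distress", "had_reaction": True},
    {"took_drug": True, "subgroup": "respiratory distress", "had_reaction": True},
    {"took_drug": False, "subgroup": "chest pain", "had_reaction": True},
]

overall, by_subgroup = reaction_rates(records)
print(overall)
print(by_subgroup)
```

The unexposed patient is left out of both numerator and denominator, so an event in someone who never took the drug cannot inflate the rate.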
D. The identification of adverse reactions

In all of the reports assembled at the various surveillance systems mentioned and at the registries that will be mentioned later, each of the tabulated "adverse reactions" was diagnosed as an act of individual clinical judgment. No clearly delineated and overtly specified method of operational procedures has been developed for making the decisions. The investigators may regularly supply definitions and classifications of "adverse reactions," but a defined classification is quite different from an operational identification.

We can define blood sugar as the amount of a monosaccharide, C6H12O6, that is present in 100 ml of whole blood, but this definition does not provide an operational identification of blood sugar. For an operational identification, we would need a set of sequential instructions, such as those that might be found in a laboratory manual describing a chemical method of measuring blood sugar. The directions would tell us to add a certain volume of blood to certain volumes of designated reagents under certain physical conditions; to place the resultant mixture in a spectrophotometer; and to convert the reading using certain pre-determined numerical factors. The result would be the value of blood sugar, operationally identified.

The medical and pharmaceutical literature currently contains many definitions that classify adverse drug reactions according to mechanism, existence, and severity. Their presumptive mechanism is usually classified as pharmacologic, allergic, or idiosyncratic. The pharmacologic reactions are further subclassified as primary excess effects, due to overdosage or over-effectiveness; and secondary effects on non-targets or at sites of injection. The causal relation of a drug to an adverse reaction is often classified as definite (or "causative"), probable, possible, or doubtful (or "coincidental"). The severity of an adverse reaction can be cited as fatal, severe nonfatal, moderate, or mild. But none of these definitions and classifications provides an operational identification or diagnostic algorithm for deciding whether a particular undesirable event was due to the accused drug, to some other drug, to an interaction of drugs, to the evolving course of the main clinical state, to the evolving course of a baseline co-morbid state, to a superimposed new co-morbid state, or to some intervening diagnostic or other therapeutic procedure.

A report from a Registry of Tissue Reactions to Drugs is the only reference I have been able to find in which deliberate attention has been given to the formal diagnostic reasoning used in this activity. Among the data that might be included to justify the decisions in the flow chart or criteria for making an operational identification are the known characteristics of the main clinical state and co-morbid state, the known patterns of response to the index and associated other drugs, the laboratory evidence of the drug's concentration in the patient, the time relationships between the adverse event and the intake of drugs, and the results of such additional tests as cessation of the drug (or "de-challenge") and its resumption (or "re-challenge"). In the absence of stated methods of operational identification for adverse reactions, investigative work in surveillance will continue to lack a fundamental ingredient of scientific evidence.

The "adverse reaction," as the basic element under investigation, is not identified with precise, distinctive, reproducible criteria. We have no idea of the amount of variability among the physicians whose nondescript judgment is used to decide whether an observed event is or is not an adverse drug reaction. There have been no procedures for standardization; no tests of consistency; and no assessments of reproducibility.

After we work our way through all the majesty of the computer print-out and the glory of the statistics, we find that the decision-making mechanism for identifying adverse reactions (for making diagnoses such as whether an episode of vomiting is due to digitalis, underlying heart disease, psychic tension, or spoiled food) depends on the vagaries of clinical judgment of an array of unstandardized physicians. The first main step towards developing a valid biostatistical science of pharmaceutical surveillance will not be in creating additional technologies of surveillance. The first step is to arrive at reproducible methods for identifying an adverse drug reaction.

E. Epidemiologic strategies of surveillance

Like other epidemiologic challenges in the discernment of cause-and-effect relationships, pharmaceutical surveillance requires a comparative assessment of rates. The possible epidemiologic strategies contain a complex, almost bewildering, array of different numerators and denominators, "control" groups, and methods of getting data. The reader is hereby warned that this section of the text, as befits the intricacies of the subject, is neither simple nor easy to follow.

1. Cohort procedures. The cohort approach is the most scientifically logical way of answering a question about cause and effect; and the incidence rates that emerge from a cohort study are direct and simple to understand. The denominator consists of the number of people exposed to the "cause," which is the drug under investigation. The numerator consists of the number of those people who developed the "effect," which is the cited adverse reaction(s). The ratio of this numerator and denominator is an incidence rate, denoting adverse reactions per people taking the drug. This rate is then compared with the incidence of similar adverse reactions in a group of people treated in some other manner. Several different procedures can be used for acquiring data about incidence rates in a cohort.

a. Prolective cohort. In a prolective study, the investigators begin the research before treatment is given and before any adverse reactions have occurred. Groups of people taking the investigated drug or being treated in some other way are assembled and followed to note the occurrence of the adverse reactions. This is the type of procedure used in the hospital surveillance systems mentioned earlier.

b. Retrolective cohort. In this approach, the numerators and denominators consist of the same type of data used for a cohort (the incidence of adverse reactions in a defined population of exposed or nonexposed people) but the data for the compared groups are assembled retrolectively, after the adverse reactions have occurred. Having noted some events that are believed to be adverse drug reactions, the investigator begins the research by collecting existing data, usually from patients' medical records, for the clinical course of a group of people who have received the suspected drug. By inspecting the data for the post-drug course of those people, the investigator determines the rate at which the suspected adverse reactions occurred. For comparison, the investigator assembles the recorded data of a similar group of people, treated in some other manner, and notes the rates of the same reactions in this control group. The data for the control group can be "concurrent," derived from people treated in some other way during the same calendar time interval in which the suspected drug was taken; or "historical," derived from an earlier group of patients with the same medical condition.

A retrolective cohort with a "historical control" group was the procedure employed when the hazards of thalidomide were first pointed out in the English language literature by W. G. McBride. Without any elaborate research grants or epidemiologic flamboyance, McBride described his work in a letter, written to the
editor of The Lancet, that is a model of thinking and reporting:

Congenital abnormalities are present in approximately 1.5% of babies. In recent months I have observed that the incidence of multiple severe abnormalities in babies delivered of women who were given the drug thalidomide . . . during pregnancy . . . (was) almost 20%.

c. Estimated cohort. In this circumstance, a group of patients receiving the suspected drug is not actually assembled and counted as the denominator of a cohort. Instead, the number of people taking the drug is estimated from its sales figures, supplied by the manufacturer. The numerator for this denominator is the number of suspected adverse reactions that have been reported in the geographic region to which the estimated denominator pertains. The data about the adverse reactions can be collected in a registry (as described later) or directly by the manufacturer of the drug. The rate in a comparative group is determined from "historical controls" or from analogous data assembled for some other drug that has the same type of clinical usage.

2. Other procedures. For all of the three methods just described, the frequency of adverse reactions was calculated as an incidence rate, with appropriately exposed (or non-exposed) people in the denominators and the corresponding adverse reactions in the numerators. In all of the methods that follow, the logic of cohort data is altered. The number of people receiving the drug is not determined and cohort incidence rates are not calculated. Instead, the decisions rest on several other types of epidemiologic rates.

a. Secular mortality associations. In this procedure, the basic research data are issued by the Bureau of Vital Statistics (or some appropriate counterpart), when it publishes the annual rate of death attributed to a specific disease in the general population. The pattern noted in a calendar sequence of these annual mortality rates is called a secular trend. Observing an apparently inappropriate change in the secular trend, an investigator may suspect that the change is due to an alteration in treatment for the disease. After assembling data about the time when a suspected pharmaceutical agent was introduced, the investigator notes the sales curves or other usage of the agent. The concomitant secular trends in both mortality and drug usage are then associated for a cause-effect conjecture. The inference drawn from such a secular mortality association was the basis for beliefs that aerosol bronchodilators had led to an increase in deaths from asthma in England and Wales.

b. 'Event' trohoc. This approach consists of a modification of the classical trohoc (or "case-control") study. The investigators begin by assembling a group of people (or "cases") who are known to have had the particular event that is suspected of being an adverse reaction to the drug. In this group of "cases," the investigators then determine the number of people who had previously taken the accused drug. For comparison, the investigators determine the previous usage of that drug in a "control" group of people selected because they have not had the event under scrutiny. The prevalence rates of the drug's usage in the "cases" and the "controls" are then compared directly or compressed into a "risk ratio." The "cases" are usually selected from in-patients at a hospital, and the "controls" can be chosen from other patients in the hospital or from suitable sources outside the hospital.

The population studied in this approach is called an event trohoc (rather than a disease trohoc) because the cases and controls are chosen according to the presence or absence of the event that is suspected of being an adverse drug reaction. An event trohoc was surveyed for the type of research used to suggest that oral contraceptive pills might lead to thrombophlebitis and that the use of certain estrogens in pregnancy might lead to vaginal carcinoma in the offspring. Although the modern-day hospital surveillance systems were established for performing cohort studies of drug effects, the investigators can also use the collected data for trohoc
research. Because of the positive relationship found in a trohoc study performed with data of a pharmaceutical surveillance system, 27 an interesting current controversy has arisen about the association between coffee-drinking and myocardial infarction. The relationship was denied in a cohort study performed by the Framingham epidemiologic team. 7

c. Registry collections. All of the other epidemiologic procedures to be described here are based on the analysis of data assembled at the institutional, 9 municipal, national, or international registries that have been established to solicit, receive, and store the adverse-reaction case reports voluntarily submitted by physicians or by other qualified reporters. Because of the way the data are assembled, they contain numerators without denominators. The data can be used to count the frequency of the adverse reactions associated with specific drugs and to provide warnings based on changes in frequency, 41 but the registry data alone provide no indication of the total usage of the drugs or the number of users who were reaction-free.

An appealing tactic with registry data is to convert the frequency counts of drug reactions into crude incidence rates for an estimated cohort of drug users. The number of drug reactions is the numerator; the denominator is estimated as noted earlier; and the comparison is with historical or other suitable control groups.

Other analyses of the registry data are based exclusively on information stored at the registries. From the many possible strategies, which have received thorough descriptions by Finney, 14, 15 only three will be mentioned here.

1. Secular trends. The investigator notes the frequency count of the number of reactions reported during successive calendar intervals. Suspicions are aroused by an unexplained increase in the secular trend of frequencies of the reports. (These secular "reaction" rates differ from the secular mortality rates mentioned earlier. A mortality rate contains three ingredients: the number of deaths per specified population per specified time interval. A secular "reaction" rate contains only two ingredients: the number of reactions reported per specified time interval.)

2. Registry trohoc. A diverse group of clinical entities will have been reported in the registry as adverse reactions for any drug, D. Suppose we are interested in a particular type of reaction, X. Using the trohoc principle, we can choose the "case" group to consist of all the reactions reported for D. We then note the prevalence of X among those reactions. For comparison, our "control" consists of the prevalence of X among the total number of reactions reported for some other drug or drugs.

3. Accrual rate. At any point in time, the registry will contain a certain number of reported adverse reactions for the drug. During an ensuing interval of time, new reports are acquired. The size of this increment, when divided by the previous number of reports, produces an accrual rate of new reactions for that drug during the cited time interval. This rate can be compared against the analogously calculated accrual rate for some other drug or for all drugs.
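
The registry calculations described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and all of the counts are hypothetical, chosen for the example, and the sketch takes the registry's reaction labels at face value even though the text questions how those labels are assigned.

```python
def crude_incidence(reported_reactions, estimated_users):
    """Crude incidence for an estimated cohort: reports / estimated users."""
    if estimated_users <= 0:
        raise ValueError("estimated number of users must be positive")
    return reported_reactions / estimated_users


def prevalence(reaction_counts, reaction_type):
    """Registry trohoc: share of a drug's reported reactions that are of one type."""
    total = sum(reaction_counts.values())
    if total == 0:
        raise ValueError("no reactions reported for this drug")
    return reaction_counts.get(reaction_type, 0) / total


def accrual_rate(previous_reports, new_reports):
    """Accrual rate: new reports in an interval / reports on file beforehand."""
    if previous_reports <= 0:
        raise ValueError("previous number of reports must be positive")
    return new_reports / previous_reports


# Hypothetical counts, invented for illustration only.
drug_d = {"X": 30, "other": 170}         # reactions reported for drug D
comparison = {"X": 10, "other": 190}     # reactions reported for another drug

case_prevalence = prevalence(drug_d, "X")          # "case" group
control_prevalence = prevalence(comparison, "X")   # "control" group
accrual = accrual_rate(previous_reports=200, new_reports=50)
incidence = crude_incidence(reported_reactions=200, estimated_users=40_000)

print(case_prevalence, control_prevalence, accrual, incidence)
```

Only `crude_incidence` uses a denominator of drug users, and that denominator is an estimate (for example, from sales figures); the other two quantities are computed entirely from reports already stored in the registry, which is why the text treats them as signals rather than as incidence rates.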
F. Cause signals, causes, rate signals, and rates

The different epidemiologic strategies that have just been described can be employed in at least four different ways. In one circumstance, the cause signal, we have no idea that a particular drug and a particular undesirable event are related. The purpose of the signal is to raise suspicions of this possibility. Once our suspicions have been alerted, we may then want to get additional evidence to decide whether the relationship between the drug and the event is actually causal, and whether the event should be regarded as an adverse reaction to that drug.

In a different circumstance, we begin with the belief that the event is an adverse reaction, and we want to know the frequency of its occurrence. We also might want to have a mechanism, the rate signal, to alert us to changes in frequency that might suggest the occurrence of new or unforeseen toxicity for the drug. The mechanism that produces a cause-signal may not necessarily be the mechanism for proving cause; and the mechanism that produces a rate signal may not always indicate the true rates of adverse reactions.

1. Cause signals. If an undesirable event is clinically unique, such as the sudden growth of feathers on a person's forearm, we may have no difficulty deciding that it is an extraordinary reaction and we may immediately suspect the associated drug. Our main quest will then be to decide the frequency rate of the reaction and its acceptability in exchange for the benefits of the drug. If the undesirable event is clinically rare, such as phocomelia or retrolental fibroplasia, a sudden increase in its usual frequency may suffice to trigger our suspicions that something unsatisfactory is happening.

If the undesirable event, however, is a relatively frequent clinical occurrence (such as cataracts, jaundice, peptic ulcer, or thrombophlebitis) doctors may not be ready at first to suspect that the event is an adverse drug reaction. The event may be initially ascribed to the main condition under treatment or to a co-morbid condition, rather than to a pharmaceutical reaction. Lasagna 29 has described a series of situations, including the bleeding associated with aspirin, in which adverse drug effects remained unrecognized for a long time.

Neither the trohoc nor the registry procedures can be used to generate such signals. Trohoc studies begin only after a suspicion exists; they are intended to explore the intensity of the possible causal relationship, not to signal a suspicion. Studies of registry data begin with data that were submitted by doctors whose suspicions have already been aroused and who have already made causal decisions. Secular mortality associations can suggest that an undesirable event is happening but can not per se signal that a particular drug (or class of drugs) is involved. Consequently, the principal method of generating cause signals for a particular drug is with cohort procedures that compare the events occurring in the drug-treated group and in a control group.

2. Causes. After a signal of suspicion has been raised, the next step may be to decide whether the drug can be held causally responsible for the associated event. If the indictment was based on cohort data, the available evidence may be suitable not only for suspecting an association but also for suggesting that the association is causal. Quite often, however, the cohort data may be both retrolective and unqualified, based on clinical hunches and recollections that are quite satisfactory for arousing suspicions, but inadequate for providing proof. The investigators may then look for more cogent evidence elsewhere.

Since conclusions about a causal relationship have been drawn before a registry report is submitted, the data of a registry cannot be used for assessing causal relationships. The epidemiologic pursuit of causal investigations must therefore depend on the use of cohorts, event trohocs, or secular mortality associations.

The secular associations are the scientifically weakest 5 of these activities, containing many uncertain elements that may produce misleading or distorted results. Event-trohoc studies are also scientifically weak 12 but can be strengthened when special precautions (which are seldom employed) are taken to eliminate or adjust for major sources of potential bias. Prolective cohort studies, although the most powerful of the scientific approaches, are often difficult (or extremely expensive) to conduct in suitable populations.

A well designed project using a retrolective cohort is a potentially effective although often overlooked research strategy for these goals. For example, the many trohoc investigations performed to study
the association between oral contraceptive pills and thrombophlebitis could also have been done, with probably no more effort or cost, as studies on retrolective cohorts, appropriately chosen from practicing physicians' rosters of patients receiving different forms of contraception. The potential bias involved in the non-random assignment of therapy for the cohorts could probably be identified and adjusted more readily than the analogous and additional forms of bias that are present in trohoc research. Nevertheless, the epidemiologic data routinely collected by practicing physicians (inside or outside of academia) are usually rejected as potential sources of valuable research information.

A retrolective cohort can yield data about the same kind of population, followed in the same clinical and chronologic direction, as a prolective cohort. 11 Because the data are not collected with the defined scientific standards that are possible in prolective research, however, the use of retrolective cohorts is usually spurned in favor of the standardized information available from the inverted logic of trohoc research. 12 The resolution of this issue depends on an important point in the philosophic orientation of science: do we want an imprecise answer to the right question or a precise answer to the wrong question?

3. Rate signals. If the "adverse reactions" have been thoroughly and reliably diagnosed, the techniques used with registry data (examining secular trends, drug trohocs, and accrual rates) can provide useful signals of changes in the incidence rate of reactions. The true rates, of course, cannot be determined from registry data, but could be approximated by the previously described technique of cohort estimation. (Another valuable source of rate signals is the adverse reaction case reports assembled by the manufacturers of the drug. These reports can be used in a manner analogous to that of registry data.) Event trohocs are used to explore cause, rather than rates of incidence; and secular mortality associations provide data about rates of death per disease, not rates of adverse reaction per drug. The prolective surveillance of a cohort would obviously provide signals about a change in rate; but retrolective cohort studies would usually not begin until a change in rate has been signalled by other sources of data.

4. Rates. Regardless of whether or not a rate signal has been received, the only way of determining the incidence rate of adverse drug reactions is to study a cohort of people receiving the drug. None of the registry procedures, secular associations, or trohoc investigations can provide an incidence rate. Even when the frequency of reactions is acquired from registry data or when a "risk ratio" has been calculated in trohoc research, the conversion of the results to an approximation of incidence has required an estimation of the size of the exposed cohort, using data from sales curves of the drugs or other suitable sources.

G. The goals of surveillance

After wading through the diverse terms, concepts, and tactics of the preceding two sections, the reader is probably ready to return to the main issue. Here, too, however, we encounter another strategy that has been imprecisely specified. What is the scientific purpose of pharmaceutical surveillance? Is it intended to provide an "early warning system" or a set of clinical guidelines for use of the drug?

1. Early warning system. Almost everyone would agree that a prime goal of surveillance is to provide early warning about a dangerous drug. In designing a system of surveillance, however, we would have to know what kind of danger. We would obviously want to be warned about clinical catastrophes, such as death, phocomelia, or retrolental fibroplasia. Do we also want to know, however, about less catastrophic events? What about clinical nuisances, such as transient skin rashes or jaundice? And what about reversible paraclinical abnormalities (such as elevations and depressions of white blood count, serum calcium, or blood urea nitrogen) that have
no manifestations in clinical signs or symptoms? To construct a system of surveillance, we would need to know whether it is supposed to detect the first, the first two, or all three of these dangers. The techniques of observation, the population to be observed, and the data to be acquired will obviously depend on what we want to accomplish.

Nevertheless, these distinctions are seldom specified in discussions of surveillance. Several years ago, for example, when a federal agency asked for bids on a contract to provide "surveillance," the instructions to the potential bidders contained no statements about which goals, if any, were to be covered in the proposal. Because the general methodologic problems are so enormous, a successful approach to surveillance might begin by detecting clinical catastrophes. If an effective system can be constructed for providing an early warning about catastrophes, the system might later be amplified to include clinical nuisances and paraclinical abnormalities.

2. Clinical guidance. A quite different approach to the surveillance issue is to view the process not as a warning system but as an evaluation system. The objective is the kind of clinical guidance that has been informally (and often anecdotally) developed many years after pragmatic experience with such drugs as aspirin, digitalis, and insulin. What emerges would presumably be a balanced evaluation of the drug, providing effective guidance for its future clinical usage. This is the "risk/benefit ratio" that is so frequently discussed but so seldom documented or quantified. The purpose of a formal surveillance system for new (or old) drugs would be to attain the guidance information more promptly, effectively, and quantitatively than might be done if we waited for the appropriate "distillation" of clinical experience alone.

H. Evaluation of epidemiologic strategies

Having contemplated the available epidemiologic methods and having noted the desired goals, we can now begin to evaluate the respective merits of different methods for different goals. One point about the methods should be noted immediately: in order for any method to work, it must be made to work. Whatever be the procedure used for surveillance, the procedure itself must receive
merely what is had but what is good about a drug. The clinical trials that were done before the drug was approved for marketing will
surveillance (or monitoring) to ensure that
have provided preliminary evidence about its usage and consequences. This informa-
members of the cohort are entered into it and that the necessary data are being collected. The supervisors of a registry of
is
to find out not
also
tion
will
pertain,
however,
only
cohorts of patients and dosages that
were studied
in the trials.
the
to <
f
drug
When
the
drug enters the general market, it will he used in a greatly expanded spectrum of people, clinical states, and durations. The purpose of the surveillance would be to cover the new spectrum of anticipated or ^anticipated benefits and the new specie
m
tics
tati\
With this informaAvailable, we could compare the qualicharacteristics and quantitative freof adverse effects.
quenc.
of the events that occur in the
and benefits. This type of comparison would allow us to assess the spectrun
of risks
its
planned
The
activities are
supervisors
system
must
of
check
a
being carried out.
cohort surveillance
that
all
appropriate
adverse reactions must encourage the submission of reports by any physician who observes a reportable reaction. Furthermore, with either a cohort or a registry system, none of the available "signals" will
be detected unless the collected data are frequently analyzed in search of signals.
As we consider the two main goals pharmaceutical surveillance, however, tain procedural
methods may seem
of
cer-
prefer-
able to others. 1. Early warning system. In an early warning system for clinical catastrophes, we would want to be sure that all possible
users of the drug (and all of the possible reactions to it) have the opportunity to enter the system. This desideratum cannot be achieved with the restricted focus of an institution-based procedure. An institutional surveillance system is confined to the cohort of people treated at the institution. In technologic operation, the system may attain statistical magnificence by acquiring numerator and denominator data, coding standardized forms, and applying computerized analyses; but the institutional system can not be relied upon to observe the totality of drug usage and reactions that can provide early warnings about clinical catastrophes. For example, pregnant ambulatory women taking thalidomide would not have been part of a hospitalized cohort and thalidomide might not have been part of the pharmacopoeia used in a large out-patient clinic. Consequently, the phocomelia that followed thalidomide might not have been detected by institution-based prolective procedures. (The thalidomide disaster was actually detected by analysis of retrolective and estimated cohorts.)

If we want to learn about any clinical catastrophe that may occur in a person using a drug, we cannot limit our surveillance to the enumerated group of people who are followed in a cohort. A catastrophic event can occur in anyone who receives the drug, and the event must have a good chance of being reported. For this reason, the best way of routinely getting warnings about catastrophes that can emerge from the entire group of drug users is to collect adverse reaction reports in a registry maintained either by the manufacturer or by some other governmental or medical organization. The registry technique becomes desirable here for the very reason that makes it epidemiologically unappealing: there are no defined populations to constrain the denominators. By placing no restrictions on the denominator, i.e., by not demarcating an observed cohort, the registry allows a "total sampling," because anyone receiving the drug has an opportunity to be included when an adverse reaction occurs.

With improvements in the thoroughness of reporting and with greater care in analyzing the cited adverse reactions, the case report procedure can provide a relatively inexpensive and satisfactory approach to the goal of an early warning system.

The fundamental problem in relying on case report data in a registry is the thoroughness of reporting. This problem can be solved or minimized if the act of reporting is made especially easy with such procedures as a toll-free telephone call to the collecting agency, by total reassurance to reporting physicians that the information will be kept suitably confidential, and by encouraging physicians to trust the motives of the collecting agency. Thus, if practicing physicians (for whatever reason) develop apathy, hostility, or distrust toward federal agencies, the likelihood of an agency's receiving case reports will be reduced, and its registry data may be too incomplete to be valuable.

An additional problem in relying on registry data is the vigilance with which the data are scrutinized and analyzed. For an early warning signal to be noted, someone must be constantly looking out for it. Certain aspects of this watchman role may be better fulfilled by a "commercial" registry, maintained by the manufacturer of the drug, than by a "non-commercial" registry maintained by a national or international health agency. For example, the first public warning about thalidomide in the U.K. was issued not by a "non-commercial" registry, but by the drug's manufacturer, who announced its withdrawal from the market two weeks before any adverse reactions had been reported in the English language literature. Had the "commercial" registries been even more vigilant on the European continent and in the U.K., the thalidomide catastrophe might have been noted sooner and its magnitude reduced. Since a drug's manufacturer can be (and has been) held financially liable for compensating the victims, the manufacturer would be expected to exert a concerned self-interest in maintaining registries that
detect dangerous drugs promptly, before the consequences become overtly catastrophic, financially as well as clinically.

While supplementing the manufacturer's vigilance for clinical disasters, the non-commercial registries can also be expected to provide signals, of the type noted earlier, about less dramatic clinical reactions that a manufacturer might not avidly scrutinize.

Another way of getting warnings about major clinical nuisances is through an institutional surveillance system. The extensive data assembled and analyzed in such a system could provide cause signals and other evidence for detecting non-dramatic reactions that might otherwise be overlooked. The main difficulty in using current surveillance systems for this purpose is the clinical imprecision with which the drug recipients and reactions have been identified. In many analyses, the reactions and drugs have been associated without further subdivision according to the clinical distinctions of the patients and conditions. For example, in surveillance studies suggesting adverse cardiac effects of amitriptyline, no analyses were done to indicate whether the amitriptyline-treated patients and the "controls" received the drug for the same clinical indication; and whether the compared groups were similar in the prognostic severity of their baseline cardiac condition. In order for causal indictments to be convincing, the patients receiving the compared treatments must be shown to have comparable clinical conditions at baseline. With improvements in clinical specification and comparison, the institutional surveillance systems can make invaluable contributions to the screening process that provides "early warning" of clinical nuisances and laboratory abnormalities.

An alternative to the case report system or the institutional cohort is the use of representative cohorts as described in the next section. Such a procedure is probably too expensive to be used only for delivering early warnings. Furthermore, the size of the followed cohorts would have to be enormous to allow surveillance of ample numbers of patients with each type of clinical condition, treated with each corresponding pharmaceutical agent.

2. Clinical guidance. An adequate early warning system could be attained with the methods that have just been described, but a clinical guidance system creates a quite different challenge. As noted earlier, we would want to learn all about the clinical usage of the drug: its good effects as well as its bad ones. For thorough evaluation of the total spectrum and frequencies of both risks and benefits, we would have to investigate a prolective cohort. We could not use registries, trohocs, or secular mortality tactics because they do not include the appropriate population or data. A retrolective cohort technique would also be unsatisfactory because the patients' original medical records may not contain all the information needed to document the diverse facets of risk and benefit.

Since a prolective cohort study is the only scientifically precise method of getting data about clinical usage of a drug, the main issue in research architecture is the choice of the cohort. What people are to be examined and how are they to be chosen? The most straightforward answer to this question is to take a random sample of users of the drug. The procedure can be arranged by following the fate of a randomly selected fraction of batches of the drug prepared by the manufacturer. Alternatively, a random sample of pharmacists can be enlisted as a collaborating research team. The cohort of patients can then be chosen as a random sample of the people to whom the pharmacists dispense the drug.

This type of "drug-cohort" investigation would not be easy to perform. It would require the creation of new types of research teams, working outside of institutional settings and enlisting suitable cooperation from manufacturers, pharmacists, physicians, and patients. Despite the obvious problems, such research may not be too difficult to do if the participants are properly approached. Pharmacists, practicing physicians, and their patients may welcome the opportunity to contribute to research surveys that are neither experimental nor esoteric, and that can provide better therapeutic knowledge for everyone.

The main disadvantage of studying such a drug cohort is that we would lack a control group. No analogous information would be available for the outcome of people with similar clinical conditions that were treated in some other way. To complete the picture, therefore, we would need to perform a
separate investigation. For this purpose, we would want to take a random sample of practicing physicians from whom we could get suitable data about the selected clinical conditions and their post-therapeutic outcomes. Here, too, the performance of the research would entail many logistic difficulties, but the difficulties may become minor if medical societies enthusiastically endorse the plans and if practicing physicians accept the opportunity to work in a new form of clinical investigation that arises not from isolated academic or federal cloisters, but from the real world of medical care.

With these two representative samples of drug users and patient conditions, we could cover the necessary spectrum of drug usage and clinical phenomena. The methods of implementing the basic design are beyond the scope of this discussion. The main point is that if we want a pharmaceutical surveillance system to provide effective clinical guidance about the use of drugs, the job will not be easy. We shall have to study what is valid rather than what is convenient, and we shall have to develop an appropriate coterie of investigative personnel rather than to rely merely on restricted applications of institutional technology. This new challenge in clinical epidemiology would allow the people who are most involved in the use of drugs (manufacturers, pharmacists, practicing physicians, and patients) to become collaborating investigators who help solve the problems that the drugs create. The questions that originate in the extensive world of medical care would be answered not from data assembled by armchair epidemiologists or institution-bound academicians, but from accounts of the events that actually occur in that world, as witnessed by representative members of the people who live in it.

Regardless of whether pharmaceutical surveillance is intended to provide early warnings or clinical guidance, we shall have to give more attention to the patient who gets the drug than to the technology of the surveillance or the manipulation of the statistics. The technology and the statistics are already well developed. What still remains as undeveloped scientific territory is the basic identification of an adverse reaction and of the clinical complexity that surrounds it.

References

1. Borda, I., Jick, H., Slone, D., Dinan, B., Gilman, B., and Chalmers, T. C.: Studies of drug usage in five Boston hospitals, J. A. M. A. 202:506-510, 1967.
2. Boston Collaborative Drug Surveillance Program: Interaction between chloral hydrate and warfarin, N. Engl. J. Med. 286:53-55, 1972.
3. Boston Collaborative Drug Surveillance Program: Adverse reactions to the tricyclic-antidepressant drugs, Lancet 1:529-531, 1972.
4. Boston Collaborative Drug Surveillance Program: Decreased clinical efficacy of propoxyphene in cigarette smokers, Clin. Pharmacol. Ther. 14:259-263, 1973.
5. Bradford Hill, A.: Principles of medical statistics, ed. 9, New York, 1971, Oxford University Press.
6. Cluff, L. E., Thornton, G. F., and Seidl, L. G.: Studies on the epidemiology of adverse drug reactions. I. Methods of surveillance, J. A. M. A. 188:976-983, 1964.
7. Dawber, T. R., Kannel, W. B., and Gordon, T.: Coffee and coronary heart disease, Am. J. Cardiol. 33:133, 1974. (Abst.)
8. DeNosaquo, N.: The registry on adverse reactions of the American Medical Association, Methods Inf. Med. 4:15-21, 1965.
9. Editorial: Reporting of adverse reactions to drugs, Can. Med. Assoc. J. 92:476-477, 1965.
10. Feinstein, A. R.: Clinical biostatistics. IX. How do we measure "safety and efficacy"? Clin. Pharmacol. Ther. 12:544-558, 1971.
11. Feinstein, A. R.: Clinical biostatistics. XI. Sources of 'chronology bias' in cohort statistics, Clin. Pharmacol. Ther. 12:864-879, 1971.
12. Feinstein, A. R.: Clinical biostatistics. XX. The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research, Clin. Pharmacol. Ther. 14:291-307, 1973.
i
An analysis ol diagnostic The construction ol clinical algorithms, Vale J. Biol. Med. 47:5-32, 197 Finney, D. [.: The design and logic ! a A.
einstein,
reasoning.
R.:
monitor
drug
ol
use.
Chroni(
|.
1S:7T
Dis.
'is
Kanarek,
l>
signalling ol adverse
Systemati<
|
Methods
reactions to drugs,
29 Lasagna,
1974 16
G.
Nan Brunl
I
in
patients.
The P.,
inn-..
tern,
<
and
I
Harris,
1
l>.ivis.
S.:
Ex
in
out-
I.
A. M. A and Watson.
I.
:
J.
Adverse drug
Pharmacoi
Ther,
k
I
Vdams, L
.
I
Short-term intense surveillance reactions,
Clin.
|.
\. Ulfelder, L Adenocarcinoma
Herbst,
maternal
ol
13:61-67,
and
II..
.
C.:
ciation
the
ol
stilbestrol
Poskanzer, Asso-
therap)
with
monitoring
hospital
Med.
Br.
drill's,
J.
adverse
ol
to
1:531-536, 1969.
Inman, W. II. W., and Adelstein, A. M.: Him- and tall ol asthma mortal it) in England and Wales in relation to use ol pressurized
W.
Inman.
in J.
Med
Br.
Institute of Pathology.
24. Jick, H., Slone, D., Borda,
Efficacy
S.:
tion
and
toxicity
age and
to
N.
sex,
I.
of
and Shapiro,
T.,
heparin
Engl.
rela-
in
Med. 279:
J.
284-286, 1968. 25. Jick,
G.
H..
V.,
Shapiro.
S.,
and Slone,
drug surveillance, 1455-1460, 1970
hensive 26. Jick,
H.,
the
effects
clinical
Bo
Lewis, G.
oral
Miettinen, O.
Heinonen, O.
P.,
S.,
analgesic
drugs,
Collaborative
1971.
Neff, R. K., Shapiro,
and Slone,
myocardial infarction. i
P.,
for assessing
Phakmacol. Theh.. 12:456-463,
fick, H.,
a
of
Lewis,
Compre-
M. A. 213:
A.
S.,
S.,
D.:
A new method
V.:
>.,
|.
Slone, D., Shapiro,
and Siskind, Clin.
ab-
B.
R.:
Ad-
Experience ol Marv during 1962, |. A. M. A.
Thalidomide and congenital Preventable drug reactions
—
Med. 284:1361-
|.
W.
Cornwell,
[.,
B.,
Dingwall-Fordyce, I., Turnbull, Weir, R. I).: Cardiotosieit v ot
B.
|.
I.,
and Huedv
Adverse drug
[.:
,
during hospitalization, Canad. Med.
97:1450-1457, 1967.
B. A., and Kosenow, W.: Thalidoand congenital abnormalities, Lancet I:
Pteiller,
mide'
1962.
15-16,
37 Reidenberg, M. M.: Registry of adverse drug reactions, J. A M. A. 203:31-34, 1968. 38. Royall, B. W.: Monitoring adverse reactions Chronicle^ 27:469-475, 1973. to drugs.
WHO
39. Sartwell,
P.
and
embolism
Masi,
E., B.,
demiologic
A.
Arthes,
T.,
and Smith, H.
E.:
contraceptives:
oral
case-control
Am.
study,
F.
G,
Thrombo-
An J.
epi-
Epi-
demiol. 90:365-380, 1969. 40. Slone,
D.,
Jick,
Feinleib,
Bellotti,
Borda,
H.,
C, and Gilman,
Chalmers, T.
I.,
Muench,
M.,
IL,
Lipworth,
B.:
Drug
L.,
surveil-
lance utilizing nurse monitors, Lancet 2:901903, 41
1966.
Smidt, N. A., and to
inpatient
survev,
C:
Adverse
A comprehensive
hospital
McQueen,
drugs:
reactions
N.
Z.
Med.
E.
J.
76:397-401,
1972.
M. P., and Doll, R.: Investigation ot between use ot oral contraceptives and thromboembolic disease, Br. Med. [. 2:199-
42. Vessey,
relation
Miettinen, O.
Siskind,
P.,
congenital
1961.
G.:
k.,
and
J.,
Assoc.
36
Diagnostic problems and methods
Armed Forces
and
Per-
1.
1:45, 1962.
C. Crooks,
I).
reactions
C,
in drug-induced diseases. Parts I, II, and III, Washington, D. C, 1966, 1967, and 1968, respectively, American Registry of Pathology,
196
7: 157-170.
L.:
Greene, C.
age.
1969.
1971.
35. Ogilvie,
deaths
women ot child-bearing 2:193-199, 1968.
23. Irey, \. S.:
i
M.
W,
II.
of
adverse drug
ol
amitriptyline, Lancet 2:561-564, 1972.
and Vessey, M. P.: Infrom pulmonary, coronary, and cerebral thrombosis, and embolism vestigation
1,
W.
O'Mallev,
1969
aerosols, Lancet 2:279-2S5,
22.
reactions
H.,
de-
reactions.
Melmon, k 1368,
L973
vagina.
R.
Factors
drugs cause.
causes and cures, \. Engl.
J.:
tumor appearance in young women. \ Engl. Med. 284:878-881, 1971. J. .md Wade, 20. Hurwitz \ I..: Intensive
21.
33.
adverse ding
ol
Pharmacol.
II.
N.
abnormalities. Lancet 2:1358, 1961.
11:802-807,
and Fallon,
.
J.
diseases
hospital
McBride,
pharmacist-based monitoring sys
\
E.:
Med. 280:20-26,
Thalidomide
190:1071-107
34, Moir,
is Gra)
1).
Med.
drug
Fletcher
217:567-572, 1971.
[.
1970.
19.
The
Sweet.
"..
A.
McDonald, M. C, and Mackav, verse
I..
Drug Moni-
Permanente
Kaisei
i\
i
\1
monitoring drug reactions
perience
Cardnei
Collen
D.,
toring System, 17.
1..:
normalities, Lancet 31.
Friedman, E.,
Engl.
W.:
30 Lenz,
Med. 13:1-10,
Inf.
Eaton,
Y
reactions,
spect. Biol.
Finne)
and
W
V.
Sidel,
J.,
P.,
Center,
1973.
termining physician reporting
1963. 15.
Medical
Cniveisitv
Med. 289:63-67,
J.
28 Koch-Weser,
III.
1
14.
Boston
gram, Engl.
Drug
A
35-37, 1968.
44 Wintrobe, M. M.: The therapeutic millennium and its price: Adverse reactions to drug, in Talalav, P.: Drugs in our society. Based on a conference sponsored by
Coffee
University,
from the
kins Press.
D.:
report
205, 1968.
43 Weston, J. K.: The present status of adverse drug reaction reporting, J. A. M. A. 203:
Surveillance
Pro-
Baltimore,
The fohns Hopkins The Johns Hop-
1964,
SECTION FOUR
MATHEMATICAL MYSTIQUES AND STATISTICAL STRATEGIES
Mathematical theories often produce fears or fascinations that divert medical readers and investigators from the many scientific problems discussed in the previous three sections. Readers who get flustered by parametric estimators, pooled variances, α levels, standard errors, confidence intervals, regression coefficients, β errors, and logistic transformations may be too confused to look behind the statistical facades and scrutinize the quality of the scientific architecture and data. Alternatively, after becoming infatuated by the idea that statistical methods are panaceas for the intellectual ailments of research, investigators may neglect basic scientific challenges in structure, bias, or data and may assume that any problems will be solved by multiple regressions, discriminant functions, analysis of covariance, or partitionings of chi-square.

These inappropriate attitudes about the power of statistical techniques are often encouraged by the way the techniques are taught. They may be presented for investigators as a set of "cookbook" instructions to be followed obediently, even if the "cook" has no idea of what is being done or why. For statisticians, the presentation may receive so inherently mathematical a focus that the cook becomes a connoisseur of ovens, stoves, and kitchen maneuvers, while remaining oblivious to the fundamental culinary objectives in texture, flavor, and taste.

The four essays in this section are concerned with basic statistical ideas that have been confusing, distracting, or scientifically hazardous for both investigators and statisticians.

The first set of ideas is concerned with the major scientific distinctions between estimating a parameter and contrasting a difference. In most tests of "statistical significance," the investigator evaluates the likelihood that chance is responsible for the difference found when two groups are contrasted. The t-test, chi-square, and other conventional procedures used for this purpose are derived from mathematical theories that were developed for a different purpose: to estimate a parameter for a single group. The conversion from a one-group estimation to a two-group contrast was desirable because it offered major advantages in computation, but it has made the statistical process indirect, convoluted, and hard to understand. With the ease of computation that is offered by modern electronics, investigators may prefer to use an alternative method: the permutation tests that are direct, straightforward, and easy to understand.

The second essay deals with problems in the mathematical duplexity of certain directions (the "one-way" dependence or "two-way" interdependence of a relationship) and the choice of a unilateral or bilateral zone of probability for "rejection" of the celebrated null hypothesis.

The third essay is devoted to an issue that is seldom considered in conventional instruction. The decision that a difference is "statistically significant" is somewhat like making a positive diagnosis, and the P value or probability zone established as an α level denotes the possibility that the diagnosis is falsely positive. If the difference is regarded as not statistically significant, however, the investigator takes the chance of making a false negative diagnosis. This latter problem, which rarely receives adequate attention when investigators study statistics, involves the contemplation of β levels and a different type of P value. The α and β levels of probability are the bases for a statistically popular indoor sport: the calculation of sample size in a research project.

The fourth essay contains a catalog of defects in the way investigators employ statistical tactics to summarize or display the data reported in scientific literature. The standard error is regularly used improperly in two ways: either instead of a standard deviation to show the spread of data or instead of a confidence interval to show the estimated zone of location for the mean. A ± sign frequently appears between a mean and a standard something, but the something is not identified. The graphical portrait that is needed to illustrate the relationship of two variables is often omitted and, instead, the results are inefficiently or misleadingly summarized with correlation coefficients or regression coefficients. Finally, and perhaps worst of all, the only reported results may be the statistical calculations for P values or F values; and the actual data may be neither presented nor summarized.
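The distinction drawn above between a standard deviation, a standard error, and a confidence interval can be made concrete with a small calculation. The following is only a sketch under stated assumptions: the four measurements are illustrative values, the variable names are hypothetical, and the t critical value is copied from a standard table rather than computed.

```python
import math
import statistics

# Illustrative measurements for one small group (assumed values).
values = [20, 15, 12, 6]
n = len(values)

mean = statistics.mean(values)

# Standard deviation: describes the spread of the individual observations.
sd = statistics.stdev(values)  # sample SD, denominator n - 1

# Standard error: describes the uncertainty of the mean, not the spread of data.
se = sd / math.sqrt(n)

# Two-sided 95% critical value of Student's t for n - 1 = 3 degrees of
# freedom, taken from a standard t table (an assumption of this sketch).
T_CRIT_95_DF3 = 3.182

# Confidence interval: the estimated zone of location for the mean.
ci_low = mean - T_CRIT_95_DF3 * se
ci_high = mean + T_CRIT_95_DF3 * se

print(f"mean = {mean:.2f}")
print(f"SD   = {sd:.2f}  (spread of the data)")
print(f"SE   = {se:.2f}  (precision of the mean)")
print(f"95% CI for the mean = {ci_low:.2f} to {ci_high:.2f}")
```

Reporting "mean ± something" without saying which of these quantities follows the ± sign leaves a reader unable to tell whether the spread of the data or the precision of the mean is being described.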
CHAPTER 20
Permutation tests and 'statistical significance'

This chapter originally appeared as "Clinical biostatistics. XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 2)." In Clin. Pharmacol. Ther. 14:898, 1973.

The previous paper in this series was concerned with the advantages and pitfalls of using a randomization process in sampling from a parent population. By depending on chance alone, the randomized selection removes any element of human choice in the sample, and allows the results to be interpreted with inferences based on statistical probability. These statistical inferences are particularly useful when the parent population is being sampled to estimate the value of an unknown parameter, such as the mean of a selected variable. From the results found in the random sample, the investigator can demarcate a zone of values, called a "confidence interval," and can have a specified "confidence level" for the probability that the true value of the parameter lies within the demarcated zone. The main pitfall of random sampling is that it does not indicate the reliability of the particular sample that was selected in the research. The randomization process provides no safeguards against the 5% of chance occasions in which a 95% confidence interval will be erroneous because the "luck of the draw" produced a substantially distorted sample.

An antidote for this hazard is stratification. If cogent subgroups (or strata) can be defined beforehand, the members of the sample are selected randomly from within each stratum. If this type of pre-stratification is not feasible, the sample is drawn with a non-specific randomization, the cogent strata are demarcated afterward, and the data are analyzed within strata. The results found in the individual strata can then, if desired, receive a "standardized adjustment" to create a single value that estimates the desired parameter appropriately for the entire parent population.

As noted previously, the selection of cogent strata and the analysis of potential bias are crucial scientific requirements that have seldom been adequately fulfilled in modern clinical and epidemiologic research. The requirement is particularly necessary in these forms of research because the groups of people under investigation have almost never been assembled by random sampling. Almost all of the existing surveys of the causes, occurrence, and treatment of disease, and all of the existing experimental trials of therapy, have been based on groups that were not
287
288
Mathematical mystiques and
statistical strategies
chosen randomly. The groups have conpeople conveniently available to the investigators at such locations as hos-
sisted of
physicians' regis-
plants,
industrial
pitals,
search.
the
noted
two
A. The
main
uses
of
statistical
inference
The
discover)
that
surprising shock to investigators
a
work
other
in
domains.
who
With inanimate
chemists achieve random sam-
materials,
ples routine!)
and easily
as
an aliquot of
homogeneous mass. With general human populations, social and political scientists give careful attention to methods of sampling and getting random selections. With a
medical populations, however, the investiare almost never random.
gative samples
Why
medical
are
researchers
so
delin-
quent"
The answer the
two
to this
question
based on
is
different purposes of statistical in-
A
ference.
socio-political
scientist
often
wants to estimate a populational parameter, whereas a medical investigator usually wants to contrast a difference in two groups.
a
goal
A random
for estimating
sample
is
mandatory
a parameter,
but has not been regarded as equally imperative for
dom
most
of
epidemiologic
re-
investigator usually wants to
maneuvers under study the results exposed vs. non-exposed
effects of different
groups or
major
and
the
—
"case-control"
in
vs.
or in "treated" vs. "untreated."
medical researchers seldom use random samples often comes as
The
in
peoplc\
not
is
clinical
compare the in
schools, etc.
tries,
parameter forms of
diseased,
He
is
sel-
interested in populational parameters
or their confidence intervals.
He wants
know whether the magnitude
to
of the dif-
ference observed
in
ular set of data
more than can readily be
is
the results of a partic-
expected from numerical chance alone. The' contrast of the numerical difference
with different maneuvers is a fundamental purpose in many research activities, and is the primary objective of most clinical and epidemiologic investigations. This goal does not appear to be well understood, however, by some of the leading statisticians of our era. For example, so excellent a statistician as G. W. Snedecor has stated that "the purpose of an experiment is to produce a sample of observations which will furnish estimates of the parameters of the population together with measures of the uncertainty of these estimates." This misconception of medical research is relatively widespread among associated
statisticians,
and may be responsible many current problems
for
contrasting a difference.
some
consultants in biomedical investigation.

1. Estimation of parameters. In polling political beliefs, our main concern is to get an estimate of the quantitative partition of public opinion. There is no contrast of any imposed experimental or control maneuvers. We simply want to find out what the people think politically, and we want to know how accurately the sample reflects the views of the public at large. We therefore want to estimate the values and confidence intervals of populational parameters. A random sample being crucial for these activities, a political scientist pays close attention to getting a suitable collection of people for the survey.

2. Numerical evaluation of an analytic contrast. The estimation of a populational

For analytic contrast, an investigator seldom cares about estimating the parameter of some unexamined theoretical parent population, and seldom wants to measure the uncertainties expressed with such calculations as "standard error of the mean." He has performed a single act of research on a single collection of people, who were divided into two groups. He has noted a difference in the numerical data for the contrasted groups, and he wants to know whether this numerical difference might arise by chance alone. Suppose an investigator has reported success rates of 75% in a treated group and 25% in a control group. Before drawing any further conclusions, we immediately want to know whether enough people
sible
were included
what Mainland has called the "eye test." We could draw a reasonable conclusion by just looking at the data. The decision would be more difficult if the 75% and 25% came from such numbers as 9/12 vs. 5/20, or 3/4 vs. 10/40. For these and for the many other contrasts whose numerical distinctions cannot be immediately judged with an "eye test," we want some cerebral mechanisms to supplement the ocular per-
in;
z, y, x, w; and so on. The number of different permutations for a group of four different objects is 24 (= 4 x 3 x 2 x 1). Thus, each distinctive quartet will appear in 24 ways among the 1680 arrangements. We therefore divide 1680 by 24 and get 70 as the number of distinctive quartets that could be formed by dividing eight objects into two groups of four each.
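The counting argument above can be checked with a few lines of arithmetic. The following is only a minimal sketch in Python; the letters used as labels for the eight people are an assumption made for illustration, and the standard-library functions do no more than repeat the division of 1680 by 24 described in the text.

```python
from itertools import combinations
from math import comb, factorial, perm

# Ordered ways to fill four places from eight objects: 8 x 7 x 6 x 5.
ordered_arrangements = perm(8, 4)

# Each distinctive quartet is counted once for every ordering of its members.
orderings_per_quartet = factorial(4)

# Dividing removes the duplicate orderings.
distinct_quartets = ordered_arrangements // orderings_per_quartet

# Cross-check by listing the quartets directly (the letters are labels only).
people = "stuvwxyz"
listed_quartets = list(combinations(people, 4))

print(ordered_arrangements, orderings_per_quartet, distinct_quartets)
print(len(listed_quartets), comb(8, 4))
```

Listing the quartets directly is feasible here only because the groups are tiny; the division by 4! is what keeps the count manageable for larger groups.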
a. Data expressed in means. To illustrate this procedure, let us consider the same data that we examined earlier to perform the t test. The results in Group A are 20, 15, 12, and 6 units, with a mean of 13.25. The results in Group B are 7, 3, 2, and 1 units, with a mean of 3.25. If these eight people were arbitrarily divided into pairs of quartets, how
Half of these 70 arrangements are similar but opposite. Thus, the quartet w,x,y,z might be in Group A and the quartet s,t,u,v might be in Group B; or vice versa. Consequently, there are really 35 distinctively different pairs of quartets to be considered. Those arrangements for the numbers 20, 15, 12, 7, 6, 3, 2, and 1 are shown in Table II, together with the means and difference in means for each of the 35 arrangements. Exchanging the contents of Groups A and B, and reversing signs for the differences in means, we could obtain the remaining 35 arrangements that complete the 70 possibilities for those eight numbers.

*The exclamation point in the formula refers to the "factorial" of a number. Thus 5! = 5 x 4 x 3 x 2 x 1. By convention, 0! = 1.

Table II. Means and differences in means for 35 arrangements of eight observations divided into two groups

Group A          Group B          Mean, Group A    Mean, Group B    Difference in means, A - B
20, 15, 12, 7    6, 3, 2, 1       13.5             3.0              10.5
20, 15, 12, 6    7, 3, 2, 1       13.25            3.25             10.0
20, 15, 12, 3    7, 6, 2, 1       12.5             4.0              8.5
20, 15, 12, 2    7, 6, 3, 1       12.25            4.25             8.0
20, 15, 12, 1    7, 6, 3, 2       12.0             4.5              7.5
20, 15, 7, 6     12, 3, 2, 1      12.0             4.5              7.5
20, 15, 7, 3     12, 6, 2, 1      11.25            5.25             6.0
20, 15, 7, 2     12, 6, 3, 1      11.0             5.5              5.5
20, 15, 7, 1     12, 6, 3, 2      10.75            5.75             5.0
20, 15, 6, 3     12, 7, 2, 1      11.0             5.5              5.5
20, 15, 6, 2     12, 7, 3, 1      10.75            5.75             5.0
20, 15, 6, 1     12, 7, 3, 2      10.5             6.0              4.5
20, 15, 3, 2     12, 7, 6, 1      10.0             6.5              3.5
20, 15, 3, 1     12, 7, 6, 2      9.75             6.75             3.0
20, 15, 2, 1     12, 7, 6, 3      9.5              7.0              2.5
20, 12, 7, 6     15, 3, 2, 1      11.25            5.25             6.0
20, 12, 7, 3     15, 6, 2, 1      10.5             6.0              4.5
20, 12, 7, 2     15, 6, 3, 1      10.25            6.25             4.0
20, 12, 7, 1     15, 6, 3, 2      10.0             6.5              3.5
20, 12, 6, 3     15, 7, 2, 1      10.25            6.25             4.0
20, 12, 6, 2     15, 7, 3, 1      10.0             6.5              3.5
20, 12, 6, 1     15, 7, 3, 2      9.75             6.75             3.0
20, 12, 3, 2     15, 7, 6, 1      9.25             7.25             2.0
20, 12, 3, 1     15, 7, 6, 2      9.0              7.5              1.5
20, 12, 2, 1     15, 7, 6, 3      8.75             7.75             1.0
20, 7, 6, 3      15, 12, 2, 1     9.0              7.5              1.5
20, 7, 6, 2      15, 12, 3, 1     8.75             7.75             1.0
20, 7, 6, 1      15, 12, 3, 2     8.5              8.0              0.5
20, 7, 3, 2      15, 12, 6, 1     8.0              8.5              -0.5
20, 7, 3, 1      15, 12, 6, 2     7.75             8.75             -1.0
20, 7, 2, 1      15, 12, 6, 3     7.5              9.0              -1.5
20, 6, 3, 2      15, 12, 7, 1     7.75             8.75             -1.0
20, 6, 3, 1      15, 12, 7, 2     7.5              9.0              -1.5
20, 6, 2, 1      15, 12, 7, 3     7.25             9.25             -2.0
20, 3, 2, 1      15, 12, 7, 6     6.5              10.0             -3.5

By inspecting the results of Table II, we can get the answers to the questions asked earlier. An "exceptional" difference in means of 10 units or more in favor of Group A occurs twice in the 35 arrangements shown in Table II. If we completed the rest of the table to form 70 arrangements, we would find that a difference of means of 10 units or more in favor of B would also occur twice. Thus, at a two-sided level of interpretation, for any difference of 10 units or more, the probability is 4/70 = 0.05714. At a one-sided level of interpretation, for the difference of 10 units in favor of A, the probability is 2/70 = 0.02857.
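The enumeration that underlies Table II and the two probabilities can be reproduced mechanically. The sketch below is an illustration in Python, not a reconstruction of any program used for the original essay; the function name and its return layout are assumptions. It forms every division of the eight observations into two quartets, computes the difference in means for each, and counts the arrangements that are at least as extreme as the observed difference of 10 units.

```python
from fractions import Fraction
from itertools import combinations

group_a = [20, 15, 12, 6]
group_b = [7, 3, 2, 1]


def exact_permutation_test(a, b):
    """Enumerate every division of the pooled data into groups of the same sizes."""
    pooled = a + b
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = sum(a) / n_a - sum(b) / n_b

    differences = []
    for chosen in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in chosen)
        differences.append(sum_a / n_a - (total - sum_a) / n_b)

    count = len(differences)
    # One-sided: arrangements favoring Group A by at least the observed amount.
    one_sided = Fraction(sum(d >= observed for d in differences), count)
    # Two-sided: arrangements at least that extreme in either direction.
    two_sided = Fraction(sum(abs(d) >= abs(observed) for d in differences), count)
    return observed, count, one_sided, two_sided


observed, count, one_sided, two_sided = exact_permutation_test(group_a, group_b)
print(observed, count)
print(float(one_sided), float(two_sided))
```

All of the means here are multiples of 0.25, so the floating-point comparisons are exact; for data that do not divide so cleanly, comparing sums rather than means would be the safer choice.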
These results are different from what we found earlier with the t test. At a two-sided level of interpretation, P was between 0.025 and 0.02 with the t test, but was 0.057 with the exact random permutation test. Thus, if we were adhering to a strict $\alpha$ level of below 0.05 for "sta
a one-sided
level
ever, both results
was below
0.012">
0.029 with
tin
From Table {
II.
According
to
in
the
2 15 II,
calculated
means
in
observed
12
ot
of
should
distribution
V and L7.85 none howevci mkI
confidence ')">',
Foi
noted
empiricallj
the
differences
ol
35
tin
each
in
pairs
ol
The
table
of the
arrangement.
many different arrangements to be considered. [The number of possible arrangements is $15!/(8! \times 7!) = 6435$.]
Fortunately, an easier
wav
ing the desired information
by
R. A. Fisher.
1
the table has the con-
It
shown
struction
of determin-
was developed
with
earlier
the
letters
F.N, Fisher showed that the pro-
a.b
portion
of
the
permutations
would
that
\'
is
B! SI F!
N! alb! eld!
between Table exceeded
For the particular table
we
observed, this
value would be
in
samples
$\dfrac{8!\,7!\,9!\,6!}{15!\,7!\,1!\,2!\,5!}$
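The factorial expression for the observed table can be evaluated directly. The sketch below is in Python and rests on an assumption: the cell counts 7, 1, 2, and 5 are inferred from the factorials in the expression (they reproduce the marginal totals 8, 7, 9, and 6), since the table itself is not legible here. The function name is hypothetical. The result is the probability of this one table only; the P value of the Fisher test would also add the probabilities of any more extreme tables with the same margins.

```python
from fractions import Fraction
from math import comb, factorial


def single_table_probability(a, b, c, d):
    """Probability of one fourfold table with cells a, b / c, d and fixed margins."""
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    n = row1 + row2
    numerator = factorial(row1) * factorial(row2) * factorial(col1) * factorial(col2)
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return Fraction(numerator, denominator)


# Assumed cells: 7 and 1 in the group of eight, 2 and 5 in the group of seven.
probability = single_table_probability(7, 1, 2, 5)
possible_arrangements = comb(15, 8)  # ways to split 15 people into groups of 8 and 7

print(probability, float(probability))
print(possible_arrangements)
```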
li.nl
that
They include both
the pos-
pf and the possibility
test
draw
the sampling distribution for
statistic at
each degree of freedom,
the curves of each array of relative fre-
quencies, and to calculate the areas that
der the curve beyond each value of the tistic
that
Pi-
To determine
con-
100 degrees of freedom
1
ference in the two proportions, p, and p 2 Regardless of whether we are interested in p t
P2
which shows the curve of the
x
which the chi-square
plied.
For the one-sided
the particular chi-square curve that
is
pertinent for the 2
calculation of
relative frequencies of
P
degree of free-
1
example of the sampling
Fig. 2 gives an
culate the value of
chi-square,
at
0.025.
tribution of chi-square for
X
or chi-square.
2
contemplate only the right hand
this distribution.
of those individual imaginary samples, t
x
dom.
"sampling distribution" of the
"test statistic"
number of "degrees of freedom"
6
5
Every-
theoretical parent population in the sky. find
0.05
the
is
thing else takes us into the world of mathe-
we
=
Values of X 2
statisti-
were actually observed. This
the data that
thing
such as
we
First
4
3
2
the results that
of significance, however, everything
becomes more complicated.
last
area
test,
gives us the actual probability values
random rearrangements of
for
permutation
a
such as the Fisher exact probability
test,
lie
un-
test sta-
would be a formidable chore. Fortunately, we can be spared all this work because our statistical colleagues have already done it for us. They have worked out the complete details of each theoretical sampling distribution for each test statistic at each "degree of freedom". The details have been organized so that we need not even look at the distribution curve to erect a perpendicular line and measure an "external" area of probability. This area has already been measured for us. It becomes the P values that are listed in the familiar tripartite tables that show a confluence of three items: the value of t (or whatever test statistic is being used); the number of degrees of freedom; and the associated value for P. In using these tables, however, we still have to decide whether the P value is unilateral ("one-sided") or bilateral ("two-sided").

If the alternative hypothesis is bilateral, [...] $\mu_1$ is either greater than or less than $\mu_2$ [...]. If the alternative hypothesis is unidirectional, such as $\mu_1 < \mu_2$, [...] the probability should correspondingly have only one direction.

To illustrate this point, suppose we have found a mean value of 11.3 in Group A and a mean of 14.4 in Group B. According to the null hypothesis, we want to know the likelihood of finding a difference of 3.1 units by chance. Suppose each group contains 51 members, with the standard error estimated as 1.7 units for the population. The associated value of t would be 3.1/1.7 = 1.82. If our test is two-sided, we are really asking for the probabilities that are found on both sides of the symmetrical t-distribution: the probability of getting a value of t greater than 1.82 and the probability of getting a value of t less than -1.82. The sum of those two probabilities is what emerges in the conventional bilateral test. When we look in a table of two-sided probabilities for t at 100 degrees of freedom, we see that P = .1 for t = 1.661 and P = .05 for t = 1.982. Consequently, for our observed t of 1.82, we find that .05 < P < .1, so that the result is not "statistically significant".

On the other hand, if we expected Group B to have a larger value than Group A, so that our null hypothesis was that $\mu_A \geq \mu_B$, the subsequent test would be one-sided. We would want to know only the probability of encountering a value of t greater than 1.82. To find this information, we can either go to a set of tables that show the one-sided probabilities or we can take half the probability values listed in a table of two-sided results. For the situation just described, we would get .025 < P < .05, and the result would now be "statistically significant". Thus, according to the way we formulate the null hypothesis and use the t-tables, the same set of data may or may not be "statistically significant".

When the procedure is a chi-square test, rather than a t-test, the situation is somewhat trickier. On seeing the asymmetrical "one-tail" curve of the chi-square distribution, users of the test may become confused and may conclude erroneously that the associated P value refers to a unilateral probability. The fact is, however, that the idea of two-sidedness is built into the calculation of chi-square itself. In a t-test for $\bar{x}_A = 10$ vs. $\bar{x}_B = 12$, the t value is opposite to what we would get if $\bar{x}_A = 12$ and $\bar{x}_B = 10$. If $p_A = .10$ and $p_B = .12$, however, the chi-square value is identical to what we would get for $p_A = .12$ and $p_B = .10$. Because the probabilities associated with the asymmetrical chi-square curves are always two-sided, we cannot find a one-sided result by looking for a value on the "other side" of the curve. We simply take half the associated value of probability. For example, in the customary table of chi-square values at one degree of freedom, P is 0.10 when $\chi^2$ is 2.706 and P is 0.05 when $\chi^2$ is 3.841. Consequently, a $\chi^2$ value of 3.12 would not be "significant" in a two-sided test because 0.05 < P < 0.10. If the test were one-sided, however, this same value of $\chi^2$ would become "significant" at P < .05 because the associated probability values would be halved.

These illustrations of both the permutation and the "test statistic" methods of getting a P value have shown why an investigator will usually prefer the unilateral tests of probability. They are more consistent with the unidirectional design of a scientific hypothesis and they also offer a "better" P value for the same investment in research data. The results are always more likely to be "statistically significant" if the probabilities are evaluated in a one-sided manner.
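The table lookup in the t-test illustration above can be imitated in a few lines. The following Python sketch is only an illustration: the function name is hypothetical, and the two table entries are the critical values quoted in the text for 100 degrees of freedom rather than values computed from the t-distribution. It brackets P for the observed t of 3.1/1.7 and shows how halving the tabulated probabilities moves the same result across the .05 boundary.

```python
# Two-sided table entries quoted in the text for 100 degrees of freedom:
# (critical value of t, two-sided P).
T_TABLE_100_DF = [(1.661, 0.10), (1.982, 0.05)]


def bracket_p(t_observed, table, one_sided=False):
    """Return (lower, upper) bounds on P; None means the table gives no bound."""
    factor = 0.5 if one_sided else 1.0  # one-sided P is half the two-sided P
    lower = None  # P is greater than this
    upper = None  # P is less than this
    for t_critical, p_two_sided in sorted(table):
        if abs(t_observed) >= t_critical:
            upper = p_two_sided * factor
        else:
            lower = p_two_sided * factor
            break
    return lower, upper


t_observed = 3.1 / 1.7  # difference in means divided by its standard error
two_sided_bounds = bracket_p(t_observed, T_TABLE_100_DF)
one_sided_bounds = bracket_p(t_observed, T_TABLE_100_DF, one_sided=True)

print(round(t_observed, 2))
print(two_sided_bounds, one_sided_bounds)
```

The halving is legitimate only when the observed difference lies in the direction stated beforehand in the one-sided hypothesis; the sketch does not check that condition.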
Many statisticians, however, are reluctant to accept the idea of allowing a directional scientific decision to affect the operation of a duplex statistical procedure. One of the most vigorous statements of the statistical ideology was provided by Langley¹⁷ as follows:

"There is a lot of muddled thinking about this particular aspect of statistical inference, even in high places. . . . The whole aim of statistical tests is to eliminate guesswork and to put inductive logic on a mathematical footing, and this means that these tests must remain completely objective. This impartiality, this freedom from human foible, is only possible if we stick to two-sided probabilities. . . . (The problem) is solved simply by using two-sided probabilities as a routine for all significance tests."

This adamant opposition to the use of one-sided probabilities is particularly likely to arise among statisticians who have consulted extensively for research in education, psychology, and the social sciences. In these domains, the investigator is seldom able to study the effects of interventional maneuvers and may have difficulty specifying a dependent relationship or a directional hypothesis. Consequently, certain investigators sometimes try to generate their scientific hypotheses after the data have been analyzed rather than beforehand, and the investigators may often want to juggle the statistical hypothesis into a unilateral direction in order to get the P values down below the magic marker level of 0.05. These research strategies have been thoroughly debated in an extensive series of publications of which many statisticians seem lamentably unaware. Anthologies of the controversy over hypothesis testing and the number of "sides" for probability are presented in references 21 and 23.

In the domain of biologic and clinical science, the ability to study interventional maneuvers (with either experiments or surveys) often allows the investigator to formulate a unidirectional scientific hypothesis. Accordingly, one might expect contemporary biometricians to be reasonably flexible about whether the statistical test should be one-sided or two-sided. To check this expectation, I went through my collection of 18 books that are devoted to medical or biologic statistics. I found, to my surprise, that two of the most famous books, by Fisher¹⁴ and by Bradford Hill³, contained no mention (or at least none that I could find) of the distinction between a unilateral or bilateral direction for probability. A medical reader who relied on these two books might never discover that such decisions have to be made whenever he uses a table of probability values.

The remaining textbooks could be divided into three categories. The describe-but-don't-prescribe books contain an account of the two kinds of decisions, but do not seem to provide any instructions on how to make the decision. The even-handed-approach books, as exemplified by Schor²² and by Huntsberger and Leaverton¹⁵, describe the two different kinds of hypotheses and indicate the circumstances in which each hypothesis should be applied. The other books containing a two-sides-are-almost-always-better-than-one argument range from Campbell's view⁴ that "one-sided tests are justified only rarely but ... we ought always to consider their possible relevance" to Armitage's assertion¹ that "it will be safe to assume that significance tests should almost always be two-sided". Mainland¹⁹, who must have met a trauma of investigators who wanted to alter their hypotheses after the data were analyzed, prefers two-sided tests because "we should beware of the temptation to lower our standards by giving ourselves a better chance of obtaining a 'positive' result".

My own view is in accord with the even-handed school. Since the decision should always be based on scientific rather than statistical considerations, there is no reason for a statistician to approach the decision with mathematically authoritative pontifications or preconceptions. The investigator should be obligated to state, before the analysis, whether his scientific hypothesis is unidirectional or bidirectional; the statistical hypothesis should then be appropriately one-sided or two-sided; and the unilateral or bilateral aspects of probability should follow accordingly.
From time to time, a statistical consultant will encounter an investigator who cannot state any form of direction. This difficulty usually arises not in deciding whether to go one way or two ways in the statistical null hypothesis, but because the scientific hypothesis itself is either malformed or amorphous. The research is directionless because it is aimless. In this situation, the statistical consultant can do everyone a service not by conservatively applying the "objective" two-sided tests of statistical "significance", but by advising the investigator to re-think the goals and aims of the research itself.

F. The size of the rejection zone

We can now turn our attention to a magnitude, rather than a direction, that is an integral part of statistical hypothesis testing. At the end of either a one-sided or two-sided test is a statistically blessed region called the rejection zone. It is the place where the investigator can happily set aside the null hypothesis and proclaim those magical words (sacred to editors, reviewers, and granting agencies): "statistical significance". The rejection zone is demarcated by the choice of a fixed probability level, which is called $\alpha$. The value of $\alpha$ is usually taken to be either .05, .01, or .001. If the P value in the research data is below the chosen level of $\alpha$, we reject the null hypothesis, state that the results are "statistically significant", and conclude that the observed difference is real. This is the "positive" conclusion that an investigator usually wants to attain in doing a test of "significance". The level of $\alpha$ establishes the proportionate risk that the conclusion will be falsely positive: that a correct null hypothesis has been erroneously rejected. If the null hypothesis is true, we would want to accept it and draw the "negative" conclusion that the results are "not significant". The likelihood of being correct in this decision will be $1 - \alpha$. Consequently, if $\alpha = .05$, and if we make decisions for 100 instances in which the null hypothesis is true, we will be right 95 times and wrong 5 times. The value of $1 - \alpha$, which is thus equivalent to the specificity of a diagnostic test, is also the source of statistical concepts of "confidence". If $\alpha = .05$, our confidence level is $1 - \alpha$, or 95%, that we will be right when we accept the null hypothesis and conclude that a difference is "not significant". As $\alpha$ gets smaller, our "confidence" rises in the specificity of the negative decision.

The smaller the magnitude of $\alpha$, the greater is the exuberance with which "significance" is proclaimed when the P value falls below $\alpha$. For some investigators, if P is below .05, .01, or .001, the upward verbal declension of significance ascends from "statistically significant" to "highly statistically significant" to "very highly statistically significant". For others, the sequence may be significant, highly significant, and conclusive²⁰; or significant, decisive, and determinant.² I have sometimes heard the sequence of super, wow, and eureka. In computer print-outs, the machine's Guide-Michelin expression of symbolic grandiloquence is respectively restrained to *, **, or ***.

A sober scientist, encountering all this verbal or symbolic jubilation about a P value, may wonder how these sacred boundaries were ordained and why the sanctification should arise from the test of a statistical hypothesis, in contrast to the analysis of a scientific hypothesis and its associated results. In a later essay in this series, I shall give detailed discussion to the perversion of the word "significance" as one of the many intellectual pollutants that an inappropriate use of statistical theory has brought to modern medical science. For the moment, however, we can focus on the choice of the magic level of $\alpha = .05$.

The role of this number in demarcating "significance" has become so widely accepted and worshipped that one might expect to find a record of the time and place when the apotheosis occurred. No such record, however, seems to exist. I have looked through a series of books devoted to the history of probability, statistics, and science, but the historians do not seem to have taken note of an event so prominent in the development of a statistical hegemony over scientific decisions.

According to Donald Mainland,¹⁸ the statistical use of the term "significance", although usually ascribed to R. A. Fisher, was probably introduced in 1896 by Karl Pearson. Pearson's boundary for dividing "non-significance" from "significance" was an entity called the probable error, an idea that has since become archaic. Fisher was probably the person who moved the boundary to .05. This choice seems to have arisen from a mathematical phenomenon that could be easily converted into a mnemonically convenient statistical tool. In a Gaussian distribution of data, 95% of the values are in a zone spanned by the mean plus or minus 1.96 standard deviations. Because 1.96 was so close to 2, the number 2 could be easily remembered and the expression $\mu \pm 2\sigma$ (or $\bar{x} \pm 2s$) became a quick, simple shorthand for denoting the zone in which the "common" values of a distribution would be encountered. The 5% of values that lay outside this zone would be regarded as uncommon or "rare". The next step in the reasoning was easy to take. An event that occurred in the outside 5% zone could be regarded as different enough from the other 95% of events to be called "significant", and to be deemed unlikely to have happened by chance. The fact was, of course, that such events could occur regularly by chance, with an occurrence rate of about 1 in every 20 chance occasions. Nevertheless, a boundary was needed for making individual decisions and .05 seemed as good a boundary as any other, particularly since it had the pleasant relationship with $2\sigma$.

Some of Fisher's language elsewhere is quoted by Mainland¹⁹ as follows:

"If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. . . . We shall not often be astray if we draw a conventional line at .05 and consider that (lower) values . . . indicate a real discrepancy. . . . It is convenient to take this (.05) point as a limit in judging whether a deviation is to be considered significant or not."

Fisher did not succeed in getting these concepts universally accepted by the statistical establishment. In what is generally accepted today as the "bible" of theoretical statistics, Kendall and Stuart¹⁶ make no mention of .05, .01, or any other specific number as the boundary between what they call the critical region and the acceptance region. Furthermore, the idea of "statistical significance" does not appear anywhere in the three volumes of the Kendall and Stuart text. In a footnote, the authors explain that they "shall not use" the term because it "can be misleading".

For the world of psychologic and social science research, however, and for many biometric priests and their clinical acolytes, Fisher's words and boundaries have now become sacred writ. Research publications are accepted or rejected; clinical trials are maintained or abruptly terminated; pharmaceutical agents are allowed on the market or withdrawn; and scientific reputations are made or lost, all on the basis of the magisterial phrase and number: statistical significance at P $\leq$ .05.

G. The size of the substantive difference

The historians who ultimately review the course of twentieth century clinical science will probably not be surprised that an act of critical scientific judgment was converted into an arbitrary numerical shrine. So many judgmental procedures have been obliterated during the twentieth century worship of technology, so many crucial human attributes have been neglected or deliberately omitted during the collection of "hard" data, so many inadequately designed and inadequately analyzed statistical enumerations have become established as the standards of epidemiologic research, and so much effort has been made to arrive at objectivity, even at the sacrifice of sensibility, that the conversion of scientific hypotheses to P values will probably be regarded as merely a minor note in the dominant intellectual cacophony of the era.

It is also not surprising that scientists confronted with complex decision-making would have searched for a reliable and effective guideline. Anyone confronted with difficult decisions would seek whatever good advice might be available. What the historians will probably regard as astonishing, however, is that the "enlightened" scientists of the twentieth century would have been so obsequious in accepting a guideline that was neither reliable nor effective. The P values that arise from statistical "tests of significance" are unreliable guidelines to scientific decisions because P values are totally dependent on the size of the groups under investigation. No matter how trivial or foolish the scientific hypothesis and no matter how petty or inconsequential the difference that is being analyzed, the results will be "statistically significant" if the size of the sample is large enough.
For example, in a test of the difference in means between two groups each of size n, with means $\bar{x}_1$ and $\bar{x}_2$ and variances $s_1^2$ and $s_2^2$, the value of t is

$t = \dfrac{\sqrt{n}\,(\bar{x}_1 - \bar{x}_2)}{\sqrt{s_1^2 + s_2^2}}$

Because of the multiplication by $\sqrt{n}$, the value of t will rise or fall (and the P value will get correspondingly smaller or larger) merely with changes in the size of n, if $\bar{x}_1$, $\bar{x}_2$, $s_1$, and $s_2$ remain the same. Analogously, in a test of the difference in percentages $p_1$ and $p_2$ between two groups each of size n, the value of chi-square is

$\chi^2 = \dfrac{2n\,(p_1 - p_2)^2}{(p_1 + p_2)(2 - p_1 - p_2)}$

Chi-square thus equals 2n times a particular "function" of $p_1$ and $p_2$. Regardless of the individual values found for $p_1$ and $p_2$ or their difference, chi-square will rise or fall and the P value will change correspondingly, according to the size of n.

For this reason, an investigator who observes an impressive difference in $\bar{x}_1 - \bar{x}_2$ or in $p_1 - p_2$ is entirely justified in performing a t-test or a chi-square test to determine whether n is large enough to indicate that the result is unlikely to arise by chance. Beyond that probabilistic assessment, however, the investigator engages in a weird distortion of scientific reasoning if he makes decisions that depend on the value of P itself, or on degrees of "significance", or if he allows trivial differences to become magnified into "statistical significance" merely because the size of n was large enough to make anything become "significant". The use of P values in this manner has been an especially unfortunate aspect of modern "biostatistics", and the trap has been particularly treacherous for analysts of the huge numbers that may sometimes appear in epidemiologic data.

In addition to being unreliable, P values are not an effective guideline to scientific decisions. The magnitude of P has nothing to do with the quantity that should ordinarily concern a scientist: the size of the difference. In the days before investigators became infatuated with statistical tests, a scientist was obligated to establish more than a hypothesis and more than a direction for the hypothesis. The scientist also had to decide about an increment. This increment (the difference in the results of the contrasted groups, as expressed by $\bar{x}_1 - \bar{x}_2$ or by $p_1 - p_2$) was what might be called substantive significance or clinical importance. An investigator who observed a 28% improvement rate in the treated group, and 26% in the controls, might have no interest whatsoever in testing the null hypothesis that $p_1 = p_2$ or even that $p_1 < p_2$. To decide that the observed difference was important, the investigator would want $p_1$ (the value in the treated group) to exceed $p_2$ by a substantial increment such as 10%. The scientific hypothesis in the research would then be $p_1 \geq (p_2 + .10)$, or $(p_1 - p_2) \geq .10$. For this purpose, the opposite (or null) hypothesis would be that $p_1 < (p_2 + .10)$.

With this statement of a scientifically important increment, the simple algebra of the statistical hypotheses has been invaded by another symbol. It is $\Delta$, the magnitude of the increment that is substantively significant. This increment seldom receives any major attention in statistical textbooks or discussions, possibly because the size of $\Delta$ cannot be determined with any act of statistical conjecture or computational prestidigitation. To pay attention to $\Delta$, the statistician must acknowledge the primacy of scientific judgment.

Despite the neglect by statisticians, the size of $\Delta$ is usually what makes an investigator decide to use tests of statistical significance. If the observed difference, d, exceeds the size of a chosen increment, $\Delta$, an astute investigator will want assurance that the result is "statistically significant". If d is too small, the investigator may not bother with any probabilistic calculations.

Once we recognize that traditional scientific inference depends on $\Delta$, whereas traditional statistical inference depends on $\alpha$, we are ready to contemplate yet another aspect of direction in the differences between scientific and statistical reasoning. This new directional activity is usually labelled with the Greek letter $\beta$. In statistical terms, $1 - \beta$ refers to the "power" of a significance test. In scientific terms, $\beta$ refers to the directional aspect of being right or wrong when conclusions are drawn about $\alpha$. Expressed in diagnostic terms, $\beta$ is the rate of false negative decisions about accepting the null hypothesis, and $1 - \beta$ is the "sensitivity" of a positive decision.
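The dependence of P on group size, described earlier in this section, can be made concrete by holding the observed results fixed and letting n grow. The sketch below is in Python and uses the two formulas given above for equal-sized groups; the function names are hypothetical, the means and standard deviations in the t example are invented for illustration, and only the 28% versus 26% contrast and the boundary of 3.841 come from the text.

```python
from math import sqrt


def t_value(n, mean_1, mean_2, s_1, s_2):
    """t for two groups of equal size n, from the formula in the text."""
    return sqrt(n) * (mean_1 - mean_2) / sqrt(s_1 ** 2 + s_2 ** 2)


def chi_square(n, p_1, p_2):
    """Chi-square for two proportions in groups of equal size n."""
    return 2 * n * (p_1 - p_2) ** 2 / ((p_1 + p_2) * (2 - p_1 - p_2))


CHI_SQUARE_AT_P_05 = 3.841  # one degree of freedom, two-sided P = .05

# The same 28% vs. 26% contrast, examined at several group sizes.
for n in (100, 1000, 4000, 10000):
    value = chi_square(n, 0.28, 0.26)
    verdict = "significant" if value > CHI_SQUARE_AT_P_05 else "not significant"
    print(n, round(value, 2), verdict)

# Hypothetical means and standard deviations: quadrupling n doubles t.
print(round(t_value(100, 11.0, 10.5, 3.0, 3.0), 2),
      round(t_value(400, 11.0, 10.5, 3.0, 3.0), 2))
```

Nothing about the two-percentage-point difference changes from one row to the next; only n does, which is the point of the argument about unreliable guidelines.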
These conclusions are regularly discussed (seldom with any real attention to $\Delta$) in statistical textbooks, but the associated reasoning now appears most frequently in another one of the twentieth century's favorite applications of statistical numerology: "the estimation of sample size" for a clinical trial. The strategy, tactics, and abuse of this procedure will be deferred for separate discussion in a future installment of this series.

References

1. Armitage, P.: Statistical methods in medical research, New York, 1971, John Wiley & Sons, Inc., pp. 159 and 104.
2. Atkins, H.: Conduct of a controlled clinical trial, Br. Med. J. 2:377-379, 1966.
3. Bradford Hill, A.: Principles of medical statistics, ed. 9, New York, 1971, Oxford University Press.
4. Campbell, R. C.: Statistics for biologists, Cambridge, 1967, Cambridge University Press, p. 61.
5. Cornish, E. A.: Preface to Reference 14.
6. Feinstein, A. R.: Clinical biostatistics. X. Sources of 'transition bias' in cohort statistics, Clin. Pharmacol. Ther. 12:704-721, 1971.
7. Feinstein, A. R.: Clinical biostatistics. XI. Sources of 'chronology bias' in cohort statistics, Clin. Pharmacol. Ther. 12:864-879, 1971.
8. Feinstein, A. R.: Clinical biostatistics. XIX. Ambiguity and abuse in the twelve different concepts of 'control', Clin. Pharmacol. Ther. 14:112-122, 1973.
9. Feinstein, A. R.: Clinical biostatistics. XX. The epidemiologic trohoc, the ablative risk ratio, and 'retrospective' research, Clin. Pharmacol. Ther. 14:291-307, 1973.
10. Feinstein, A. R.: Clinical biostatistics. XXI. A primer of concepts, phrases, and procedures in the statistical analysis of multiple variables, Clin. Pharmacol. Ther. 14:462-477, 1973.
11. Feinstein, A. R.: Clinical biostatistics. XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 2), Clin. Pharmacol. Ther. 14:898-915, 1973.
12. Feinstein, A. R.: Clinical biostatistics. XXV. A survey of the statistical procedures in general medical journals, Clin. Pharmacol. Ther. 15:97-107, 1974.
13. Fisher, R. A.: Design of experiments, ed. 8, New York, 1966, Hafner Publishing Co., p. 2.
14. Fisher, R. A.: Statistical methods for research workers, ed. 14, Edinburgh, 1970, Oliver & Boyd, Ltd., p. 177.
15. Huntsberger, D. V., and Leaverton, P. E.: Statistical inference in the biomedical sciences, Boston, 1970, Allyn & Bacon, Inc., pp. 150-151.
16. Kendall, M. G., and Stuart, A.: The advanced theory of statistics (in three volumes), London, 1963, Charles Griffin & Co.
17. Langley, R.: Practical statistics (paperback), London, 1968, Pan Books Ltd., pp. 143 and 146.
18. Mainland, D.: The significance of "nonsignificance", Clin. Pharmacol. Ther. 4:580-586, 1963.
19. Mainland, D.: Elementary medical statistics, ed. 2, Philadelphia, 1964, W. B. Saunders Co., pp. 222 and 330.
20. Miller, D. A.: Significant and highly significant, Nature 210:1190, 1966.
21. Morrison, D. E., and Henkel, R. E., editors: The significance test controversy: a reader, Chicago, 1970, Aldine Publishing Co.
22. Schor, S.: Fundamentals of biostatistics, New York, 1968, G. P. Putnam's Sons, Inc., p. 157.
23. Steger, J. A., editor: Readings in statistics for the behavioral scientist, New York, 1971, Holt, Rinehart and Winston, Inc.
CHAPTER 22
Sample size and the other side of statistical significance
This chapter originally appeared as "Clinical biostatistics. XXXIV. The other side of 'statistical significance': alpha, beta, delta, and the calculation of sample size." In Clin. Pharmacol. Ther. 18:491, 1975.

'Statistical significance' is commonly tested in biologic research when the investigator has found an impressive difference in two groups of animals or people. If the groups are relatively small, the investigator (or a critical reviewer) becomes worried about a statistical problem. Although the observed difference in the means or percentiles is large enough to be biologically (or clinically) significant, do the groups contain enough members for the numerical differences to be 'statistically significant'?

For example, if Group A has a mean of 9.8 units and Group B has a mean of 17.3 units, the difference may be biologically impressive because the second mean is almost twice as large as the first. On the other hand, if the two groups each contain only a few members, or if the data are widely dispersed around the mean values, our biologic impression may not be sustained numerically. The statistical assessments may show that the observed difference could quite easily have arisen by chance alone.

The statistical procedures used to test the numerical 'significance' of an observed difference between two groups have been discussed in several previous installments (4, 6) of this series. The calculations used for the procedures depend on the kind of basic data in which the results were expressed. For dimensional data, the results would be cited as means and the usual statistical procedure would be a t test. For nominal or existential data, the results are expressed as frequency counts that are converted to proportions, percentages, or rates; and the usual statistical procedure would be a chi-square test. (To avoid making unproved assumptions about the distribution of a hypothetical parent population, we can replace the t test by a Pitman permutation test and the chi-square test by a Fisher exact probability test.) If the data are expressed in ranked ordinal values, the usual statistical procedure would be the Wilcoxon rank sum test or the Mann-Whitney U test.

Although each of these tests is chosen according to the type of data under examination, the underlying statistical strategy is identical. It follows the same principle that was used to prove theorems in elementary school geometry. We assume that a particular conjecture is true. We then determine the consequences of that conjecture. If the consequences produce an obvious absurdity or impossibility, we conclude that the original conjecture cannot be true, and we reject it as false.

When this reasoning is used for the statistical strategy that is called "hypothesis testing", the argument proceeds as follows. We have observed a difference, called δ (delta), between Groups A and B. To test its 'statistical significance', we assume, as a conjecture, that Groups A and B are actually not different. This conjecture is called the null hypothesis. With this assumption, we then determine how often a difference as large as δ, or even larger, would arise by chance from data for two groups having the same number of members as A and B. The result of this determination is the P value that emerges from the statistical test procedure.
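The idea of asking how often a difference as large as δ would arise by chance can be made concrete with a small permutation calculation of the Pitman type. The sketch below is only an illustration under stated assumptions: the eight data values are hypothetical (chosen so that the group means are 9.8 and 17.3, the figures used in the example above), and the function name `permutation_p_value` is not from the text.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P value for a difference in two means.

    Under the null hypothesis the group labels are arbitrary, so every way of
    splitting the pooled values into groups of the original sizes is equally
    likely. The P value is the share of splits whose difference in means is
    at least as large as the observed difference.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical data, not from the text: means are 9.8 and 17.3.
group_a = [8.1, 9.5, 10.2, 11.4]
group_b = [15.0, 16.8, 17.9, 19.5]

p_value = permutation_p_value(group_a, group_b)
print(f"observed difference = {mean(group_b) - mean(group_a):.1f}")
print(f"two-sided permutation P = {p_value:.4f}")
```

With only four members per group there are just 70 possible splits, so even the most extreme arrangement cannot yield a P value below 2/70. This is the small-sample limitation that the rest of the chapter develops.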
A. Statistical reasoning and diagnostic analogies

At this point in the reasoning, the statistical strategy departs from what was used to prove theorems in grade school geometry. In geometry, there were no problems in deciding whether or not to reject the assumed conjecture, because the geometrical logic regularly brought us to a situation that was impossible, i.e., the P value was zero. In such a circumstance, the original conjecture could not be maintained. It had to be wrong because it could not possibly be right.

With statistical inference, however, the rest of the world can seldom, if ever, be so conclusive. The P value that emerges from the calculations in the statistical test may be as small as .000001, or even smaller, but it never becomes zero. There is always a possibility, however infinitesimal, that the observed difference arose by chance alone. Accordingly, unlike the situation in geometry, we cannot use a statistical test to prove with total certainty that the original conjecture is wrong. There is always a chance of 1 in 20, or 1 in 50, or whatever the P value is, that the original conjecture (i.e., the null hypothesis) is right.

To draw statistical conclusions, therefore, we must establish a concept that was not necessary in the inferential reasoning of grade school geometry. This concept is called an α (alpha) level of "significance".

1. The α level and 'diagnostic specificity'. The α level is used to demarcate the rejection zone. If the P value that emerges from the statistical test is equal to or smaller than α, we decide that we shall reject the null hypothesis. In doing so, we demarcate α as the risk of being wrong in this conclusion, but it is a risk we must take in order to have a statistical mechanism for drawing conclusions. In geometrical inference, α was always zero. In statistical inference, α is customarily chosen to be .05, i.e., 1 in 20, although some investigators (or editors) may select other boundaries such as .1 or .01.

A previous paper (6) of this series contained a discussion of the arbitrary way in which .05 became designated as the customary level of α. The designation came, not as a pronouncement of the Deity or from the deliberations of an international committee, but from a habit of R. A. Fisher. Noting that a conclusion had to be drawn after a statistical test was performed, and knowing that an α level was necessary to draw the conclusion, Fisher chose α to be .05. The rest of the world followed.

This reasoning makes the statistical appraisal of 'significance' resemble a clinical diagnostic test. In using .05 or whatever other α level is selected as boundary for the rejection zone, an investigator specifies the deliberate chance that he wants to allow of being wrong when he decides that the observed difference in his two groups has not arisen simply by chance, and that the difference is real. There will exist a probability of magnitude P, however, that the null hypothesis is correct and that the conclusion is wrong.

The α level is thus analogous to the risk of getting a false positive result in a diagnostic test (3). Suppose we make a diagnosis of lung cancer after finding a positive result in the Pap smear of a patient's sputum. If the patient does in fact have lung cancer, the diagnostic decision is correct: a true positive. If the patient does not have lung cancer, the diagnosis is wrong: a false positive conclusion. In the customary situation of hypothesis testing, we want to make a positive decision, rejecting the null hypothesis and concluding that the observed difference is real. The α level indicates the statistical risk that this decision may be wrong and that there is actually no difference between the groups.

The value 1 − α can therefore be likened to the specificity of a diagnostic test, which is the likelihood that the test will have a negative result when the disease is absent. The value of 1 − α denotes the likelihood of being correct when we do not reject the null hypothesis and thereby conclude that the observed difference is not 'statistically significant'.

The kind of reasoning used in forming a 'null hypothesis' and in establishing levels of α and 1 − α is based on the idea that the 'disease' we are looking for, i.e., a real difference between the two groups, is absent. The chance of a false positive diagnosis is α; and the chance of a true negative diagnosis is 1 − α.
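The reading of α as a false positive rate can be checked by simulation. The following sketch is an illustration only, with assumptions that are not in the text: both groups are drawn from the same Gaussian population with a known standard deviation of 1, the test is a two-sided z test, and the function name, group size, number of trials, and random seed are arbitrary choices. When no real difference exists, the share of 'positive' decisions should be close to α and the share of correct 'negative' decisions close to 1 − α.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(alpha=0.05, n_per_group=20, trials=20000, seed=1):
    """Fraction of null experiments that are (wrongly) declared significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided boundary
    # Standard error of a difference in two means when sigma = 1 in each group.
    std_error = sqrt(2 / n_per_group)

    rejections = 0
    for _ in range(trials):
        # Both groups come from the SAME population: the null hypothesis is true.
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        z = (mean_a - mean_b) / std_error
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"false positive rate   = {rate:.3f}")
print(f"'specificity' (1 - a) = {1 - rate:.3f}")
```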
Consequently, if we set α at .05, we take a 5% chance of being wrong if we reject the null hypothesis (i.e., draw a positive conclusion) and a 95% chance of being right if we concede the null hypothesis (i.e., fail to draw a positive conclusion).

This analogy to a false positive result and to the specificity of a diagnostic test does not appear in the general statistical nomenclature for the α level, which is usually called Type I error. Probably the main reason is the way in which investigators and statisticians use the inverted logic that customarily goes into hypothesis testing. To an investigator, the test is usually done to demonstrate that a biologically impressive difference is also statistically impressive. The investigator thus regards a doubly negative phenomenon (rejection of the null hypothesis) as a positive event, analogous to getting a positive result in a diagnostic test. In general statistical usage, however, the acceptance or rejection of the null hypothesis is seldom associated with any negative or positive intellectual virtues. Thus, for statistical definitions, a Type I error consists of rejecting the null hypothesis when it is actually true.

2. The calculation of $P_A$. To apply these principles requires the calculation of a P value for the observed data and the observed difference δ. This particular P value, which is the conventional one usually cited in medical literature, will be designated here as $P_A$ to distinguish it from other P values that will be discussed later. The procedures used for calculating $P_A$ are presented in detail in textbooks of statistics and will be summarized as follows:

a. The simplest and most generally applicable statistical strategy rests on the idea of a "critical ratio" or "z-score". For any single value randomly chosen from a Gaussian distribution whose constituent values are $x_1$, $x_2$, $x_3$, etc., a critical ratio can be calculated as $z = (x_i - \mu)/\sigma$. In this formula, μ is the mean of the parent distribution; σ is its standard deviation; and $x_i$ is the single value with which we are concerned. If the data under consideration consist of a series of means of samples, each of which has n members randomly drawn from the same parent population, these means make a distribution of a new variable, $\bar{x}$, which has its own mean, μ, and a standard deviation that can be calculated as $\sigma/\sqrt{n}$. The critical ratio for any of these means, $\bar{x}$, is $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$.

b. If we want to analyze a difference in the means of two samples, $\bar{x}$ and $\bar{y}$, we seek the distribution of a new variable, $w = \bar{x} - \bar{y}$, whose variance can be calculated in one of two ways. If we do not assume equality of the population variances for x and y, then the variance of w equals the sum of the variance of $\bar{x}$ and the variance of $\bar{y}$. If we assume, by the null hypothesis, that the population variances of x and y are equal, we can create a common "pooled variance" for w.

If the values, $\bar{x}$ and $\bar{y}$, happen to be proportions, $p_1$ and $p_2$, we can calculate the common variance as follows:

1. If we assume that the population values for $p_1 = p_2$, the common mean for both samples is $p = (np_1 + np_2)/2n = (p_1 + p_2)/2$. The common variance of the difference in proportions is $2p(1 - p)/n$.

2. If we do not assume that the population values for $p_1 = p_2$, the common variance is calculated as $[p_1(1 - p_1) + p_2(1 - p_2)]/n$.

The corresponding z values for these two cases would be $(p_1 - p_2)/\sqrt{2p(1 - p)/n}$ in the first instance and $(p_1 - p_2)/\sqrt{[p_1(1 - p_1) + p_2(1 - p_2)]/n}$ in the second. In general, the formula for calculating the z value of a difference in means is

$$z = \frac{\text{difference in means}}{\text{"standard error" of the difference}}$$

c. Regardless of whether a particular critical ratio of z is calculated for a single mean or for a difference in means, the z values have some distinctive, important properties. If we drew a large series of samples, consisting of single means or differences in means, and if the sample sizes were themselves large, and if a z score were calculated for each sampling, the array of z values would approximate a Gaussian distribution,
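The two versions of the critical ratio for a difference in proportions can be compared directly. The sketch below is a minimal illustration rather than the author's own procedure: the function names are hypothetical, equal group sizes are assumed (as in the formulas above), the Gaussian tail area is taken from Python's `statistics.NormalDist`, and the counts 16/23 and 19/23 are borrowed from the hospital satisfaction example that appears later in the chapter.

```python
from math import sqrt
from statistics import NormalDist


def z_common_p(p1, p2, n):
    """Critical ratio when the population values of p1 and p2 are assumed equal."""
    p = (p1 + p2) / 2                      # common mean for equal-sized groups
    return (p2 - p1) / sqrt(2 * p * (1 - p) / n)


def z_separate_p(p1, p2, n):
    """Critical ratio when the population values are not assumed equal."""
    variance = (p1 * (1 - p1) + p2 * (1 - p2)) / n
    return (p2 - p1) / sqrt(variance)


def one_sided_p(z):
    """Upper-tail area of the standard Gaussian distribution beyond z."""
    return 1 - NormalDist().cdf(z)


p1, p2, n = 16 / 23, 19 / 23, 23
z_common = z_common_p(p1, p2, n)
z_separate = z_separate_p(p1, p2, n)

print(f"z (common p)   = {z_common:.3f}, one-sided P = {one_sided_p(z_common):.3f}")
print(f"z (separate p) = {z_separate:.3f}, one-sided P = {one_sided_p(z_separate):.3f}")
```

The two choices of variance give nearly the same z here, which is typical when the two proportions are not far apart.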
[...] Thus, for an observed value [...] it can represent an assigned probability of falsely rejecting the alternative hypothesis.

2. Illustration of calculation for $P_B$. a. Equal sample sizes. Suppose an investigator, comparing the rates of patient satisfaction with medical care at two hospitals, finds that the rate of satisfaction was 70% (16/23) at Hospital A and 83% (19/23) at Hospital B. The investigator concludes that the difference is not statistically 'significant' because chi-square [...] is too large [...] [Doing the statistical procedure, we would choose p = [...] = 76%. We [...] 1.08 and [...] .15.] After [...]

[...] can be determined for any [...] a standard error. [...] the error $= \sqrt{(n_2 p_1 q_1 + n_1 p_2 q_2)/(n_1 n_2)}$. For most practical purposes, we can assume that $\sqrt{Npq} = \sqrt{n_2 p_1 q_1 + n_1 p_2 q_2}$. The formula for finding beta error would then be $z_B = (\Delta - \delta)\sqrt{n_1 n_2/(Npq)}$.

3. The 'power curve' of a statistical test. If we want to operate a statistical test at a fixed level of α for making decisions, we can readily determine the values of β that will be associated for any choice of α.
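The trade-off between α and β at a fixed sample size can be tabulated with a few lines of code. This is a hedged sketch, assuming the relation $z_\beta = z_\Delta - z_\alpha$ with $z_\Delta = \Delta\sqrt{n}/\sqrt{v}$ and $v = 2p(1-p)$; the function names are hypothetical, and the input values (rates of 27% and 45%, with 80 members per group) are those of the chapter's own power example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_delta(p1, p2, n):
    """Fixed z value set by the true difference, the sample size, and v."""
    delta = p2 - p1
    p = (p1 + p2) / 2
    v = 2 * p * (1 - p)
    return delta * sqrt(n) / sqrt(v)


def power(p1, p2, n, alpha, two_sided=True):
    """Return 1 - beta for a test operated at the chosen alpha level."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = STD_NORMAL.inv_cdf(1 - tail)
    z_beta = z_delta(p1, p2, n) - z_alpha   # the two z values sum to z_delta
    return STD_NORMAL.cdf(z_beta)


power_05 = power(0.27, 0.45, 80, 0.05)
power_10 = power(0.27, 0.45, 80, 0.10)

# A crude 'power curve': power rises as alpha is made more liberal.
for alpha in (0.01, 0.05, 0.10, 0.20):
    print(f"two-sided alpha = {alpha:.2f}  power = {power(0.27, 0.45, 80, alpha):.3f}")
```

The results are close to the 65.9% and 76.7% quoted in the text; small differences arise because the text rounds v and the z values before subtracting.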
In this case, the chosen level of α will determine an assigned value of $z_\alpha$, which will be the location of point c in Fig. 3. Since we previously developed the formula that $z_B = z_\Delta - z_c$, we can substitute the assigned value of $z_\alpha$ for $z_c$ and get $z_\Delta = z_\alpha + z_\beta$. Since $z_\Delta$ has a fixed value that depends on the magnitudes of Δ, $n_1$, $n_2$, N, p, and q, the values of $z_\alpha$ and $z_\beta$ must sum to a constant. Consequently, if higher values are assigned to $z_\alpha$, $z_\beta$ must decrease; and vice versa. If one increases, the other decreases. This reciprocal relationship is analogous to what we have noted in a previous discussion (3) of sensitivity and specificity for a diagnostic test.

This reciprocal aspect of the equation allows various statistical tests to be illustrated with "power" curves, which show the values of $1 - P_\beta$ that will occur with different choices of α. For example, consider the situation where the true values of the compared rates are $p_1$ = 27% and $p_2$ = 45%, so that Δ = $p_2 - p_1$ = .18 and v = 2p(1 − p) = 0.46. If we take a sample size of 80 for each group, we will have $z_\Delta = \Delta\sqrt{n}/\sqrt{v} = (.18)\sqrt{80}/\sqrt{.46} = 2.37$. If we decide to reject the null hypothesis at a two-sided α level of .05, we would have $z_B$ = 2.37 − 1.96 = 0.41. The associated $P_B$ value would be .341 and the 'power' of the test would be 65.9%. If the null hypothesis were to be rejected at a two-sided α level of .1, $z_B$ = 2.37 − 1.645 = 0.73, and the associated $P_B$ value would be .233, giving the test the higher 'power' of 76.7%.

C. The calculation of a 'doubly significant' sample size

In the foregoing discussion, we worked on the assumption that the research was complete. We had our data; we had determined or assigned the level of $P_A$ or $P_\alpha$; and we wanted to know what $P_B$ might be. A different application of these concepts occurs for the modern calculation of a 'doubly significant' sample size, which means that we want a sample large enough to be significant at the levels of both α and β. In the previous calculations, we began with known data for everything except $z_B$ and we solved for $z_B$. Now we begin by knowing (or assuming) all the necessary information except n, and we solve the equation for n.

1. Simplified procedure. The simplest approach for these calculations is to take the previously developed formula and to substitute the assigned values of $z_\alpha$ and $z_\beta$ for their respective counterparts $z_c$ and $z_B$. We then would have $z_\Delta = z_\alpha + z_\beta$. If we square both sides and solve for n, we get

$$n = \frac{v(z_\alpha + z_\beta)^2}{\Delta^2}$$

For example, suppose we expect that $p_1$ = .50 and $p_2$ = .70 and that we want to attain statistical significance for a 2-sided α level of .05 and a one-sided β level of .05. What size should our sample be? We have assigned Δ = .20, $z_\alpha$ = 1.96, and $z_\beta$ = 1.645. From previous cited calculations, we know that v is either 0.48 or 0.46. Let us call it 0.47. Substituting directly into the equation, we get $n = (0.47)(1.96 + 1.645)^2/(.20)^2$ and n = 152.7. We would thus need 153 patients in each group, for a total sample size of 306 patients.

2. Stricter procedure. In a mathematically stricter set of calculations, we make provision for the fact that the value of $z_\alpha$ in a comparison is determined using the null hypothesis, whereas the value of $z_\beta$ depends on the alternative hypothesis. Thus, the point c in Fig. 3 will define an α level of

$$z_\alpha = \frac{c}{\sqrt{2p(1-p)/n}}$$

At this same point c,

$$z_\beta = \frac{\Delta - c}{\sqrt{[p_2(1-p_2) + p_1(1-p_1)]/n}}$$

If we solve the first of these two equations for c, and substitute the results into the second, we get

$$z_\beta = \frac{\Delta - \left(\sqrt{2p(1-p)/n}\right)(z_\alpha)}{\sqrt{[p_2(1-p_2) + p_1(1-p_1)]/n}}$$

This equation becomes

$$\frac{1}{\sqrt{n}}\left\{z_\beta\sqrt{p_2(1-p_2) + p_1(1-p_1)} + z_\alpha\sqrt{2p(1-p)}\right\} = \Delta$$

Squaring both sides and solving for n, we get

$$n = \frac{1}{\Delta^2}\left\{z_\alpha[2p(1-p)]^{1/2} + z_\beta[p_2(1-p_2) + p_1(1-p_1)]^{1/2}\right\}^2$$

This formula, which is less formidable than it looks, is the one that regularly appears in many statistical discussions (8, 11, 13) of sample size.
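Both routes to a 'doubly significant' sample size are easy to compute side by side. The sketch below is an illustration under the chapter's stated conditions ($p_1$ = .50, $p_2$ = .70, two-sided α = .05, one-sided β = .05, and v taken as 0.47 in the simplified formula); the function names are hypothetical, and the z values are obtained from `statistics.NormalDist` rather than from a printed table.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_for(level, two_sided):
    """Standard Gaussian deviate that cuts off the chosen tail area."""
    tail = level / 2 if two_sided else level
    return NormalDist().inv_cdf(1 - tail)


def n_simplified(delta, v, z_alpha, z_beta):
    """n = v (z_alpha + z_beta)^2 / delta^2, one value of v for both hypotheses."""
    return v * (z_alpha + z_beta) ** 2 / delta ** 2


def n_stricter(p1, p2, z_alpha, z_beta):
    """Use the null-hypothesis variance with z_alpha, the alternative with z_beta."""
    delta = p2 - p1
    p = (p1 + p2) / 2
    null_sd = sqrt(2 * p * (1 - p))
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (z_alpha * null_sd + z_beta * alt_sd) ** 2 / delta ** 2


z_alpha = z_for(0.05, two_sided=True)    # about 1.96
z_beta = z_for(0.05, two_sided=False)    # about 1.645

simple = n_simplified(0.20, 0.47, z_alpha, z_beta)
strict = n_stricter(0.50, 0.70, z_alpha, z_beta)

print(f"simplified: n = {simple:.1f} -> {ceil(simple)} per group")
print(f"stricter:   n = {strict:.1f} -> {ceil(strict)} per group")
```

Both calculations round up to the same group size, which is why the simplified formula is usually adequate when $p_1$ and $p_2$ are not extreme.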
In the numerical example just cited, the actual values are

$$n = \frac{1}{(.20)^2}\left\{1.96[2(.60)(.40)]^{1/2} + 1.645[(.70)(.30) + (.50)(.50)]^{1/2}\right\}^2 = \frac{1}{.04}\left\{1.96[.69] + 1.645[.68]\right\}^2 = \frac{(1.35 + 1.12)^2}{.04} = \frac{(2.47)^2}{.04} = 153$$

The result is identical to what we obtained with the previous set of simplified calculations.

3. Use of statistical tables. These computations can be avoided if we make use of appropriate sets of prepared tables. A particularly good collection of tables, showing sample size for different values of α, β, Δ, and $p_1$, is contained in Table A-3 (pages 176-194) of the excellent textbook by Fleiss (7). For example, if α (two-sided) is .05, and if β (one-sided) = .05, Fleiss' Table A-3 shows that if Δ is 5%, the size of n will range from 796 if $p_1$ = 5%, to 2669 if $p_1$ = 50%. If Δ is 20%, under the same conditions of α and β, the size of n will range from 99 if $p_1$ = 5% to 172 if $p_1$ = 50%. The values of n in Fleiss' tables are higher than those calculated in the preceding illustrations here because Fleiss' computations include a 'correction for continuity', analogous to that of the Yates' 'correction' in chi-square tests. Using a formula derived by Kramer and Greenhouse (9), the n values in our calculations can be converted to the n' values cited by Fleiss. The formula is

$$n' = \frac{n}{4}\left[1 + \sqrt{1 + \frac{8}{n\Delta}}\right]^2$$

4. Differences in sample-size methods. We can now note the difference between calculating a 'singly' and a 'doubly significant' sample size. In the classical old formula for a 'singly significant' sample,

$$n = \frac{z_\alpha^2 v}{\Delta^2}$$

For the 'doubly significant' sample,

$$n = \frac{(z_\alpha + z_\beta)^2 v}{\Delta^2}$$

The change to 'double significance' thus increases the sample size by a ratio of $[(z_\alpha + z_\beta)/z_\alpha]^2$. If we choose equal values for our levels of α and β, the ratio of z values will be $(1 + 1)^2 = 4$, and the 'doubly significant' sample will be 4 times as large as the first.

To illustrate this point, suppose we want to achieve a one-sided significance level of .05 in a clinical trial where the expected rates of success are 10% in the control group and 20% in the treated group. From these data, $p_2$ = .20, $p_1$ = .10, Δ = .10, and v is estimated as (.20)(.80) + (.10)(.90) = 0.25. [The alternative estimation for v would be (2)(.15)(.85) = 0.26.] For the 'one-sided' calculations of sample sizes at α level significance, $z_\alpha$ = 1.645 and we would have

$$n = \frac{(1.645)^2(.25)}{(.10)^2} = 67.65$$

and we would need 2 x 68 = 136 patients altogether. To calculate sample size for both α and β levels of significance, we would have

$$n = \frac{(z_\alpha + z_\beta)^2 v}{\Delta^2} = \frac{(1.645 + 1.645)^2(.25)}{(.10)^2} = 270.6$$

We would therefore need 2 x 271 = 542 patients, an amount that is about four times larger than before.

Two other important features to be noted about these formulas are the crucial roles of Δ and v. Since Δ is always a value between 0 and 1, the value of Δ² is always smaller than Δ. (For example, if Δ = .3, Δ² = .09.) Furthermore, the smaller the value of Δ, the smaller will be the value of Δ² and the larger will be the corresponding value of (1/Δ²) which is used as a factor in determining n. Thus, the smaller the difference for which we want to show 'statistical significance', the larger is the sample size that is required. In fact, if we want to prove the null hypothesis exactly, and to show that $p_1$ and $p_2$ are absolutely identical, we would need a sample of infinite size because Δ = 0.

Since v appears in the numerator of the factors that are multiplied to calculate n, the size of n will decrease as v decreases. The value of v, being dependent on 2p(1 − p), will be at a maximum when p = 50% and will take minimum values near the polar extremes of 0% or 100%.
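The fourfold jump from a 'singly' to a 'doubly significant' sample, and the further increase produced by a continuity correction, can be sketched as follows. The function names are hypothetical. The correction is written in the Kramer-Greenhouse form with the constant 8, which is an assumption recovered from the partly garbled formula in this chapter; it is checked here only against the figure of 172 quoted from Fleiss' table for Δ = 20% and $p_1$ = 50%.

```python
from math import sqrt
from statistics import NormalDist


def n_single(delta, v, z_alpha):
    """'Singly significant' sample size: provision for alpha error only."""
    return z_alpha ** 2 * v / delta ** 2


def n_double(delta, v, z_alpha, z_beta):
    """'Doubly significant' sample size: provision for both alpha and beta."""
    return (z_alpha + z_beta) ** 2 * v / delta ** 2


def continuity_corrected(n, delta):
    """Convert an uncorrected n to n' (assumed Kramer-Greenhouse form)."""
    return (n / 4) * (1 + sqrt(1 + 8 / (n * abs(delta)))) ** 2


z = NormalDist().inv_cdf(0.95)          # one-sided .05, about 1.645
v = 0.20 * 0.80 + 0.10 * 0.90           # 0.25, as estimated in the text

single = n_single(0.10, v, z)
double = n_double(0.10, v, z, z)
corrected = continuity_corrected(153, 0.20)

print(f"singly significant n = {single:.1f}")
print(f"doubly significant n = {double:.1f} (ratio {double / single:.1f})")
print(f"153 per group with continuity correction -> {corrected:.1f}")
```

Because n varies with 1/Δ², halving the difference to be detected quadruples the required sample, independently of the α and β choices.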
Thus, if p is close to 0% or to 100%, v will be small and n will be correspondingly small. On the other hand, when p is very close to a polar extreme, a large value for Δ may be extremely difficult or unfeasible to obtain. The advantage of a very high or very low value for p may thus be completely obliterated by the associated disadvantages of a very small value for Δ.

D. The importance of β error

Because most scientific research is usually directed at showing that two entities are different, the values of $P_B$ and β error are generally omitted, either because the investigator is unaware of their existence or because he is not concerned about them. The absence of attention to the possibility of β error is equivalent to setting the value of $z_\beta$ = 0. As noted earlier, this is the value that is used in statistical tests that provide values only for $P_A$. For the value of $z_\beta$ = 0, the one-sided value of $P_\beta$ is .5; and the 'power' of the test is $1 - P_\beta$ = 50%. In other words, the investigator takes a 50-50 chance of committing the false negative error of incorrectly rejecting the alternative hypothesis. Most investigators accept this risk with equanimity, since their main concern in the customary situation of 'significance' testing is with a false positive conclusion. There are at least two major scientific circumstances, however, in which the role of β error becomes particularly important.

The first of these circumstances is a clinical trial in which we want to be sure of having a satisfactory chance of detecting a substantial Δ when it exists. If we fail to find 'statistical significance' at the α level, we might like to be reasonably confident about accepting the null hypothesis. This strategy is responsible for sample-size calculations that culminate in such phrases as "a 90% chance of finding a 20% difference at the .05 level". In this phrase, the associated values are Δ = .20, α = .05, and (one-sided) β = .1.

The second (and perhaps more important) role of β error is in the situation where we want to show that two groups are similar rather than different. An example of such a situation is a clinical trial (12-15) whose conclusion was that the quality of primary care provided by nurse practitioners is essentially equal to what is offered by physicians. Another example, which is an increasingly common situation in clinical pharmacology, occurs for tests of the bioequivalence of two pharmaceutical preparations. In such circumstances, we assign a value of Δ as the maximum permissible difference between the groups. If the observed difference is smaller than Δ, we shall conclude that the groups are essentially equivalent.

The values chosen for Δ and $p_1$ will be as important as the choices of α and β in determining sample size. The high values of n that can emerge from these calculations will be a major problem in routine studies of bioavailability. In an example cited earlier, for α and β both equal to .05 and Δ = .20, the size of n could range from 84 to 143, according to the values of $p_1$. Since sample sizes of this magnitude will usually be unfeasible, the values of α and β may have to be made quite liberal. Thus, if α (two-sided) is increased to 0.2 and if β (one-sided) is increased to .15, the size of n will vary as follows:

Δ       5%     5%     20%    20%
p1      5%     50%    5%     50%
n       254    728    38     57

These sample sizes, although smaller than before, are still substantially larger than the 6 to 10 patients whose data have been customarily examined for studies of bioavailability. If the bioavailability research is conducted in a "cross-over" manner, with one group of patients rather than two groups, the paired arrangement of data will permit a further reduction in sample size. Nevertheless, if strict statistical standards become demanded for studies of bioequivalence, the problems of obtaining ample numbers of people for the statistical tests may be so formidable that the studies will be impossible to conduct. Just as the old calculations of sample size for α error alone made no provision for β error, the new calculations of sample size for β error alone may have to be done without consideration of α error.

E. Caveats and abuses

A knowledge of β reasoning and the 'other side' of 'significance' can lead to prompt detection of a classical abuse in the way that statistical tests are often reported in medical literature. The β reasoning has also been applied to create new problems in the calculation of sample size.
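The remark that ignoring β error amounts to setting $z_\beta$ = 0, and therefore to accepting a 50-50 chance of a false negative result, can be verified numerically. The sketch below is an illustration with hypothetical names, reusing the trial values from the earlier example (Δ = .10, v = 0.25, one-sided α = .05): a sample sized for α alone has a 'power' of 50% when the true difference equals Δ, whereas the 'doubly significant' sample with β = .05 has a 'power' of 95%.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_at(n, delta, v, z_alpha):
    """Power = area of the alternative distribution beyond the rejection point."""
    z_delta = delta * sqrt(n / v)
    return STD_NORMAL.cdf(z_delta - z_alpha)


delta, v = 0.10, 0.25
z_alpha = STD_NORMAL.inv_cdf(0.95)       # one-sided alpha = .05
z_beta = STD_NORMAL.inv_cdf(0.95)        # one-sided beta = .05

n_alpha_only = z_alpha ** 2 * v / delta ** 2
n_alpha_beta = (z_alpha + z_beta) ** 2 * v / delta ** 2

power_single = power_at(n_alpha_only, delta, v, z_alpha)
power_double = power_at(n_alpha_beta, delta, v, z_alpha)

print(f"n for alpha alone    = {n_alpha_only:6.1f}  power = {power_single:.2f}")
print(f"n for alpha and beta = {n_alpha_beta:6.1f}  power = {power_double:.2f}")
```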
1. Conclusions when the null hypothesis is conceded. In a routine statistical test of 'significance', what conclusion do we draw if we have found that the $P_A$ value is higher than α? The correct answer to this question is that such a high P value makes us concede, i.e., fail to reject, the null hypothesis. With this concession, we conclude that the observed difference is statistically not significant. The wrong answer to the question is that we accept the null hypothesis and conclude that the difference is insignificant.

The distinctions between concede and accept and between not significant and insignificant can be clinically illustrated by recalling the purpose of the sputum Pap smear as a diagnostic test. We order the test in search of a positive diagnosis of lung cancer. If the test is negative, however, we cannot conclude that lung cancer has been ruled out. We would merely concede that we have failed to show its presence. To accept the negative diagnosis that lung cancer is absent, i.e., to rule it out, we would want to check results from additional tests, such as the chest X-ray.

Consequently, in a simple test of 'statistical significance', a high P value is like the Scottish verdict of not proved. When $P_A$ exceeds α, we neither reject nor accept the null hypothesis. We concede it, or fail to reject it. Our conclusion must therefore be that the difference is not significant, rather than insignificant. To conclude that an observed difference is insignificant, we would have to accept the null hypothesis, a decision that would require additional evidence for the possibility of β error.

The previous example of satisfaction with care at two hospitals provided an illustration of the erroneous conclusions that can occur when $P_A$ > α. The investigator wanted to claim that the satisfaction was similar at the two hospitals, but we would not accept his claim because it had a 'power' of only 71%. In fact, if our original sample size was quadrupled, and if the proportion of successes remained the same, the resulting numbers would be 64/92 vs. 76/92 and the difference would be statistically significant at P < .05 even though the observed δ was only 13%.

Even when the two contrasted groups are quite similar, we still cannot conclude that their difference is insignificant. For example, suppose a comparison of two treatments shows a success rate of 3/7 (43%) for treatment A and 4/9 (44%) for treatment B. This result seems unimpressive, but it could readily arise by chance if the true success values for treatments A and B were, respectively, 29% and 56%. Thus, if we exchanged one success and one failure in the patients comprising groups A and B, we would get success rates of 2/7 (29%) for A and 5/9 (56%) for B. This difference is impressive although not statistically significant.

One of the main abuses of tests of statistical significance occurs, therefore, when an investigator who gets a high $P_A$ value, i.e., $P_A$ > α, concludes that the observed difference between two groups is 'insignificant' and that the two groups should be regarded as similar. If this type of reasoning were correct, we could always 'prove' that two treatments were identical, merely by using a small sample size for the study. Thus, if we put 3 patients in each group, a result as extreme as 0/3 (0%) successes for treatment A vs. 3/3 (100%) for treatment B could still not achieve 'statistical significance'. (The two-sided P value is .1.) From this failure to attain 'statistical significance', it would be absurd to conclude that the observed difference is insignificant and that the two treatments are equivalent. Nevertheless, such errors regularly appear in medical literature.

An analogous problem occurs when statistical tests are done to determine whether the act of randomization provided an equitable distribution of the patient groups before treatment began in a clinical trial. When good grounds exist for suspecting baseline inequalities, a high $P_A$ value cannot alone be accepted as confirmation of their absence. The analysis is incomplete unless attention is also given to $P_B$. (A memorable example of such omissions occurred in analyses (17) of the celebrated UGDP study of diabetes. When statistically significant differences were not found in certain analyses of baseline distinctions, the data analysts concluded that the baseline differences were insignificant, although no levels of β-error were cited.)

The point to be borne in mind is that an ordinary test of 'statistical significance' can be used only to reject the null hypothesis, not to accept it.
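Two of the numerical claims in this discussion can be reproduced with short calculations: the two-sided Fisher exact P value of .1 for 0/3 vs. 3/3, and the change from 'not significant' to 'significant' when the hospital data are quadrupled from 16/23 vs. 19/23 to 64/92 vs. 76/92. The sketch below is an illustration only; the function names are hypothetical, the two-sided Fisher value is computed by the common convention of summing all tables no more probable than the observed one, and the comparison of proportions uses the equal-sample z formula given earlier in the chapter.

```python
from math import comb, sqrt


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact P for the 2 x 2 table [[a, b], [c, d]].

    a, b = successes and failures in group A; c, d = the same in group B.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x successes in group A, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(low, high + 1) if prob(x) <= observed + 1e-12)


def z_equal_groups(successes_1, successes_2, n):
    """Critical ratio for two proportions from groups of equal size n."""
    p1, p2 = successes_1 / n, successes_2 / n
    p = (p1 + p2) / 2
    return (p2 - p1) / sqrt(2 * p * (1 - p) / n)


p_tiny = fisher_two_sided(0, 3, 3, 0)
z_23 = z_equal_groups(16, 19, 23)
z_92 = z_equal_groups(64, 76, 92)

print(f"0/3 vs 3/3: two-sided Fisher P = {p_tiny:.2f}")
print(f"16/23 vs 19/23: z = {z_23:.2f} (below 1.96)")
print(f"64/92 vs 76/92: z = {z_92:.2f} (above 1.96)")
```

The same observed difference of 13% is 'not significant' with 23 patients per group and 'significant' with 92, which is why a high $P_A$ value from a small study cannot be read as evidence of equivalence.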
Mathematical mystiques and
either
test
shows
difference.
ificant' 1
nificant
difference.
we would need
sion,
To draw know
\alue for the possibility oi
Problems
2.
the latter conclu-
the other kind ot
(3
P
sample
size.
With the increasing performance oi controlled clinical trials, man) alternative strategies have been proposed for determining sample si/c So
main
different proposals
seems
si/e
have been made,
to
have become
a
sport of statistical theoreticians strategies include
techniques
'play-the-w inner'
art in
The
Like main other with
wisdom"
mathematical
a pre-
tactics
of
and investigators from basic chal-
lenges that are the reall) fundamental issues in scientific research. In order to calculate a
we
ple size,
sume the
have been taken care
Neyman-Pearsonian.
strategies
sam-
often ignore these issues and as-
that the)
have yielded
a
tered by
its
Bayesian.
number
negotiations
its
choosing sample size
no
exist
procedures for either multivariate
in a
manner
or preparing a clinicall) effective composite of
important multiple variables into a single univariate index.
hard data
The Units on
h.
thing
in the
sample
Since every-
.
si/c calculations
depends on
is
an item oi 'hard data", such as death. Since
changes
in
death rates are usually smaller than
changes
the
can occur
that
in
important 'soft
on hard
data' variables, the result of the focus is
small value of A,
to create a relatively
which may lead
to excessively large values for
A more
the calculated sample size.
important
consequence of the hard-data focus
is
that
an
important soft-data variable, such as vascular
complications or quality of
may become
life,
ignored in the early stages of biostatistical planning for the
trial
and may remain ignored (or
attention to 'soft data', the most important clini-
other
sample
may
number
magnitude ('how the devil can about
currently
there
biostatisticaJ
or
in the
the precision of the
As
Nevertheless,
poorly managed) thereafter. Because of this in-
('you will need exactly 984 patients') or flusI
number and reduction become the
possibly get so many?').
ment.
satisfactor)
of. After
size calculations, the clinical investigator
become awed by
are involved in a patient's responses to treat-
data
determining sample si/e ma) often distract both statisticians
calls
cians usually want to be sure that this endpoint
statistical activities,
the
of good clinical investigation, which
tenet
on
contrary to every
is
Schneiderman" summary oi the
sented here has been based on the currently
occupation
outcome
the endpoint noted in a single variable, statisti-
sonic of the statistical ideolo-
accepted "conventional
"end-
will be the
various
For practical purposes, the material pre-
gies
whose outcome
the research. This concentration
in
onl) one kind oi
alternative
and
conjectures,
has provided a well-written
oi the
indoor
favorite
schemata based on sequential
Bayesian
analysis,
state
in
contriving new ways to gauge sample
fact, that
single variable
point"
tor an appraisal of the multitude of variables that
error.
calculating
in
a 'sig-
cannot show an 'insig-
Ii
to
show
or docs not
statistical strategies
this
cal
and human aspects of therapy
—
the associ-
ated risks, benefits, costs, joys, and sorrows of
—
treatment
become
often
the research
grossly neglected in
16 .
Bec. The current uninformed choice of p cause good data are seldom available for 'histor.
x
ical controls', the
choice of p, (as an estimate
focus of attention, the clinician and statistician
of the outcome rate for the control group) be-
may
comes an
forget that the basic scientific problems
remain unresolved.
Among them
are the fol-
guesswork
that often turns out
If the error leads
overestimate of sample size, the
lowing: a.
act of
to be erroneous.
The univariate choice of an endpoint. To and A. we must choose a
determine p,. p 2
.
huge
to a
trial
becomes
excessively expensive. d.
The future uninformed choice of p1. The estimate of a single value of p1 has no real clinical precision. What is usually needed, instead, is a series of p1 values, one for each of the cogent clinical strata of patients subjected to therapy. If the data of large-scale randomized therapeutic trials are not analyzed with a cogent clinical stratification, however, the results of a current trial cannot provide a good estimate of p1 values for use in future trials. Today's expensive, unproductive therapeutic trial may thus be followed by tomorrow's.

e. The arbitrary choices of α and β. Despite the elaborate reasoning that has been discussed for choosing α and β, their values are seldom selected in the abstract intellectual manner described here. What often happens is that the statistician and investigator decide on the size of Δ. The magnitude of the sample is then chosen to fit the two requirements (1) that the selected number of patients can actually be obtained for the trial and (2) that their recruitment and investigation can be funded. The values of α and β are then adjusted to fit this number and a suitable mathematical rationale is then developed for presentation to the granting agency.

f. The arbitrary choice of Δ. As the magnitude of difference that indicates 'clinical significance', the Δ is often the most crucial issue in both planning and evaluating the research. Despite this importance, the scope of Δ is not only constrained by the univariate restrictions noted earlier, but it also gets chosen arbitrarily. Judgments about the proper size of Δ have received almost no concentrated attention via symposia, workshops, or other conclaves of experts assembled to adjudicate matters of clinical importance. In the absence of established standards, the clinical investigator, on being badgered by the statistician to choose a Δ so that sample size can be calculated, picks what seems like a reasonable value. This value gets tossed into the formula, using zα, zβ, etc. If the sample size that emerges is unfeasible, Δ is adjusted accordingly, and so are α and β, until n comes out right.

In some brave new world of the future, when clinicians begin to insist that large-scale therapeutic trials be truly clinical investigations as well as elaborate exercises in mathematics, better solutions may be developed for these clinical and scientific problems. In the meantime, clinical investigators can take comfort in knowing about the panacea-like marvels offered by modern statistical methods for determining α-error, β-error, and sample size. Even if we don't know what we're doing and even if we can't specify it, repeat it, or make good clinical sense out of it, we can still calculate the required populational numbers and determine the probabilistic directions. Not since the days of alchemy have scientists been able to rely on such dazzling logical transmutations.

*For my education in these concepts and for other helpful comments on this text, I am indebted to several clinical and statistical colleagues: Donald Archibald, Robert Deupree, Michael Gent, Walter Spitzer, and Carolyn Wells. Their aid is gratefully acknowledged, while they are also absolved of responsibility for the contents.

References

1. Cornfield, J.: The University Group Diabetes Program. A further statistical analysis of the mortality findings, J. A. M. A. 217:1676-1687, 1971.
2. Feinstein, A. R.: Clinical biostatistics. The purposes of prognostic stratification, Clin. Pharmacol. Ther. 13:285-297, 1972.
3. Feinstein, A. R.: Clinical biostatistics. XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (Part 2), Clin. Pharmacol. Ther. 14:898-915, 1973.
4. Feinstein, A. R.: Clinical biostatistics. XXXI. On the sensitivity, specificity, and discrimination of diagnostic tests, Clin. Pharmacol. Ther. 17:104-116, 1975.
5. Feinstein, A. R.: Clinical biostatistics. XXXII. Biologic dependency, 'hypothesis testing', unilateral probabilities, and other issues in scientific direction vs. statistical duplexity, Clin. Pharmacol. Ther. 17:499-513, 1975.
6. Feinstein, A. R., and Ramshaw, W. A.: A procedure for rapid mental calculation of the fourfold chi-square test, J. Chron. Dis. 25:551-553, 1972.
7. Fleiss, J. L.: Statistical methods for rates and proportions, New York, 1973, John Wiley & Sons, Inc.
8. Halperin, M., Rogot, E., Gurian, J., and Ederer, F.: Sample sizes for medical trials with special reference to long-term therapy, J. Chron. Dis. 21:13-24, 1968.
9. Kramer, M., and Greenhouse, S. W.: Determination of sample size and selection of cases, in Cole, J. O., and Gerard, R. W., editors: Psychopharmacology: Problems in evaluation, National Academy of Sciences, National Research Council, 1959, pp. 356-371.
10. Neyman, J., and Pearson, E. S.: On the use and interpretation of certain test criteria for the purposes of statistical inference, Biometrika 20A:175 and 263, 1928.
11. Pasternack, B. S.: Sample sizes for clinical trials designed for patient accrual by cohorts, J. Chron. Dis. 25:673-681, 1972.
12. Sackett, D. L., Spitzer, W. O., Gent, M., and Roberts, R. S.: The Burlington randomized trial of the nurse practitioner: Health outcomes of patients, Ann. Intern. Med. 80:137-142, 1974.
13. Schlesselman, J. J.: Planning a longitudinal study. I. Sample size determination, J. Chron. Dis. 26:535-560, 1973.
14. Schneiderman, M. A.: The proper size of a clinical trial: 'Grandma's strudel' method, J. New Drugs 4:3-11, 1964.
15. Spitzer, W. O., Feinstein, A. R., and Sackett, D. L.: What is a health care trial? J. A. M. A. 233:161-163, 1975.
16. Spitzer, W. O., Sackett, D. L., Sibley, J. C., Roberts, R. S., Gent, M., Kergin, D. J., Hackett, B. C., and Olynich, A.: The Burlington randomized trial of the nurse practitioner, N. Engl. J. Med. 290:251-256, 1974.
17. University Group Diabetes Program. A study of the effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. I. Design, methods and baseline results; II. Mortality results, Diabetes 19 (Suppl. 2):747-830, 1970.
CHAPTER 23
Problems in the summary and display of statistical data
After completing a research project, an investigator encounters several different challenges in the management of data. One of these challenges is in organizing and analyzing the information; another is in drawing conclusions; a third is in communicating the results.

The organizational activities consist of choosing the variables to be analyzed; preparing suitable arrangements of the data for each variable; and summarizing the data in a way that allows appropriate discernment of relations and contrasts.

The formation of conclusions occurs in two different steps. From knowledge of the scientific background and architecture of the research, the investigator first notes the magnitude of the observed relations and contrasts, and uses scientific judgment to make decisions about their substantive importance. From the observed magnitudes and from the size of the examined groups, the investigator then uses mathematical inference to make decisions about "statistical significance". Thus, an investigator might find, in one study, a difference of 30% in comparing the percentage improvement with two treatments in 10 people; or, in another study, a correlation coefficient of .05 for the relationship of two variables in 5,000 people. Using both scientific and mathematical methods of decision-making, the investigator might conclude that the therapeutic difference was clinically important although statistically "not significant", whereas the cited correlation was statistically "significant" but clinically trivial.

The communication of results is a challenge that occurs after the others have been completed. To meet this challenge, the investigator must provide scientific colleagues with a clear account of what was done, what was found, and what was concluded. Unless this last challenge is suitably managed, the previous activities will have an unsatisfactory outcome. The research will not be reported in a manner that makes it comprehensible, appraisable, and usable by the scientific community.

The tools of statistics play diverse roles in these activities. The basic architecture⁷ of the research and the choice of important variables depend on scientific rather than statistical decisions, but all the other procedures involve statistical methods. The techniques of descriptive statistics provide expressions for the numbers that are used to summarize data, to indicate relationships, and to show contrasts. To summarize data for individual variables, descriptive statistics offers such numerical expressions as means, medians, proportions, standard deviations, ranges, and percentiles. To show the relationships among variables, the descriptive statistical methods include two-way tables and graphs, correlation coefficients, and regression equations. To indicate contrasts, descriptive statistics supplies such expressions as increments, decrements, ratios, and proportionate differences.

This chapter originally appeared as "Clinical biostatistics. XXXVII. Demeaned errors, confidence games, nonplussed minuses, inefficient coefficients, and other statistical disruptions of scientific communication." In Clin. Pharmacol. Ther. 20:617, 1976.
The techniques of inferential statistics provide the mathematical methods for drawing probabilistic conclusions about the results cited in the descriptive expressions. The mathematics of inferential statistics produces such probabilistic statements as standard errors, confidence intervals, and the various tests that yield P values for correlation or regression coefficients and for the differences found in contrasts of means, proportions, or other descriptive summaries.

The descriptive results of statistical procedures are obviously basic necessities for communicating scientific research. None of the inferential maneuvers can be applied until the descriptive processes are completed, and none of the inferential maneuvers have any substantive meaning or scientific acceptability unless the descriptive data are available for evaluation. Nevertheless, in the literature of modern biomedical science, many investigators and editors have become so infatuated with inferential or analytic statistics that the fundamental role of descriptive statistics has become neglected or perverted. The neglect occurs when investigators and editors make important decisions solely on the basis of P values or other inferential calculations, ignoring the magnitude of the contrasts and relationships from which the statistical calculations are derived. The perversion occurs when descriptive expressions for the contrasts and relationships are eliminated and replaced by the inferential statistical statements.

1. The demeaned error

The "standard error" has become a popular method of reporting results, although most of the investigators using this term do not know its definition, source, or connotations. If you doubt the foregoing remark, I invite you to take the following test. Choose an intelligent, reasonable "layperson" who knows nothing about statistics, but who understands the arithmetic average or mean, and who has enough common sense to question any statements you make that seem inconsistent or foolish. Now try to explain to that person exactly what is meant by the standard error of the mean and why you, as an investigator, use it in your work.

If the explanation is satisfactory to you, the other person should be looking baffled and seeking a clearer account. If the explanation is satisfactory to that other person, you should now be feeling quite uncomfortable. Your friend will be asking you why scientists are engaging in abstract fantasies and why the flights of fantasy are described in such peculiar words as standard error.

The standard error has nothing to do with standards, with errors, or with the communication of scientific data. It is an abstract concept spawned by the imaginary world of statistical inference and pertinent only when the operations of that imaginary world are met in scientific reality.

a. The idea of error. If you decide that you want to explain the phrase standard error of the mean, the first critical idea, the word with which to begin, is the idea of error. This idea happens to have a certain realistic origin. It arose in reference to a practical problem that today would be called "observer variability". Suppose the same entity or substance has been measured several times and suppose that the different measurements do not all agree. For example, if the chemistry laboratory at your institution is particularly fastidious and conducts its tests in quadruplicate, the lab may get such values as 249, 250, 247, and 258 mg/dl as measurements of cholesterol concentration in the same specimen of serum. Which one of these measurements should be issued by the lab as the formal, correct value?

The answer to this question involves some fundamental philosophic issues in the statistics of mensuration. Should we average all the values together and take the mean? If so, the "correct" result is 251. Should we take the average of the three values that remain after discarding the value of 258 because it seems to be an outlier, far away from the others? If so, the correct result is 248.7. Should we take an average based only on the two closest values? If so, the correct result is 249.5. The decision about which and how many values to include is beyond the scope of this discussion. For the moment, the point to be noted is that regardless of how the candidate values are chosen for inclusion, each of the "correct" results emerged from calculating the mean of the candidates.
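The three candidate "correct" values in the cholesterol example can be checked with a few lines of arithmetic. The Python sketch below is an added illustration, not part of the original essay; the variable names are arbitrary, and the choice of which readings to keep is written out by hand exactly as the text describes it.

```python
from statistics import mean

values = [249, 250, 247, 258]             # mg/dl, the four readings quoted in the text

all_four = mean(values)                   # every measurement included
without_outlier = mean([249, 250, 247])   # 258 discarded as an outlier
two_closest = mean([249, 250])            # only the two closest values

print(all_four, round(without_outlier, 1), two_closest)
```

Whichever subset is chosen, the reported value is a mean of the retained readings; only the selection rule differs.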
The idea of using the mean is justified by its reasonableness, since there seems to be no better routine way of solving the problem, and by tradition, since the mean is an old statistical tradition sanctified by more than a century of scientific usage.

Once we have chosen the mean as the right result, we can note the differences between the mean and the actual values that are observed in the measurements. If we call these differences deviations, the name is straightforward and non-pejorative. If the mean, however, has been endowed with the sublime virtues of truth, correctness, and propriety, the deviations can be suitably castigated for their failings. They can be called errors.

This use of the word error, of course, is itself untrue, incorrect, and improper. To call a deviation an error implies that there is something wrong with the measurement, that the equipment or the observer (or both) may not have been functioning accurately. In fact, however, if we had some unequivocal method for determining the accurate value of the measurement, we might have found that it was any one of the "deviant" results. The deviation may thus have had nothing wrong with it and its designation as an error made it a victim of a scientifically bizarre "morality" in nomenclature. Nevertheless, this demeaning use of error has persisted in statistical vocabulary, being embellished in such additional maledictions as error variance [or residual error] for the sum of the squared deviations around a fitted mean or for the deviations of observed measurements from a fitted line.

For many other statistical circumstances, however, the word error has been displaced from this unsatisfactory usage; and deviations from the mean are actually called deviations. At least two reasons can account for the displacement. The first reason is that the word error is obviously foolish when it is applied to deviations from the mean, not in different measurements of a single specimen, but in single measurements of different specimens. After all, if we determined the serum cholesterol of each member of a group of people, calculated the mean cholesterol for the group, and then found each person's deviation from the mean, we would be unacceptably silly if we referred to those deviations as errors. This approach may seem silly today, but it was a basic principle of the reasoning used by Quetelet a century ago to lay the foundations of populational statistics. In transferring deviations and variance from being a property of individual measurements to being a property of individual people, Quetelet also adopted the idea that the mean was the ideal. The statistical glorification of mediocrity in people was perpetuated when Galton, Pearson, and other founders of the British school of biometry accepted both the conceptual transfer of variance and the associated nomenclature of error.

Although modern recognition of this folly should have been the main reason for evicting the word error from its former role in describing deviations, the second reason is probably more cogent. The word error was needed (as we shall see later) for a different job, describing a different type of deviation, occurring in a different type of modern world: the abstract imagery of statistical inference.

b. The idea of 'standard'. Once we have decided to use the mean as either the correct result or the focal point of a series of n measurements, we want to get an idea of the dispersion of values around that mean. The most obvious approach is to find the average dispersion or average deviation of the values. To take the sum of the deviations is a futile task, however, since they will always add up to zero, by virtue of the way they were defined. [If you wonder about this statement, recall that the mean is defined as $\bar{x} = \sum x_i / n$, where $x_i$ is any one of the n measurements. The sum of the deviations is then $\sum (x_i - \bar{x})$, which becomes algebraically expanded to $\sum x_i - \sum \bar{x}$, which is $n\bar{x} - n\bar{x} = 0$.]

To get around this problem, we could take the sum of the absolute magnitudes of the deviations, regardless of whether they are positive or negative. This sum, when divided by n, would give a value, the average absolute deviation, that is clear and reasonable. Unfortunately, because absolute values are a nuisance to calculate and work with, they are mathematically unappealing.
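The two statements just made, that deviations from the mean always sum to zero and that their absolute magnitudes give a usable average, can be verified numerically. The Python sketch below is an added illustration that reuses the four cholesterol readings quoted earlier in this chapter as sample data; the variable names are arbitrary.

```python
values = [249, 250, 247, 258]          # cholesterol readings (mg/dl) quoted earlier
n = len(values)
x_bar = sum(values) / n                # the mean

# Signed deviations from the mean cancel one another
deviations = [x - x_bar for x in values]
sum_of_deviations = sum(deviations)

# Absolute deviations do not cancel, so their average is informative
avg_abs_deviation = sum(abs(d) for d in deviations) / n

print(deviations, sum_of_deviations, avg_abs_deviation)
```

The signed deviations cancel exactly, whereas the average absolute deviation gives a single number in the original units of measurement.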
The next option is the one that has swept the field. To avoid the negative signs, we can square all the deviations. Preparing for getting the average, we now add those squared deviations together, as $\sum (x_i - \bar{x})^2$. This expression, which is the sum of the squared deviations, happens to be very useful for other statistical manipulations and so it often gets a nickname, sum of the squares, and is classically symbolized by a special symbol, $S_{xx}$. (My own preference is to call this expression the group variance, but I shall adhere to the conventional nomenclature in this essay.)

If we now divide the sum of the squared deviations by n, which is the number of deviations (or observations), we get the mean squared deviation, which is also called the variance. By taking the square root of the variance, we get what is accurately called a root mean squared deviation. To shorten the phrase, the term might be called the average square deviation. The phrase that is actually used, however, is standard deviation.

I have been unable to find an answer for the question of why a distinctive scientific word like standard was pressed into this peculiar mathematical service. (According to F. N. David, the phrase standard deviation was first used by Karl Pearson.) Many other two-word alternatives were available, including adjusted deviation and adapted deviation. By suitable re-definition of the word dispersion, the root mean squared deviation could have been called the dispersion or the mean dispersion. The word variance might even have been reserved for this purpose, so that what is now called the variance or the squared standard deviation could have been called the squared variance.

With complete disregard for the important scientific roles of the words standards and standardized, however, the idea of standard was seized and joined to deviation, where it has remained in its status as a statistically fused, scientifically malformed neologism. What has made the malformation so acceptable and the phrase standard deviation so popular, of course, is that the idea (not the name) is an excellent way of communicating results. If the data have a Gaussian distribution, the standard deviation serves as a splendid summary of the dispersion of data around the mean. We can also describe the dispersion in a single expression by dividing the standard deviation by the mean to get the coefficient of variation. Furthermore, in Gaussian circumstances, we know that 95% of the data will be contained in a zone spanned on either side of the mean by two (actually 1.96) standard deviations. Thus, for many years, the most popular way of communicating a summary for a set of univariate data has been to cite the results in the form mean ± standard deviation, usually symbolized as $\bar{x} \pm s$. [There are good reasons, as discussed earlier in this series, for rejecting this fashion, replacing it with medians and percentile ranges; but this essay is concerned with the sins of the standard error, not the standard deviation.]

c. The entrance and emergence of indirect inference. The events just noted were routine procedures of the descriptive statistics that evolved in the days when scientific investigators were concerned with demonstrating what they had found. These statistical activities required no knowledge of mathematics (beyond the ability to do arithmetic) and no mental flights to any probabilistic aeries. When investigators performed comparisons, however, a role became available for statistical inference. The investigator's comparison might involve a direct contrast of two means or two proportions; or an indirect contrast of a greater-than-zero value for a correlation coefficient or regression coefficient against an assumed null value of 0. After deciding, from scientific judgment, that the contrasted results were substantively impressive, the investigator would then want to determine whether the observed groups were large enough for the results to be more than a chance occurrence.

The need to make this probabilistic decision brought statistical inference into scientific research. The two different mathematical ways to do the probabilistic analysis were described previously in this series.⁹ One way was simple, straightforward, and easily comprehensible. It relied on permutations of the observed data to provide a specific distribution of alternative possibilities for arranging the results, showing
exact P values for each alternative. The other way was complex, convoluted, and hard to understand. It relied on a gerrymandered transformation of the parametric theories developed for inferential estimation from single samples; an acceptance of numerous assumptions that could never be verified; and a willingness to think in the abstract domains of hypothetical acts of sampling, infinite populations, mathematically perfect distributions, pooled variances, and approximate P values.

The first way was what clinical scientists wanted; the second was what statisticians offered. In the era before computers, the first way required difficult and sometimes formidable calculations; the second way, once one learned the rules of the game, was easily carried out with a desk (or hand-held) calculator. Not because of any logical or scientific desirability, but merely because of calculational convenience, the second way has thus far been triumphant. In the foreseeable future, when digital computers have become cheap and easy to use and when hand-held, battery-powered computer terminals have become ubiquitous, the scientifically desirable, direct methods of statistical inference may become the conventional procedure for performing probabilistic contrasts. For the immediately foreseeable future, however, investigators must deal with the indirect forms of probabilistic reasoning, with statistical consultants who prefer that form of reasoning (because of its comfort, its familiarity, and perhaps its mystery), and with the adverse consequences that the indirect forms of statistical inference have had on scientific communication.

d. The confusion between estimations and contrasts. To be applied for evaluating comparisons, the indirect methods of statistical inference had to be altered from their original purposes. The original goal of the indirect inferential strategy was parametric estimation, not probabilistic contrast. The inferential tactics were initially intended for political poll-takers, market research analysts, and other people who attempt to estimate the "parameters" of a population by inferences drawn from the values found in a single random sample of that population. Thus, if we take a random sample of 150 potential voters in Connecticut, find that 81 of them say they will vote Republican, and thereby conclude that the Connecticut vote in favor of Republicans will be 54% in the next election, we perform a parametric estimation. If we find that the Republican preference is 44% among 150 clinical epidemiologists and 56% among 150 molecular biologists and if we thereby conclude that molecular biologists are statistically significantly more Republican than clinical epidemiologists, we perform a probabilistic contrast.

For parametric estimation, indirect inference is a necessity, not a calculational luxury. In an estimation procedure, the investigator has only one sample. There is no second group available with which to form the permutations and other arrangements used in direct statistical inference for contrasts. With only one sample available, however, the estimating investigator must engage in educated guesswork (called assumptions) about the distribution and other characteristics of a hypothetical parent population; and must accept the other theoretical components of the process that leads from observed values in a sample to estimated values for its frame (or population).

The distinction between a parametric estimation from a single random sample and a probabilistic contrast for the results of two groups does not appear to be clearly understood by many statisticians and scientists. The consequent confusion may well be responsible for many of the major problems that statistical analysis has brought to scientific communication in research. An investigator performing a probabilistic contrast of two groups has no interest in parametric estimation, and uses the indirect principles of parametric inference only because he doesn't know anything else to do, because he is computationally constrained, or because he has received misleading advice. Conversely, an investigator performing a parametric estimation from a random sample cannot apply direct principles of inference and uses the indirect methods because they are the best (and the only) tactics at his disposal.
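The difference between the two activities can be made concrete with the voting figures quoted above. The Python sketch below is an added illustration, not the author's own procedure. It assumes the usual Gaussian-theory standard error of a proportion for the estimation, and, for the contrast, a direct enumeration of every way the 150 Republican preferences could have been divided between the two groups (the exact hypergeometric arrangement of a fourfold table). The function name `exact_two_sided_p` and the variable names are hypothetical.

```python
from fractions import Fraction
from math import comb, sqrt

# Parametric estimation: one random sample is used to estimate a population value
n, republicans = 150, 81
p_hat = republicans / n                        # the 54% cited in the text
std_error = sqrt(p_hat * (1 - p_hat) / n)      # assumes Gaussian sampling theory
interval = (p_hat - 1.96 * std_error, p_hat + 1.96 * std_error)


# Probabilistic contrast: two observed groups are compared directly
def exact_two_sided_p(a, n1, b, n2):
    """Exact two-sided P from all arrangements of a + b successes in two groups."""
    k = a + b
    total = comb(n1 + n2, k)
    probs = {x: Fraction(comb(n1, x) * comb(n2, k - x), total)
             for x in range(max(0, k - n2), min(k, n1) + 1)}
    return float(sum(p for p in probs.values() if p <= probs[a]))


# 44% of 150 epidemiologists is 66; 56% of 150 molecular biologists is 84
p_contrast = exact_two_sided_p(66, 150, 84, 150)

print(round(p_hat, 2), round(std_error, 4), interval, p_contrast)
```

The estimation needs assumptions about an unseen parent population, whereas the contrast uses only the 300 observed people and their possible rearrangements.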
The confusion among scientists is widespread and can readily be seen from the frequency with which indirect parametric techniques are applied to contrast the results of groups that were not selected as random samples and that therefore permit no parametric estimations. The confusion among statisticians is also widespread and can readily
not samples, the
entered
because
research
the
venience and availability
to
it
this
term can be rejected either because
was provided
investigator. statisti-
the late G.
bv
Snedecor, one of America's leading
\V.
con-
their
of
the
Perhaps the most dramatic example oi cians' contusion
statisti-
cians and the principal instructor for a genera-
contemporary
tion ot
minds
the
ot
statistical consultants.
scientific
ments are intended
investigators,
to allow
sults that will provide valid
answers
questions. According to Snedecor 18
sample of observations which
is
to ,
to
In
re-
research
however, produce
a
will furnish esti-
mates of the parameters of the population
to-
gether with measures of the uncertainty of these
e.
Estimating a mean and
its
ror. In the rare circumstances in cal or
standard erwhich a clini-
epidemiologic investigator has obtained a
random sample
for estimating
parameter such as cal
a
mean,
a
kind
might be confused with the other
it
standard deviation, calculated for indi-
o\'
means of
vidual single samples, not for the
has been hanging around, awaiting a call to ac-
To avoid
tive duty.
away,
letting this old soldier fade
standard
the
deviation
of
sample,
a standard deviation.
standard error of
mean (singular). The philologic restoration
of error and the
the
from
transition
plural
the
mathematical you, as a
faith.
is
It
singular of
the
to
mean was accompanied by some
basic acts of
have only one sample and
vou assume
that
With
this faith,
such a process took place and
The conclusions, which represent tenets
of
results.
its
the funda-
inference
statistical
for
parametric estimation, can be stated as follows:
As
1.
the theoretical sampling process con-
tinues over and over, the
mean of
mean. Lacking the about
samples
reality of repeated
to tell us this true value,
we must
take a guess
There are convincing mathematical
it.
show
proofs to
that the best guess, i.e.. the best
estimate, for the populational
mean of our 2.
means
the
approach the true value of the populational
mean
be the
will
single available sample.
Although the mean of
that single
sample
provides the best estimate of the populational
mean, the variance of the single sample, calculated as S xx /n,
populational
each sample,
proof that
necessary, back into the original
you sam-
that a repetitive
pling process never occurred.
mental
enables
this faith that
realistic scientist, to forget that
peated this process over and over, restoring if
means
the
(plural) will be christened the
will
we can calculate a From this information, how likely are we to be right or wrong in estimating the true mean of the parent population? To answer this question, we begin a long chain of abstract reasoning. Suppose we drew another sample and calculated its mean and standard deviation. Now suppose we obtained yet another sample and found its mean and standard deviation. Now suppose we reIn the available
a
series oi samples. Besides, our old friend error
populational
the indirect statisti-
reasoning would go as follows.
mean and
because
you can then draw conclusions about
estimates"
too
is
it
clear to be used in statistical nomenclature or
experi-
comparisons oi
"the purpose of an experiment
means, but
call
group having
ol the
the standard deviation of the
is
use the word samples for groups that were
members
tion of the
be seen from
the frequency with which statisticians improperly
The value for the standard deviameans would need a name. We could
those means.
frame (or population) before the next sample
can
was drawn.
variance
is
not the best estimator of the
is
With
variance.
demonstrated
be
that
mathematical
a
too complex to be
shown
here,
it
populational
the
|
As
the process of repeated sampling con-
tinued,
we would
obtain a series of means, one
for each sample. Let us set
of
now concentrate on
that
sample means and think of them as
though
were the individual values
lection ot
of the set
i
01
.
In fact, ins
let
is
best estimated as S xx /(n
aspect of statistical inference so
many
statistical
s
= V^(Xi -
[This
the reason that
textbooks and programmed
2
x) /(n
the
more
mean
the
denominator.
and the standard deviation of
1).
calculators determine the standard deviation as
in a col-
us calculate the
is
-
planation
intuitively
—
that
n
-
"logical" value of n
The commonly
-
1
I
1), rather than using!
is
the
cited
"degrees
in
ex-
of
<
Problems
freedom"
data
the
in
—
statistogenic confusion.
The
doesn't explain. n
-
is
It is
real
in the
another source of
now invoke
butions. According to this principle,
reason
the
the
that
is
popu-
offers a better estimate of the value in the
With another mathematical proof
3.
be spared here,
will
it
mean
the old principle of Gaussian distri-
mean
that
means found
more
that
you
Re-phrased
this
ner, this statement says there
in a
value for the standard deviation in
that the true populational
sample can be used to estimate the
x
calculated
s
error
is
means
as-
those hypothetical samples. With
in all
as
VS xx /(n -
1),
standard
the
of the
where
,
The appearance of the square root of
highly accurate estimations. In order to halve the standard error, we must quadruple the sample size. For example, let us consider the standard error of the
a random sample of 150
54
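As a rough numerical illustration of the reasoning above (a sketch, not the author's own computation), the Python fragment below contrasts the divisors n and n - 1 on a small made-up sample, estimates a standard error of the mean as s divided by the square root of n, and shows that quadrupling the sample size halves the standard error. The data values and function names are hypothetical, and the binomial formula sqrt(p(1 - p)/n), applied here to the 81-of-150 (54%) Connecticut proportion, is an assumption introduced for illustration rather than a formula quoted from the text.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Estimate the standard error of the mean from one sample as s / sqrt(n)."""
    n = len(values)
    s = statistics.stdev(values)  # stdev() divides the sum of squares by n - 1
    return s / math.sqrt(n)


def standard_error_of_proportion(p, n):
    """Binomial standard error sqrt(p(1 - p)/n); an assumed formula for this sketch."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical sample: mean 5.0, sum of squared deviations S_xx = 32.0
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

variance_n = statistics.pvariance(sample)   # S_xx / n
variance_n1 = statistics.variance(sample)   # S_xx / (n - 1)
se_mean = standard_error_of_mean(sample)

# 81 of 150 sampled voters corresponds to a proportion of 0.54
p = 81 / 150
se_150 = standard_error_of_proportion(p, 150)
se_600 = standard_error_of_proportion(p, 600)  # four times the sample size

print(f"variance with n: {variance_n:.3f}; with n - 1: {variance_n1:.3f}")
print(f"standard error of the mean: {se_mean:.3f}")
print(f"standard error at n=150: {se_150:.4f}; at n=600: {se_600:.4f}")
```

The n - 1 divisor always gives the larger of the two variance estimates, and the ratio of the two standard errors is 2 for any proportion, because only the square root of n changes.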
Consider the following array of survival rates for an ordinal age partition of patients with a particular disease: below age 31, 30%; age 31-45, 65%; age 46-60, 67%; above age 60, 29%. In this non-monotonic sequence of survival rates, the curve is trapezoidal (rising and falling), with survival rates worst in the two extreme age groups, and best in the middle groups. Since we had no advance beliefs about the prognostic distinctions of advancing age (particularly below age 60), the absence of monotonicity is not surprising or contradictory to what had been expected.
The property of monotonicity can be determined by simple inspection of the array of target rates in the ordinally arranged strata. Alternatively, the values for adjacent target rates can be subtracted from one another. In a monotonic stratification, the increments will be all positive or all negative, and a reversal in sign will denote a non-monotonic partition. Thus, for the three stratifications cited in the first paragraph of this section, the increments of rate were, respectively: -15% and -25% for the monotonically decreasing partition; 4%, 5%, and 7% for the monotonically increasing partition; and -22%, +5%, and -39% for the one that was non-monotonic.
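The subtraction test just described can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from the text: the function names are hypothetical, the increments are the ones quoted above, and the four survival rates are those of the age partition given earlier in this section.

```python
def increments(rates):
    """Differences between adjacent target rates in ordinally arranged strata."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def is_monotonic(diffs):
    """True when the increments are all positive or all negative (no reversal of sign)."""
    if not diffs:
        return False
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)


# Increments (in percentage points) quoted in the text for the three stratifications
quoted_increments = {
    "monotonically decreasing": [-15, -25],
    "monotonically increasing": [4, 5, 7],
    "non-monotonic": [-22, 5, -39],
}
for label, diffs in quoted_increments.items():
    print(label, diffs, "monotonic" if is_monotonic(diffs) else "not monotonic")

# Survival rates (%) for the ordinal age partition: rising, then falling
age_rates = [30, 65, 67, 29]
age_increments = increments(age_rates)
print(age_increments, is_monotonic(age_increments))
```

A zero increment is treated here as breaking monotonicity, since the text requires the increments to be all positive or all negative; a more lenient rule would be a separate design choice.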
A test for monotonicity of gradient is particularly important when a metric variate, such as age, height, or blood pressure, is partitioned into dichotomous strata. With a dichotomous split, any substantial difference in rates will create a gradient, and the investigator may draw a spurious conclusion about monotonicity unless the gradient has been specifically checked in a polychotomous rather than dichotomous partition. For example, suppose the target rate in a population is 30/200 (15%). Suppose we now divide the population dichotomously according to height and find the rates of 10/40 (25%) for the shorter group and 20/160 (13%) for the taller group. We would then conclude that the target rate declines with an increase in height. If we had specifically checked for monotonicity of the gradient, however, we would have divided the population into at least three ordinal strata, and preferably more than three. Had we performed such a polychotomous partition we might have found the following results: short, 10/40 (25%); below medium, 8/120 (7%); above medium, 7/20 (35%); and tall, 5/20 (25%). Our idea about a falling gradient would have been erroneous. Therefore, to avoid misleading conclusions about the existence of gradients, a metric variate should always be checked for its monotonicity in a polychotomous partition before it is expressed in dichotomous form.

A striking example of unsatisfactory dichotomous partitions for metric variates is contained in the report 14 of the UGDP study of diabetes mellitus. The stratifications for "risk factors" in Table 8 (pages 802-803) of the UGDP report included seven metric variates (age, blood pressure, fasting blood glucose, cholesterol, visual acuity, relative body weight, and serum creatinine). All of these variates were split dichotomously, according to "cutting points" that were "arbitrarily selected." The target rates were listed for the two strata of each variate, but no data were presented to show whether the rates rose or fell monotonically in the manner expected of the biologic gradient for a "risk factor." In the absence of a test for monotonicity in these variates, the reader (and possibly the investigators) can have no idea of the true effect that would be discerned with a scientifically adequate polychotomous partition.

b. Total gradient. The total gradient in the target rate for a partition is the difference between the highest and lowest rates found in the individual strata. Assuming that all other numerical requirements have been fulfilled (for modicum size, statistical distinctiveness, and, when appropriate, monotonicity), we would prefer a partition with a large total gradient to one with a smaller gradient. If a partition contained a dichotomous split performed in search of single "risk factors," the two strata would be regarded as essentially trivial, with neither stratum demarcating a significant risk factor, unless the gradient between them was sufficiently high. The choice of a "sufficiently high" value for this gradient is arbitrary,
but a value of at least 10% seems reasonable.

The UGDP report 14 also provides an obvious example of the inappropriate application of the term "risk factors" to dichotomous partitions of strata that produced only minor or even trivial gradients in their target rates. Because the UGDP's Table 8 did not list the numbers of patients involved in the numerators or totals of the cited rates, the actual "risk" of the "risk factors" is not apparent in that table. Using the method of calculation described elsewhere, 8 I have determined the appropriate numbers and target rate percentages for the "selected baseline characteristics" reported by the UGDP. The results are shown here in Table I.

If we regard a gradient of 10% as indicative of a distinct risk factor, the data of Table I indicate that two of the strata labelled as "cardiovascular risk factors" by the UGDP (definite hypertension present and serum cholesterol > 300 mg./100 ml.) were essentially trivial in this population, being associated with cardiovascular death gradients of only 7.1% and 7.4%. In forming a cluster designated as one or more cardiovascular risk factors, the UGDP data analysts created a union of five strata: digitalis, angina pectoris, ECG abnormality, hypertension, and elevated cholesterol. The results of this cluster were also trivial, producing a cardiovascular death gradient of only 9.0%. Omitted from this cluster were two strata that are shown in Table I to be more substantial cardiovascular risk factors: arterial calcification (gradient 14.2%) and serum creatinine > 1.5 mg./100 ml. (gradient 10.3%). Another omission from the UGDP's combination of "cardiovascular risk factors" was Age > 55, which also had a higher cardiovascular risk gradient (9.5%) than the selected cluster.

With one exception, all of the strata that have just been described as producing significant or trivial gradients for rates of cardiovascular death produced correspondingly significant or trivial gradients when the target event was total deaths. The exception was visual acuity < 20/200, which had a gradient of 4.9% for cardiovascular deaths, but 11.4% for total deaths. The latter gradient for the relationship of visual acuity and total deaths was higher than the corresponding gradient for hypertension, elevated cholesterol, and the UGDP's cardiovascular cluster.

In a subsequent report, 4 the UGDP group extended its cardiovascular risk cluster to a union of eight rather than five "risk factors." This union included the previous five strata and three additional ones: age > 55 years, relative body weight > 1.25, and arterial calcification. The first and third of these "risk factors" were reasonable additions to the cluster since they were associated with respective gradients, as shown in Table I, of 9.5% and 14.2%. The body weight "risk factor," however, was bizarre. Its gradient was only 3.9% and besides, the rate of cardiovascular deaths, as well as total deaths, was lower in the stratum with the higher body weight. The UGDP's augmented, carefully selected cluster of eight "baseline risk factors" for cardiovascular death thus included four strata that were significant risk factors (digitalis, angina pectoris, significant ECG abnormality, and arterial calcification), one stratum that was borderline significant (age > 55), two strata that were trivial (hypertension and elevated cholesterol), and one stratum (relative body weight > 1.25) that was actually beneficial rather than detrimental in this population.

c. Isometry of clusters. As noted in our previous discussion, 10 the strata combined in a multivariate cluster should have essentially similar target rates. Without this isometry, a cluster would be an indiscriminate conglomerate of heterogeneous groups, rather than a scientifically meaningful aggregation. An excellent example of scientific attention to isometry in clusters is provided in the staging system for breast cancer developed by Cutler and Myers. 5, 6 These authors stratified patients according to a large number of diverse risk factors, and then formed "stages" by clustering the groups that had similar survival rates.

An excellent example of the neglect of this scientific principle is provided in the "cardiovascular risk cluster" formed in the UGDP reports. 14 The original UGDP cluster of five strata contained an admixture of factors with cardiovascular death rates that can be noted in Table I to range from 33.3% (ECG abnormality) to 12.2% (hypertension). In the augmented cluster of eight strata, the corresponding death rates range from 33.3% to 5.7% (relative body weight > 1.25). The indiscriminate manner in which the "risk factors" were originally selected was magnified when the cardiovascular risk cluster was later stratified, in a subsequent UGDP report, 4 for patients who had 0, 1, 2, 3, 4, 5, or 6 of the cited eight "risk factors." With this type of stratification (based neither on biologic concordance nor on target rate isometry) a patient with the single negative risk-factor of elevated body weight would be placed

*The values for arterial calcification have been listed according to the correction later reported by the UGDP group. 1
The
analysis of multiple variables
Table I. Stratification of "risk factors" in the UGDP report
Total deaths
No. of Variate
Partition
<
Age
Pts.
Number and
Number and rate
Gradient
rate
%
Gradient
55 55
449 374
22 (4.9%) 67 (17.9%)
13.0%
14 (3.1%) 47 (12.6%)
9.5%
Sex
Male Female
229 594
38 (16.6%) 51 (8.6%)
8.0%
25 (10.9%) 36 (6.1%)
4.8%
Race
White Nonwhite
435 388
59 (13.6%)
5.9%
41
(9.4%) (5.2%)
4.2%
30
Absent
552 254
48 (8.7%) 38 (15.0%)
28 (5.1%) 31 (12.2%)
7.1%
762 46
69
785 47
74
Absent
777
74
Present
33
13 (39.4%)
697 108
71 (10.2%)
> 300 None One or more
411 361
27
(6.6%) 55 (15.2%)
8.6%
13
(3.2%)
9.0%
<
272 547
21
(7.7%) 67 (12.2%)
4.5%
13 (4.8%) 48 (8.8%) 44 (12.2%)
4.0%
>
110 110
< >
1.25
365 458
52 (14.0%) 37 (8.1%)
5.9%
1.25
20/200 20/200
725 41
76 (10.5%) 9 (21.9%)
<
1.5
781
>
1.5
18
>
1).
Ii\
finite
pertension
I'n sent
History of digitalis use
No Yes
No
History of angina pectoris Significant
Y,
ECG
abnormality
Cluster of
CV
risk
factors
Fasting blood glucose
Relative
body weight
>
Visual acuity
< Serum
s
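The gradients discussed in the text can be recomputed from the counts, as in the following Python sketch. It is an illustration only: the function names are hypothetical, the 10% cut-off is the arbitrary value proposed in the text, the height counts are the hypothetical ones from the height example, and the age counts are the total-death figures listed in Table I (22/449 and 67/374).

```python
def rate(events, total):
    """Target rate of a stratum, expressed as a percentage."""
    return 100.0 * events / total


def total_gradient(strata):
    """Difference between the highest and lowest target rates among the strata."""
    rates = [rate(events, total) for events, total in strata]
    return max(rates) - min(rates)


def is_distinct_risk_factor(strata, threshold=10.0):
    """Apply the arbitrary 'sufficiently high' gradient of at least 10%."""
    return total_gradient(strata) >= threshold


# Hypothetical height example: (events, patients) per stratum
height_dichotomous = [(10, 40), (20, 160)]
height_polychotomous = [(10, 40), (8, 120), (7, 20), (5, 20)]

# Age strata for total deaths in Table I: below 55 and above 55
age_total_deaths = [(22, 449), (67, 374)]

print(f"height, two strata: gradient {total_gradient(height_dichotomous):.1f}%")
print(f"height, four strata: gradient {total_gradient(height_polychotomous):.1f}%")
print(f"age, total deaths: gradient {total_gradient(age_total_deaths):.1f}%",
      "distinct" if is_distinct_risk_factor(age_total_deaths) else "trivial")
```

The four-stratum partition uses the same 30 events among 200 people as the two-stratum split, yet its rates fall and then rise, which is why a gradient taken from a dichotomy alone says nothing about monotonicity.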
nostic
co-morbidity
disease
.4). In addition, in the bivariate stratification of co-morbid G.I. disease vs. symptom stage, the total gradient for co-morbid G.I. disease becomes statistically indistinct (P > .35) when tested within the category indolent, which is Column 1 of that table. From these results, we would suspect that the most biologically effective of the three bivariate stratifications is the one containing prognostic co-morbidity vs. symptom stage. This suspicion is supported by
Additional tactics
in prognostic stratification
Table II. Chi-square results for stratifications shown in Table I

          Co-morbid G.I. disease     Prognostic co-morbidity    Prognostic co-morbidity
          vs. symptom stage          vs. symptom stage          vs. co-morbid G.I. disease
          χ² value   d.f.   P        χ² value   d.f.   P        χ² value   d.f.   P
Total
27.53
8